Aws-cli: AWS S3 sync does not sync all the files

Created on 18 Apr 2018  ·  44 Comments  ·  Source: aws/aws-cli

We have several hundred thousand files, and S3 sync generally works reliably. However, we have noticed several files that were changed about a year ago; they differ between source and destination but never sync or update.

Both source and destination timestamps are also different but the sync never happens. S3 has the more recent file.

The command is as follows:
aws s3 sync s3://source /local-folder --delete

All the files that do not sync have the same date but are spread across multiple different folders.

Is there an S3 touch command to change the timestamp and possibly get the files to sync again?
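There is no built-in "touch" in the aws s3 CLI, but one way to approximate it, assuming the bucket allows in-place copies, is to copy each object onto itself while replacing its metadata, which refreshes the object's LastModified date. A rough sketch, with a placeholder bucket and prefix:

# "touch" every object under a prefix by copying it onto itself;
# --metadata-directive REPLACE should make the in-place copy acceptable to S3
aws s3 cp s3://my-bucket/some/prefix/ s3://my-bucket/some/prefix/ \
    --recursive \
    --metadata-directive REPLACE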

feature-request s3 s3sync s3syncstrategy

Most helpful comment

I can't believe this ticket was not closed some time ago. As far as I can tell, it works as designed, but users (including me) make assumptions about how it should work and are then surprised when it doesn't behave how they expected.

  • When a file is synced or copied _to_ s3, the timestamp it receives on the bucket is the date it was copied, which is _always_ newer than the date of the source file. This is just how s3 works.
  • Files are only synced if the size changes, or the timestamp on the target is _older_ than the source.
  • This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.
  • Using --exact-timestamps _only_ works when copying from s3 to local. It is deliberately not enabled for local to s3 because the timestamps are _never_ equal. So setting it when syncing from local to s3 has no effect.
  • I don't think s3 calculates hashes for uploaded files, so there's no way of avoiding file size and last uploaded date as checks.

Bottom line is that it works as intended, but there are various use cases where this is not desirable. As mentioned above I've worked around it using s3 cp --recursive

All 44 comments

You can possibly use --exact-timestamps to work around this, though that may result in excess uploads if you're uploading.

To help in reproducing, could you get me some information about one of the files that isn't syncing?

  • What is the exact file size locally?
  • What is the exact file size in S3?
  • What is the last modified time locally?
  • What is the last modified time in S3?
  • Is the local file a symlink / behind a symlink?

Example command run
aws s3 sync s3://bucket/ /var/www/folder/ --delete

Several files are missing:
Exact local size: 2625
Exact S3 size: 2625
Exact timestamp local: 06-Jan-2017 9:32:31
Exact timestamp S3: 20-Jun-2017 10:14:57
Normal file in S3 and local

There are several cases like that in a list of around 50,000 files. However, all the files missing from the sync have various timestamps on 20 Jun 2017.

Using --exact-timestamps shows many more files to download even though their contents are exactly the same. However, it still misses the ones in the example above.

Same issue here.
aws s3 sync dist/ s3://bucket --delete did not update s3://bucket/index.html with dist/index.html.

dist/index.html and s3://bucket/index.html have the same file size, but their modification times are different.

Actually, sometimes awscli did upload the file, but sometimes not.

Same here, --exact-timestamps doesn't help - index.html is not overwritten.

We experienced this issue as well today/last week. Again, index.html is the same file size, but the contents and modified times are different.

Is anybody aware of a workaround for this?

I just ran into this. Same problem as reported by @icymind and @samdammers: the contents of my (local) index.html file had changed, but its file size was the same as the earlier copy in S3. The aws s3 sync command didn't upload it. My "workaround" was to delete index.html from S3, and then run the sync again (which then uploaded it as if it were a new file, I guess).

Server: EC2 linux
Version: aws-cli/1.16.108 Python/2.7.15 Linux/4.9.62-21.56.amzn1.x86_64 botocore/1.12.98
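A minimal sketch of that delete-then-resync workaround (bucket and paths are placeholders; the removed object is simply re-uploaded as a new file on the next sync):

# force re-upload of a file sync keeps skipping
aws s3 rm s3://my-bucket/index.html
aws s3 sync ./dist s3://my-bucket/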


After running aws s3 sync over 270 TB of data I lost a few GB of files. Sync didn't copy files with special characters at all.

Example file: /data/company/storage/projects/1013815/3.Company Estimates/B. Estimates

Had to use cp -R -n

Same issue here: an xml file of the same size but a different timestamp is not synced correctly.

I was able to reproduce this issue

bug.tar.gz
Download the attached tar file and then:

tar -zxvf bug.tar.gz
aws s3 sync a/ s3://<some-bucket-name>/<some_dir>/ --delete
aws s3 sync b/ s3://<some-bucket-name>/<some_dir>/ --delete

You'll see that even though the repomd.xml files in directories a and b differ in contents and timestamps, attempting to sync b doesn't do anything.

Tested on:
aws-cli/1.16.88 Python/2.7.15 Darwin/16.7.0 botocore/1.12.78
aws-cli/1.16.109 Python/2.7.5 Linux/3.10.0-693.17.1.el7.x86_64 botocore/1.12.99

I'm seeing the same issue. Trying to sync a directory of files from S3 (where one file was updated) to a local directory, that file does not get updated in the local directory.

I'm seeing this too. In my case it's a react app with index.html that refers to generated .js files. I'm syncing them with the --delete option to delete old files which are no longer referred to. The index.html is sometimes not uploaded, resulting in an old index.html which points to .js files which no longer exist.

Hence my website stops working!!!

I'm currently clueless as to why this is happening.

Does anyone have any ideas or workarounds?

We have the same problem, but just found a workaround. I know, it is not the best way, but it works:

aws s3 cp s3://SRC s3://DEST ...
aws s3 sync s3://SRC s3://DEST ... --delete

It seems to us that the copy works fine, so first we copy, and after that we use the sync command to delete files which are no longer present.
Hope that the issue will be fixed asap.

I added --exact-timestamps to my pipeline and problem hasn't recurred. But, it was intermittent in the first place so I can't be sure it fixed it. If it happens again I'll go with @marns93 's suggestion.

We've met this problem and --exact-timestamps resolves our issue. I'm not sure if it's exactly the same problem.

I'm seeing this issue, and it's very obvious because each call only has to copy a handful (under a dozen) files.

The situation in which it happens is just like reported above: if the folder being synced into contains a file with different file contents but identical file size, sync will skip copying the new updated file from S3.

We ended up changing scripts to aws s3 cp --recursive to fix it, but this is a nasty bug -- for the longest time we thought we had some kind of race condition in our own application, not realizing that aws-cli was simply choosing not to copy the updated file(s).

I saw this as well with an HTML file.

aws-cli/1.16.168 Python/3.6.0 Windows/2012ServerR2 botocore/1.12.158

I copy-pasted the s3 sync command from a GitHub gist and it had --size-only set on it. Removing that fixed the problem!

Just ran into this issue with build artifacts being uploaded to a bucket. Our HTML tended to change only the hash codes for asset links, so the size was always the same. S3 sync was skipping these if the build ran too soon after a previous one. Example:

10:01 - Build 1 runs
10:05 - Build 2 runs
10:06 - Build 1 is uploaded to s3
10:10 - Build 2 is uploaded to s3

Build 2 has HTML files with a timestamp of 10:05, however the HTML files uploaded to s3 by build 1 have a timestamp of 10:06 as that's when the objects were created. This results in them being ignored by s3 sync as remote files are "newer" than local files.

I'm now using s3 cp --recursive followed by s3 sync --delete, as suggested earlier.

Hope this might be helpful to someone.
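For reference, a rough sketch of that two-step deploy (bucket and paths are placeholders): the recursive cp copies every file unconditionally, and the follow-up sync --delete only prunes remote files that no longer exist locally.

# upload everything regardless of size/timestamp comparisons
aws s3 cp ./build s3://my-bucket/ --recursive
# then remove remote files that were deleted locally
aws s3 sync ./build s3://my-bucket/ --delete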

I had the same issue earlier this week; I was not using --size-only. Our index.html was different by a single character (. went to #), so the size was the same, but the timestamp on s3 was 40 minutes earlier than the timestamp of the new index.html. I deleted the index.html as a temporary workaround, but it's infeasible to double-check every deployment.

Same here: files with the same name but different timestamps and content are not synced from S3 to local, and --delete does not help.

We experience the same issue. An index.html with same size but newer timestamp is not copied.

This issue was reported over a year ago. Why is it not fixed?

Actually, it makes the sync command useless.

--exact-timestamps fixed the issue

I am also affected by this issue. I added --exact-timestamps and it seemed to fix the files I was looking at, though I have not done an exhaustive search. I have on the order of 100k files and 20 GB, a lot less than the others in here.

I have faced the same issue: aws s3 sync skips some files, even with different contents and different dates. The log shows those skipped files as synced, but they actually are not.
But when I run aws s3 sync again, those files get synced. Very weird!

I had this issue when building a site with Hugo and I finally figured it out. I use submodules for my Hugo theme and was not pulling them down on CI. This was causing warnings in Hugo but not failures.

# On local
                   | EN
-------------------+-----
  Pages            | 16
  Paginator pages  |  0
  Non-page files   |  0
  Static files     |  7
  Processed images |  0
  Aliases          |  7
  Sitemaps         |  1
  Cleaned          |  0

# On CI
                   | EN  
-------------------+-----
  Pages            |  7  
  Paginator pages  |  0  
  Non-page files   |  0  
  Static files     |  2  
  Processed images |  0  
  Aliases          |  0  
  Sitemaps         |  1  
  Cleaned          |  0  

Once I updated the submodules everything worked as expected.

We've also been affected by this issue, so much so that a platform went down for ~18 hours after a new vendor/autoload.php file didn't sync and was out of date with vendor/composer/autoload_real.php, so the whole app couldn't load.

This is a _very_ strange problem, and I can't believe the issue has been open for this long.

Why would a sync not use hashes instead of last modified? Makes 0 sense.

For future Googlers, a redacted error I was getting:

PHP message: PHP Fatal error:  Uncaught Error: Class 'ComposerAutoloaderInitXXXXXXXXXXXXX' not found in /xxx/xxx/vendor/autoload.php:7
Stack trace:
#0 /xxx/xxx/bootstrap/app.php(3): require_once()
#1 /xxx/xxx/public/index.php(14): require('/xxx/xxx...')
#2 {main}
  thrown in /xxx/xxx/vendor/autoload.php on line 7" while reading response header from upstream: ...
---

The same problem here: not all files are synced, and --exact-timestamps didn't help.

aws --version
aws-cli/1.18.22 Python/2.7.13 Linux/4.14.152-127.182.amzn2.x86_64 botocore/1.15.22

I cannot believe this ticket has been open so long... Same problem here; where is Amazon's customer obsession?

I can't believe this ticket was not closed some time ago. As far as I can tell, it works as designed, but users (including me) make assumptions about how it should work and are then surprised when it doesn't behave how they expected.

  • When a file is synced or copied _to_ s3, the timestamp it receives on the bucket is the date it was copied, which is _always_ newer than the date of the source file. This is just how s3 works.
  • Files are only synced if the size changes, or the timestamp on the target is _older_ than the source.
  • This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.
  • Using --exact-timestamps _only_ works when copying from s3 to local. It is deliberately not enabled for local to s3 because the timestamps are _never_ equal. So setting it when syncing from local to s3 has no effect.
  • I don't think s3 calculates hashes for uploaded files, so there's no way of avoiding file size and last uploaded date as checks.

Bottom line is that it works as intended, but there are various use cases where this is not desirable. As mentioned above I've worked around it using s3 cp --recursive
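To make the rule above concrete, here is a rough sketch of the local-to-s3 comparison for a single file, written out as shell commands. This is just the rule as described, not the CLI's actual implementation; bucket, key, and path are placeholders, and GNU stat/date are assumed.

BUCKET=my-bucket
KEY=index.html
LOCAL=./dist/index.html

local_size=$(stat -c %s "$LOCAL")
local_mtime=$(stat -c %Y "$LOCAL")
s3_size=$(aws s3api head-object --bucket "$BUCKET" --key "$KEY" \
            --query ContentLength --output text)
s3_mtime=$(date -d "$(aws s3api head-object --bucket "$BUCKET" --key "$KEY" \
            --query LastModified --output text)" +%s)

# upload only if sizes differ, or the target (s3) copy is older than the source (local).
# because the s3 timestamp is the time of the last upload, an edit that keeps the
# file size the same gets skipped.
if [ "$local_size" -ne "$s3_size" ] || [ "$s3_mtime" -lt "$local_mtime" ]; then
    echo "sync would upload $LOCAL"
else
    echo "sync would skip $LOCAL"
fi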

@jam13 thanks for the explanation, now it all makes sense in hindsight!

Nevertheless, I'd argue that it's currently poorly documented (I would have expected a fat red warning in the documentation stating that --exact-timestamps only works _from s3 to local_, and for the s3 cli to just bail out instead of silently ignoring the parameter), and that an optional hash-based comparison mode is necessary to implement a reliably working synchronisation mode.

Yes, the documentation isn't great, and silently ignoring options is very unhelpful. The absence of any management or even official comments on this ticket from AWS over the last 2 years also speaks volumes.

@jam13 I dug into the documentation and found out I need --exact-timestamps to circumvent some issues from s3 to local. Thanks!

@kyleknap @KaibaLopez @stealthycoin any update on this one?

I can't believe this ticket was not closed some time ago. As far as I can tell, it works as designed, but users (including me) make assumptions about how it should work and are then surprised when it doesn't behave how they expected.

* When a file is synced or copied _to_ s3, the timestamp it receives on the bucket is the date it was copied, which is _always_ newer than the date of the source file. This is just how s3 works.

* Files are only synced if the size changes, or the timestamp on the target is _older_ than the source.

* This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.

* Using `--exact-timestamps` _only_ works when copying from s3 to local. It is deliberately not enabled for local to s3 because the timestamps are _never_ equal. So setting it when syncing from local to s3 has no effect.

* I don't think s3 calculates hashes for uploaded files, so there's no way of avoiding file size and last uploaded date as checks.

Bottom line is that it works as intended, but there are various use cases where this is not desirable. As mentioned above I've worked around it using s3 cp --recursive

s3 does hash the objects, but not in an entirely knowable way if you are not the uploader, and stores this as the familiar ETag. The problem is that the ETag depends on the number of chunks and the chunk size used when the file was uploaded. If you're not the uploader, you probably don't know the chunk size (but can get the number of chunks from the ETag). I don't know why it is done this way.
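For anyone who wants to compare against ETags anyway, a rough sketch of the multipart ETag calculation from the teppen.io link. The 8 MB part size here is only an assumption, which is exactly the problem if you weren't the uploader; GNU coreutils and xxd are assumed, and this doesn't apply to SSE-KMS encrypted objects.

FILE=./some/local/file
PART_BYTES=$((8 * 1024 * 1024))        # assumed part size
SIZE=$(stat -c %s "$FILE")

if [ "$SIZE" -le "$PART_BYTES" ]; then
    # single-part upload: ETag is just the MD5 of the file
    md5sum "$FILE" | cut -d' ' -f1
else
    # multipart upload: MD5 each part, concatenate the binary digests,
    # MD5 that, and append "-<number of parts>"
    tmp=$(mktemp -d)
    split -b "$PART_BYTES" -d "$FILE" "$tmp/part."
    parts=$(ls "$tmp"/part.* | wc -l)
    for p in "$tmp"/part.*; do md5sum "$p" | cut -d' ' -f1; done \
        | xxd -r -p | md5sum | cut -d' ' -f1 | sed "s/\$/-$parts/"
    rm -rf "$tmp"
fi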

This is probably working as intended, but not working as it should. It should be trivial to check whether a file has changed.

It’s just a huge gotcha for people to unexpectedly experience out-of-sync data. There are 100 different workarounds which could save everyone here the time of reading this ticket, along with the time spent discovering this was an issue in their source code. Why can’t they do one of these?

...tom


Had the same issue. Solved it by changing the source bucket policy to:

 "Action": [
                "s3:*"
            ],

I had the problem with both cp --recursive and sync.
This solved it all. I had two actions which should have worked just fine, but didn't. Give it a try and let me know if it solved your problem.
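If you do go the policy route, it can be applied with s3api; a rough sketch (bucket name and principal ARN are placeholders, and "s3:*" is much broader than sync strictly needs: ListBucket, GetObject, PutObject and DeleteObject are usually enough):

cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/deploy-role"},
    "Action": ["s3:*"],
    "Resource": [
      "arn:aws:s3:::my-source-bucket",
      "arn:aws:s3:::my-source-bucket/*"
    ]
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-source-bucket --policy file://policy.json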

Chiming in here to say that I've also been having the issue with sync. The only reason I noticed was because I was sealing and verifying MHLs on both ends. sync wouldn't work, and I was missing about 60 GB out of 890 GB, trying to go through, folder by folder. Then I found this thread and tried cp --recursive and the data started flowing again. Will verify the MHL one last time once I get the rest of this data.

I wrote a script to reproduce the problem. I use:
aws-cli/1.18.34 Python/2.7.17 Darwin/19.4.0 botocore/1.13.50

If you execute the script, you will see that after the changed file is uploaded, the same change is never downloaded. This is the script:

#!/bin/bash
PROFILE=foobar #PUT YOUR PROFILE HERE
BUCKET=baz123  #PUT YOUR BUCKET HERE

mkdir -p test/local
mkdir -p test/s3

cat >test/s3/test.json <<EOF
{
  "__comment_logging": "set cookie expiration time of aws split, examples '+1 hour', '+5 days', '+100 days'",
  "splitCookieExpiration": "+3 hours"
}
EOF

#UPLOAD
aws --profile=$PROFILE s3 sync --delete test/s3 s3://$BUCKET/ 
#DOWNLOAD
aws --profile=$PROFILE s3 sync --delete s3://$BUCKET/ test/local


#CHANGE 
cat >test/s3/test.json <<EOF
{
  "__comment_logging": "set cookie expiration time of aws split, examples '+1 hour', '+5 days', '+100 days'",
  "splitCookieExpiration": "+2 hours"
}
EOF


#UPLOAD
aws --profile=$PROFILE s3 sync --delete test/s3 s3://$BUCKET/ 
#DOWNLOAD
aws --profile=$PROFILE s3 sync --delete s3://$BUCKET/ test/local

@htrappmann Please read @jam13's answer https://github.com/aws/aws-cli/issues/3273#issuecomment-602514439 before — it's not a bug, it's a feature!

Thanks for the hint @applerom, but I really cannot understand how @jam13 can declare it "works as designed". A sync tool should be designed to keep source and destination equal, and that is just not the case with this sync, which renders it useless for many applications.

Also, if the file size is unchanged but the source timestamp is newer, no sync takes place either, as in my example script.

That does look like it's doing the wrong thing, doesn't it?

I ran a couple of other tests to see what I actually needed to do to get the download to occur:

ls -l test/local/test.json
aws s3 sync --delete s3://$BUCKET/ test/local
touch -m -t 201901010000 test/local/test.json
ls -l test/local/test.json
aws s3 sync --delete s3://$BUCKET/ test/local
touch test/local/test.json
ls -l test/local/test.json
aws s3 sync --delete s3://$BUCKET/ test/local

When changing the file modification time to last year, the s3 sync still does not download the file, so it's not simply a timezone issue.

When changing the modification time to now (so the local file is newer than the remote), the s3 sync _does_ download the file!

I couldn't make sense of that, so I checked the docs, which state (when describing the --exact-timestamps option):

The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.

Using --exact-timestamps for download does work as expected (any difference in the timestamps results in a copy), but this default does seem backwards to me.

Maybe instead of saying "works as designed" I should have said "works as documented".
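For the record, the download-direction command with the flag looks like this (same placeholders as the commands above); with --exact-timestamps, any timestamp difference triggers a copy:

aws s3 sync s3://$BUCKET/ test/local --exact-timestamps --delete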

@jam13 Wow, that's so odd, and I thought it was just confusion in the documentation!
But if this is the new way of fixing bugs, by just explicitly putting them in the documentation...

@jam13

I'm not sure we can rule out a timezone issue.
Every day, when I make the first change in the S3 console and run aws s3 sync s3://$BUCKET ., it syncs. If I make another change to the file and then sync, it doesn't sync.
But it works the next day.

This makes me wonder again whether it could be because of the timezone.

So I checked a bit more on the touch -m command that you mentioned above.

touch -m -t 201901010000 test/local/test.json
When changing the file modification time to last year, the s3 sync still does not download the file, so it's not simply a timezone issue.

The touch command above only backdates the mtime. It does not (and cannot) backdate the ctime.
Does the S3 CLI maybe use the ctime?

$ touch file
$ stat -x file
  File: "file"
  Size: 0            FileType: Regular File
  ...
  ...
Access: Mon Jul 20 21:59:11 2020
Modify: Mon Jul 20 21:59:11 2020
Change: Mon Jul 20 21:59:11 2020

$ touch -m -t 201901010000 file
$ stat -x file
  File: "file"
  Size: 0            FileType: Regular File
  ...
  ...
Access: Mon Jul 20 21:59:11 2020
Modify: Tue Jan  1 00:00:00 2019
Change: Mon Jul 20 22:01:48 2020

I think file syncs should guarantee that the files locally and remotely are the same. I don't think I'm being unfair in saying that. I think aws s3 sync is more of an update than a sync. I'm now going to change every implementation of aws s3 sync to aws s3 cp --recursive.

Thanks @jam13 for the explanation at https://github.com/aws/aws-cli/issues/3273#issuecomment-602514439
