Aws-cli: aws s3 ls - find files by modified date?

Created on 21 Jan 2015  ·  87 Comments  ·  Source: aws/aws-cli

Hi,
We'd like to be able to search a bucket with many thousands (likely growing to hundreds of thousands) of objects and folders/prefixes to find objects that were recently added or updated. Executing aws s3 ls on the entire bucket several times a day and then sorting through the list seems inefficient. Is there a way to simply request a list of objects with a modified time <, >, = a certain timestamp?

Also, are we charged once for the aws s3 ls request, or once for each of the objects returned by the request?

New to GitHub; wish I knew enough to contribute actual code... appreciate the help.

Label: guidance

Most helpful comment

@jwieder This doesn't help users decrease the number of list calls to S3. Say that every day you store ~1000 news articles in a bucket. Then, on the client side, you want to get articles for the last 3 days by default (and more only if explicitly requested). Having to fetch a list of all the articles since the beginning of time, say 100k of them, takes time and accrues network costs (because a single list call will return only up to 1000 items). It would be much nicer to be able to say "Give me a list of items created/modified since 3 days ago".

All 87 comments

The S3 API does not support this, so the only way to do it using S3 alone is client-side sorting.

As far as S3 pricing goes, we use a ListObjects request, which returns 1000 objects at a time. So you will be charged one LIST request per 1000 objects when using aws s3 ls.

Another alternative is to store an auxiliary index outside of S3, e.g. DynamoDB. Let me know if you have any other questions.
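For anyone exploring the auxiliary-index route, here is a minimal sketch, assuming a hypothetical DynamoDB table named s3-object-index with partition key upload_date and sort key object_key, populated by something like an S3 event notification:

# Hypothetical table: partition key "upload_date" (S), sort key "object_key" (S).
# One query returns every object recorded for a given day, with no LIST calls to S3.
aws dynamodb query \
    --table-name s3-object-index \
    --key-condition-expression "upload_date = :d" \
    --expression-attribute-values '{":d": {"S": "2015-01-21"}}' \
    --query 'Items[].object_key.S' \
    --output text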

Thank you

Although this functionality appears to remain absent from aws-cli, it's pretty easy to script in bash. For example:

#!/bin/bash
# aws s3 ls prints the LastModified date first on each line, so grepping
# for a date string finds objects modified on that day
DATE=$(date +%Y-%m-%d)
aws s3 ls s3://bucket.example.com/somefolder/ | grep "${DATE}"

@jwieder This doesn't help users decrease the number of list calls to S3. Say that every day you store ~1000 news articles in a bucket. Then, on the client side, you want to get articles for the last 3 days by default (and more only if explicitly requested). Having to fetch a list of all the articles since the beginning of time, say 100k of them, takes time and accrues network costs (because a single list call will return only up to 1000 items). It would be much nicer to be able to say "Give me a list of items created/modified since 3 days ago".

Exactly!


@PuchatekwSzortach @ChrisSLT You're right, sorry for my lame reply; I agree this sort of functionality would be very helpful in aws-cli. The combination of leaving this basic feature out and billing for file listings is highly suspect. Until AWS stops penny-pinching and introduces listing by file properties, here's another idea I've used that is more relevant to this thread than my first reply: files that need to be tracked this way are named with a timestamp, and a list of those filenames is stored in a local text file (or a database, if you have gazillions of files to worry about). Searching for a date then involves opening the index and looking for filenames that match today's date, which could look something like this:

# download any file whose name matches the target date
while read -r fileName
do
    if [ "$fileName" == "$TODAY" ]; then
        aws s3 sync "$BUCKETURL" /some/local/directory --exclude "*" --include "$fileName"
    fi
done < "$FILE"

Where $FILE is your local filename index and $TODAY is the date you are searching for. You'll need to change the condition on this loop, but hopefully this can give you an idea.

Doing things this way relieves you of any charges related to listing the files in your bucket, but it also requires that the client conducting the search has access to the local file list; depending on your application or system architecture, that might make this approach infeasible. Anyway, hope this helps, and apologies again for my earlier derpy reply.
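A minimal sketch of building that local index in the first place (one recursive listing, kept up to date as you upload); the bucket is the example value from earlier:

# build the index once; aws s3 ls prints "date time size key", so keep field 4
# (keys containing spaces would need more careful parsing)
aws s3 ls s3://bucket.example.com/ --recursive | awk '{print $4}' > "$FILE"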

Agreed and thank you


There is a way to do this with the s3api and the --query option. This is tested on OS X:
aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'
You can then filter using jq or grep and do further processing with the other s3api functions.

Edit: the backticks were not showing up in the original comment; you have to surround the date you are querying with backticks.
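For the jq half of that suggestion, a sketch (still client-side; the CLI downloads the full listing before jq sees it, and ISO-8601 timestamps compare correctly as plain strings):

aws s3api list-objects --bucket "bucket-name" --output json \
    | jq -r '.Contents[] | select(.LastModified >= "2016-05-20") | .Key'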

Is it possible for you to create folders for each day? That way, you will be accessing only today's files, or at most yesterday's folder, to get the latest files.

Yes, although you may find it easier to simply use a date prefix for your keys (you cannot query a bucketname/foldername combination using the --bucket option). Using a date prefix will allow you to use the --prefix flag in the CLI and speed up your queries, as AWS recommends using numbers or hashes at the beginning of key names for increased response times.
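If you do adopt date-prefixed keys, list-objects-v2 can even skip older keys server-side via --start-after, since keys are returned in lexicographic order; the key layout here is an assumption:

# assumes keys like 2016-05-20/article-123.json; keys sort lexicographically,
# so everything before this marker is skipped on the server
aws s3api list-objects-v2 \
    --bucket "bucket-name" \
    --start-after "2016-05-17" \
    --query 'Contents[].Key' \
    --output text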

@willstruebing, your solution still does not reduce the number of S3 API calls, server-side query complexity, or amount of data sent over the wire. The --query parameter performs client-side JMESPath filtering only.

@kislyuk I agree completely that it does not address the efficiency issues. However, my intention was to answer the specific question:

Is there a way to simply request a list of objects with a modified time <, >, = a certain timestamp?

That basic question is how I ended up on this thread, and so I thought it reasonable to include an answer to it. The issue is labeled "aws s3 ls - find files by modified date?".

I would love to hear anyone's ideas on the efficiency parts of the question, as I don't have one myself and am still curious.

for i in $(s3cmd ls | awk '{print $3}'); do aws s3 ls "$i" --recursive; done >> s3-full.out

What is the default order in which AWS returns files? Does it return them alphabetically, by most recently modified, or what criteria does it use when you request your first batch of 1000 file names?

I agree that there certainly should be some kind of filter (sort by date, by name, etc.) that you can use when you request files... definitely a missing feature. :(

I agree this filtering should be server side and is a basic need.

+1 for server side querying/filtering

+1 for server side filtering

Still very needed indeed, +1

Agreed with @chescales and the rest, +1 to server side filtering

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

How is this not a feature already?

+100000

+1e999

+1

+1

+1

+1

+1

+1

+1

+1

+65535

@willstruebing's comment worked for me, e.g.:

aws s3api list-objects --bucket "mybucket" --prefix "some/prefix" --query 'Contents[?LastModified>=`2018-08-22`].{Key: Key}'

Oh, never mind. I see after watching the network traffic from this command that all the keys are still being downloaded from S3, and the AWS CLI is doing the filtering client-side!

+1

+1

+1

+1

What about the --exclude and --include filters?

#!/bin/bash
DATE=$(date +%Y-%m-%d)
aws s3 ls s3://bucket.example.com/somefolder/ --exclude "*" --include "${DATE}*"
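A caveat on the above: as far as I know, aws s3 ls does not accept --exclude/--include; those filters belong to cp, mv, rm, and sync. A dry-run sync can approximate the same filtered listing:

#!/bin/bash
# sync --dryrun prints what would be copied without transferring anything,
# which doubles as a listing of keys matching the include pattern
DATE=$(date +%Y-%m-%d)
aws s3 sync s3://bucket.example.com/somefolder/ /tmp/s3-listing \
    --exclude "*" --include "${DATE}*" --dryrun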

+1

+1

+1 million

+1

+∞

+∞+1

+1

+1

+1

++

+1

+1

+1

+1 :( :(

I think that is part of the pricing model of AWS: super-cheap storage, but you pay to access it. Good for large files, but it will ruin you if you want to query/manage millions of small files.

+1

I guess this is why they created Athena? Another way to bill while adding some bells and whistles?

+1

+1

+1

I have to list the S3 bucket objects that were modified between two dates, e.g. 2019-06-08 to 2019-06-11.

Any ideas, anyone?

aws s3api list-objects --bucket "BUCKET" --prefix "OPTIONAL" --query "Contents[?LastModified>='2019-06-08'][].{Key: Key, LastModified: LastModified}" and then use jq or your preferred tool to filter out anything after 2019-06-11.
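A sketch of the jq half of that suggestion, assuming ISO-8601 LastModified strings (which compare correctly as plain strings, so anything on or before 2019-06-11 sorts below "2019-06-12"):

aws s3api list-objects --bucket "BUCKET" --prefix "OPTIONAL" \
    --query "Contents[?LastModified>='2019-06-08'][].{Key: Key, LastModified: LastModified}" \
    | jq '[.[] | select(.LastModified < "2019-06-12")]'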

That doesn't eliminate API calls; those queries are client-side.


@dmead I agree completely. However, the functionality to do server-side filtering does not currently exist (I think that's why so many people end up on this particular post), so this is the only workaround I know of to complete the task at hand. Do you have a way to do it server-side, or is this just an observation about the proposed solution? I'd love to hear input on how to do it AND reduce the number of API calls.

If you have the time, I'd look into selecting on metadata in Athena. I haven't had the chance myself, but that seemed like a possible solution.
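A sketch of that direction: with S3 Inventory delivering reports and an Athena table defined over them (the table name s3_inventory and the results bucket here are hypothetical), the date filter finally runs server-side:

# "s3_inventory" is a hypothetical Athena table built from S3 Inventory reports
aws athena start-query-execution \
    --query-string "SELECT key, last_modified_date FROM s3_inventory WHERE last_modified_date >= timestamp '2019-06-08'" \
    --result-configuration "OutputLocation=s3://my-athena-results/"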


+24

Everyone upvoting this: filing it with the AWS CLI doesn't help. The AWS CLI is bound by S3. File with the S3 team rather than a tool's GitHub if you want it fixed :P

@mike-bailey OK, and how do I do that?

If it were me personally I'd file an AWS ticket so it gets to the service team. But I don't work for AWS. I just know commenting '+1' on this isn't going to be the change.

There is a way to do this with the s3api and the --query option. This is tested on OS X:
aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'
You can then filter using jq or grep and do further processing with the other s3api functions.

Edit: the backticks were not showing up in the original comment; you have to surround the date you are querying with backticks.

Make sure you have the latest version of awscli before trying this answer. I upgraded
awscli 1.11.47 -> 1.16.220
and it did the dreaded client-side filtering, but it worked.
+1 for server-side filtering.

+1

+1

Please read the thread, +1 doesn’t do anything

You can't do this easily but buried in these comments is the following tip:

 aws s3api list-objects --bucket "bucket-name" --query 'Contents[?LastModified>=`2016-05-20`][].{Key: Key}'

This is still client-side and will perform plenty of requests.

As noted before, though, it handles it client-side, so you could still potentially slam the bucket with calls.

Filtering should be server-side and is a basic need, I think.

Here is an example using aws s3 sync so only new files are downloaded. It combines the logs into one log file and strips the comments before saving the file. You can then use grep and similar tools to extract log data. In my case, I needed to count unique hits to a specific file. The code below was adapted from this link: https://shapeshed.com/aws-cloudfront-log/ The sed command works on Mac as well and is different from what is in the article. Hope this helps!

aws s3 sync s3://<YOUR_BUCKET> .
cat *.gz > combined.log.gz
gzip -d combined.log.gz
sed -i '' '/^#/ d' combined.log

# counts unique logs for px.gif hits
grep '/px.gif' combined.log | cut -f 1,8 | sort | uniq -c | sort -n -r

# above command will return something like below. The total count followed by the date and the file name.
17 2020-01-02 /px.gif
 9 2020-01-03 /px.gif

I know it's an old issue, but to leave an elegant solution here:

aws s3api list-objects --bucket "bucket-name" --output=text --query 'Contents[?LastModified >= `<DATE_YOU_WANT_TO_START>`].{Key: Key}'
