Zenodo: Gzipped files are compressed two times

Created on 18 Feb 2019  ·  5 comments  ·  Source: zenodo/zenodo

See for reference this dataset: https://zenodo.org/record/2539424

It looks like Zenodo is compressing gzipped files a second time, without notice, so they are effectively "double compressed" (!). When you download them, they should really be named:

eswiki.wikilink_graph.2006-03-01.csv.gz.gz
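One way to confirm the double compression directly is to count gzip layers: every gzip stream starts with the magic bytes 1f 8b. A minimal Python sketch (count_gzip_layers is a hypothetical helper, not part of the dataset):

import gzip

def count_gzip_layers(path):
    # Repeatedly decompress while the gzip magic bytes (1f 8b)
    # are present, counting how many layers were applied.
    with open(path, 'rb') as f:
        data = f.read()
    layers = 0
    while data[:2] == b'\x1f\x8b':
        data = gzip.decompress(data)
        layers += 1
    return layers

print(count_gzip_layers('eswiki.wikilink_graph.2006-03-01.csv.gz.gz'))  # prints 2 for the downloaded file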

The super confusing thing is that the displayed MD5 refers to the original file (compressed once):

eswiki.wikilink_graph.2006-03-01.csv.gz

md5:2036a75ed53acdcc81f53061057fe343

So the displayed MD5 does not match the file you actually download; it matches the original file (compressed only once).

Here is what I get (I save the downloaded file as .gz.gz):

$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz.gz
b32c3b896be22c32bb17b9fe887bda51  eswiki.wikilink_graph.2006-03-01.csv.gz.gz

$ gunzip eswiki.wikilink_graph.2006-03-01.csv.gz.gz

$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
2036a75ed53acdcc81f53061057fe343  eswiki.wikilink_graph.2006-03-01.csv.gz

$ sha512sum eswiki.wikilink_graph.2006-03-01.csv.gz
56a48b0f82922fee20c07b0a2480bca1872c5e4aa8521a86e178dc00aea5dd2e4cda4ae05eb1b1950da7447dae70d47d8daa8d7a3c62d6989aaad09bd1fbcc71  eswiki.wikilink_graph.2006-03-01.csv.gz

$ zcat eswiki.wikilink_graph.2006-03-01.csv.gz | head -n10
page_id_from    page_title_from page_id_to  page_title_to
10  Argentina   3037    10 de enero
10  Argentina   3326    12 de octubre
10  Argentina   6301    13 de abril
10  Argentina   7874    1492
10  Argentina   6302    14 de abril
10  Argentina   14485   1502
10  Argentina   14471   1516
10  Argentina   14450   1536
10  Argentina   14434   1553

As you can see, the downloaded file does not match the given MD5, but after gunzipping it once the MD5 matches. The resulting file also matches the SHA512 sums that I provide separately in the same dataset, in the file eswiki.wikilink_graph.sha512sums.txt:

$ grep '2006-03-01' eswiki.wikilink_graph.sha512sums.txt
56a48b0f82922fee20c07b0a2480bca1872c5e4aa8521a86e178dc00aea5dd2e4cda4ae05eb1b1950da7447dae70d47d8daa8d7a3c62d6989aaad09bd1fbcc71  eswiki.wikilink_graph.2006-03-01.csv.gz
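For reference, the same verification can be scripted. A minimal Python sketch of what sha512sum -c does, assuming the standard "&lt;hash&gt;  &lt;filename&gt;" format of the sums file:

import hashlib

def sha512_of(path):
    # Stream the file in 1 MiB chunks so large dumps fit in memory.
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

with open('eswiki.wikilink_graph.sha512sums.txt') as sums:
    for line in sums:
        expected, filename = line.split(maxsplit=1)
        filename = filename.strip()
        status = 'OK' if sha512_of(filename) == expected else 'FAILED'
        print(filename + ': ' + status)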

It is very strange, because Zenodo is not compressing the plain-text files (the README and the hashsum files), and there is no notice anywhere of this behavior.


EDIT: In case you are wondering (since everyone I've told about this problem has asked): I am sure that the files I uploaded were compressed only once. I still have them on my disk, and they are compressed once.

All 5 comments

Thanks for reporting the issue.

I've reproduced the problem, which is related to the Accept-Encoding: gzip header that the browser is sending.

$ curl -H "Accept-Encoding: gzip" -o eswiki.wikilink_graph.2006-03-01.csv.gz https://zenodo.org/record/2539424/files/eswiki.wikilink_graph.2006-03-01.csv.gz?download=1
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
b32c3b896be22c32bb17b9fe887bda51  eswiki.wikilink_graph.2006-03-01.csv.gz

vs.

$ curl -o eswiki.wikilink_graph.2006-03-01.csv.gz https://zenodo.org/record/2539424/files/eswiki.wikilink_graph.2006-03-01.csv.gz?download=1
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
2036a75ed53acdcc81f53061057fe343  eswiki.wikilink_graph.2006-03-01.csv.gz

I'm looking further into this, but it's related either to our NGINX configuration or to the MIME type being delivered by our application server.
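For what it's worth, the same two cases can be reproduced with Python's requests library (a sketch, assuming requests is installed): requests sends Accept-Encoding: gzip by default and transparently decodes the Content-Encoding layer for .content, while .raw gives the bytes as sent on the wire.

import hashlib
import requests

URL = ('https://zenodo.org/record/2539424/files/'
       'eswiki.wikilink_graph.2006-03-01.csv.gz?download=1')

# Bytes as sent on the wire (no transparent decoding): the
# double-compressed payload, like curl -H "Accept-Encoding: gzip".
raw = requests.get(URL, stream=True).raw.read()
print('raw:    ', hashlib.md5(raw).hexdigest())  # b32c3b89... per the results above

# .content has the Content-Encoding: gzip layer decoded, which should
# yield the original, once-compressed file.
decoded = requests.get(URL).content
print('decoded:', hashlib.md5(decoded).hexdigest())  # 2036a75e... per the results above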

OK, so I think I've figured out what's going on. It's a combination of an application bug and a misconfiguration.

The MIME type is guessed by the application server similar to this:

>>> import mimetypes
>>> mimetypes.guess_type('eswiki.wikilink_graph.2006-03-01.csv.gz')
('text/csv', 'gzip')

However, it uses only the first part, text/csv (the bug). As a security measure (because we accept arbitrary user-uploaded files), the MIME type is sanitized before being sent to the browser, which converts text/csv to text/plain. NGINX then sees text/plain content, which it is configured to compress before sending to the client to save bandwidth (the misconfiguration).
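A minimal sketch of the application-side direction of the fix (hypothetical code, not Zenodo's actual implementation): use both parts of the guess_type result, and advertise an already-compressed file as an opaque gzip blob so a downstream proxy never compresses it again.

import mimetypes

def download_headers(filename):
    # guess_type returns a (type, encoding) pair; a non-None encoding
    # means the file is already compressed and must not be re-encoded.
    mimetype, encoding = mimetypes.guess_type(filename)
    if encoding == 'gzip':
        # Serve as an opaque gzip blob; NGINX's gzip filter (keyed on
        # Content-Type) then leaves the payload alone.
        return {'Content-Type': 'application/gzip'}
    return {'Content-Type': mimetype or 'application/octet-stream'}

print(download_headers('eswiki.wikilink_graph.2006-03-01.csv.gz'))
# {'Content-Type': 'application/gzip'}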

Great! Let me know if (and how) I can help.

Unfortunately a reconfiguration of NGINX didn't solve the problem, so it's going to take a bit longer to fix, as we also need to fix the bug in the application.
