See for reference this dataset: https://zenodo.org/record/2539424
It looks like Zenodo is compressing gzipped files a second time without any notice, so they end up "double compressed" (!). When you download them, they should really be named:
eswiki.wikilink_graph.2006-03-01.csv.gz.gz
The super confusing thing is that the MD5 displayed refers to the original file (compressed once):
eswiki.wikilink_graph.2006-03-01.csv.gz
md5:2036a75ed53acdcc81f53061057fe343
So it does not match the file you actually download, but it does match the original file (compressed only once).
Here is what I get (I saved the downloaded file as .gz.gz):
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz.gz
b32c3b896be22c32bb17b9fe887bda51 eswiki.wikilink_graph.2006-03-01.csv.gz.gz
$ gunzip eswiki.wikilink_graph.2006-03-01.csv.gz.gz
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
2036a75ed53acdcc81f53061057fe343 eswiki.wikilink_graph.2006-03-01.csv.gz
$ sha512sum eswiki.wikilink_graph.2006-03-01.csv.gz
56a48b0f82922fee20c07b0a2480bca1872c5e4aa8521a86e178dc00aea5dd2e4cda4ae05eb1b1950da7447dae70d47d8daa8d7a3c62d6989aaad09bd1fbcc71  eswiki.wikilink_graph.2006-03-01.csv.gz
$ zcat eswiki.wikilink_graph.2006-03-01.csv.gz | head -n10
page_id_from page_title_from page_id_to page_title_to
10 Argentina 3037 10 de enero
10 Argentina 3326 12 de octubre
10 Argentina 6301 13 de abril
10 Argentina 7874 1492
10 Argentina 6302 14 de abril
10 Argentina 14485 1502
10 Argentina 14471 1516
10 Argentina 14450 1536
10 Argentina 14434 1553
As you can see, the downloaded file does not match the given MD5, but if you decompress it once, the MD5 matches. The once-decompressed file also matches the SHA-512 sums that I am providing separately in the same dataset, in the file eswiki.wikilink_graph.sha512sums.txt:
$ grep '2006-03-01' eswiki.wikilink_graph.sha512sums.txt
56a48b0f82922fee20c07b0a2480bca1872c5e4aa8521a86e178dc00aea5dd2e4cda4ae05eb1b1950da7447dae70d47d8daa8d7a3c62d6989aaad09bd1fbcc71  eswiki.wikilink_graph.2006-03-01.csv.gz
It is very strange, because Zenodo is not compressing the plain-text files (the README and the hashsum files), and there is no notice anywhere about this behavior.
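For anyone hitting this before it is fixed: a double-compressed download can be repaired by stripping exactly one gzip layer. Here is a minimal Python sketch (the helper name and the in-place replacement strategy are my own, not part of the dataset); it detects the extra layer by checking whether the data still starts with the gzip magic bytes 1f 8b after one decompression:

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def strip_one_gzip_layer(path):
    """If the gzip file at `path` decompresses to *another* gzip stream,
    remove one compression layer in place. Returns True if a layer was removed."""
    with gzip.open(path, "rb") as f:
        if f.read(2) != GZIP_MAGIC:
            return False  # compressed only once; leave it alone
    # Decompress one layer into a temp file, then atomically replace the original.
    with gzip.open(path, "rb") as src, tempfile.NamedTemporaryFile(
        "wb", delete=False, dir=os.path.dirname(path) or "."
    ) as dst:
        shutil.copyfileobj(src, dst)
        tmp_name = dst.name
    os.replace(tmp_name, path)
    return True
```

Calling it a second time is a no-op, so it is safe to run on files that were downloaded correctly.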
EDIT: In case you are wondering (everyone I've told about this problem has asked), I am sure that the files I uploaded were compressed only once. I still have them on my disk, and they are compressed once.
Thanks for reporting the issue.
I've reproduced the problem, which is related to the Accept-Encoding: gzip header that the browser is sending.
$ curl -H "Accept-Encoding: gzip" -o eswiki.wikilink_graph.2006-03-01.csv.gz https://zenodo.org/record/2539424/files/eswiki.wikilink_graph.2006-03-01.csv.gz?download=1
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
b32c3b896be22c32bb17b9fe887bda51 eswiki.wikilink_graph.2006-03-01.csv.gz
vs.
$ curl -o eswiki.wikilink_graph.2006-03-01.csv.gz https://zenodo.org/record/2539424/files/eswiki.wikilink_graph.2006-03-01.csv.gz?download=1
$ md5sum eswiki.wikilink_graph.2006-03-01.csv.gz
2036a75ed53acdcc81f53061057fe343 eswiki.wikilink_graph.2006-03-01.csv.gz
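The effect can also be reproduced locally, without touching the server: gzipping an already-gzipped payload changes the checksum, and decompressing exactly once restores it. A small sketch (the CSV payload below is a made-up stand-in, not the real file):

```python
import gzip
import hashlib

# Stand-in for the uploaded file: a tiny gzipped CSV (made-up content).
original = gzip.compress(b"page_id_from\tpage_title_from\tpage_id_to\tpage_title_to\n")

# What the transparent response compression effectively does: gzip it again.
double_compressed = gzip.compress(original)

# The published MD5 matches `original`, not the double-compressed download...
assert hashlib.md5(double_compressed).hexdigest() != hashlib.md5(original).hexdigest()

# ...and decompressing exactly once gets the original bytes (and MD5) back.
assert gzip.decompress(double_compressed) == original

# The tell-tale sign of double compression: after one decompression the
# data still starts with the gzip magic bytes 0x1f 0x8b.
assert gzip.decompress(double_compressed)[:2] == b"\x1f\x8b"
```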
I'm looking further into this, but it's related either to our NGINX configuration or to the MIME type being delivered by our application server.
OK, so I think I've figured out what's going on. It's a combination of an application bug and a misconfiguration.
The MIME type is guessed by the application server similar to this:
>>> import mimetypes
>>> mimetypes.guess_type('eswiki.wikilink_graph.2006-03-01.csv.gz')
('text/csv', 'gzip')
However, it's only using the first part, text/csv (the bug). As a security measure (because we accept arbitrary user-uploaded files), the MIME type is sanitized before being sent to the browser, causing text/csv to be converted to text/plain. NGINX then sees a text/plain content type, which it is configured to compress before sending to the client to save bandwidth (the misconfiguration).
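One possible shape of the application-side fix is to take the encoding part returned by guess_type into account instead of discarding it. The function below is a hypothetical sketch, not our actual code, and the sanitization branch is a simplified stand-in for the real sanitization step:

```python
import mimetypes

def safe_mimetype(filename):
    """Hypothetical sketch: honor the encoding that guess_type reports."""
    mimetype, encoding = mimetypes.guess_type(filename)
    if encoding == "gzip":
        # The file is already compressed; advertise it as a gzip payload
        # so a proxy configured to gzip text/* types leaves it alone.
        return "application/gzip"
    if mimetype is None or mimetype == "text/csv":
        # Simplified stand-in for the sanitization step described above:
        # untrusted text types are downgraded to text/plain.
        return "text/plain"
    return mimetype
```

With this, eswiki.wikilink_graph.2006-03-01.csv.gz would be served as application/gzip, which NGINX's text-only compression rules would skip.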
Great! Let me know if (and how) I can help.
Unfortunately, a reconfiguration of NGINX didn't solve the problem, so it's going to take a bit longer to fix, as we also need to fix the bug in the application.