Use case
We're talking about large files here (minimum 100 MB).
Goal
Optimise performance of upload and/or download of data to/from Zenodo.
Upload
Move data to/from Zenodo as fast as possible, using 1) JavaScript (browser) and/or 2) Python (API).
_Client side_
JavaScript: chunk the file (many existing libraries support this - e.g. Plupload), then upload the chunks in parallel (haven't seen this anywhere yet - could use e.g. JavaScript web workers).
Python: same idea as JavaScript - chunk and parallelize, plus look into HTTP pipelining and the TCP layer as well.
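The Python side of the chunk-and-parallelize idea could look roughly like this. This is only a sketch: `upload_chunk`, the commented-out endpoint call, and the 50 MB chunk size are assumptions for illustration, not an existing Zenodo or invenio-files-rest API.

```python
# Sketch: split a file into fixed-size chunks and upload them in parallel.
# The upload endpoint is hypothetical; only the chunking logic is concrete.
import concurrent.futures

CHUNK_SIZE = 50 * 1024 * 1024  # 50 MB per chunk (assumed tuning parameter)

def chunk_ranges(total_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    return [(off, min(chunk_size, total_size - off))
            for off in range(0, total_size, chunk_size)]

def upload_chunk(path, offset, length):
    """Read one byte range and send it (endpoint call is assumed)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    # requests.put(UPLOAD_URL, data=data, ...)  # hypothetical endpoint
    return offset

def upload_parallel(path, total_size, workers=4):
    """Upload all chunks concurrently via a thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [pool.submit(upload_chunk, path, off, length)
                for off, length in chunk_ranges(total_size)]
        return [j.result() for j in jobs]
```

The thread pool keeps several HTTP transfers in flight at once, which is usually what hides per-request latency on high-bandwidth links; the server would still need to reassemble chunks in order.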
Issues to take into account:
_Server side_
Download
Example 1: 260GB dataset in 1000 files. How can a researcher download the files easily without having to click 1000 links.
Example 2: 260GB dataset in 4 files. What happens if network connectivity is lost (i.e. resumable downloads).
_Client side_
Perhaps possible to write a JavaScript app that can help download the 1000 files. Same app could help with resumable downloads. Again, Javascript web worker model could possibly be used to download the file in chunks.
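Until such an app exists, the "1000 files" case can be sketched server-agnostically in Python: fetch each file URL concurrently with a small worker pool. The URL shown in the test is just an example of Zenodo's file-link shape; nothing here is an official client.

```python
# Sketch: download many files from a record concurrently.
import concurrent.futures
import os
import urllib.parse
import urllib.request

def local_name(url):
    """Derive a local filename from a download URL."""
    return os.path.basename(urllib.parse.urlparse(url).path)

def download_one(url, dest_dir="."):
    """Fetch one file to dest_dir (no resume; see the range-request idea below)."""
    path = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, path)
    return path

def download_all(urls, workers=4):
    """Download every URL with a small thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_one, urls))
```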
_Server side_
Implement support for HTTP range requests. As with upload, the process model and concurrent downloads need consideration - e.g. slow clients taking a long time to download a file will tie up server connections.
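Once the server honors range requests, a resumable client is straightforward: if a partial file exists on disk, ask for the remaining bytes with a `Range` header and append. A minimal sketch, assuming the server replies `206 Partial Content`:

```python
# Sketch: resume an interrupted download using an HTTP Range request.
import os
import urllib.request

def resume_header(path):
    """Build a Range header resuming from the bytes already on disk."""
    start = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": "bytes=%d-" % start} if start else {}

def resume_download(url, path):
    """Fetch url into path, appending to any partial file already present."""
    req = urllib.request.Request(url, headers=resume_header(path))
    mode = "ab" if os.path.exists(path) else "wb"
    with urllib.request.urlopen(req) as resp, open(path, mode) as out:
        while True:
            block = resp.read(1 << 20)  # stream in 1 MB blocks
            if not block:
                break
            out.write(block)
```

A robust version would also verify that the server actually returned 206 (a server that ignores `Range` returns 200 with the whole body, which would corrupt the appended file).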
Out-of-scope (but should be discussed): Other methods to optimise file delivery like Content Delivery Networks.
Python part will primarily go into inveniosoftware/invenio-files-rest
JavaScript parts will be separate repositories.
Sorry to resurrect this old issue in a somewhat off-topic direction ... I would very much like for the Zenodo HTTP server to support HTTP range requests, which are mentioned in the original comment. As far as I can tell, they're not currently honored. Is this on the roadmap at the moment?
Is there a better way to download big files than using a common internet browser? Big files over http tend to fail within hours, and the download can't be resumed.
In my case, I'm trying to download a 50 GB dataset.
The download speed is around 500 Kbps and the connection fails at some point during the ~12 hours the download takes.
I've been trying to download it every day for months now (I need the dataset for my master's thesis).
Any suggestions?
I couldn't even download a 2.2 GB dataset after 5 tries; a download manager didn't help either.
@Vichoko, did you manage to solve it? If yes, how?
I too have been very frustrated trying to download a dataset that includes two large files (12 GB and 37 GB) for days and days.
I found https://zenodo.org/record/1261813 (https://gitlab.com/dvolgyes/zenodo_get) and it did help a lot. I managed to download the whole record on the first try.
At first sight I don't see anything magic about it, so I guess the trick must be in the internals of its wget-like Python implementation.
Link: https://zenodo.org/api/files/cb4ca1fa-1db1-40f9-8f39-0e9d3b2af7ae/musdb18hq.zip (size: 21607.1 MB)
Progress: 0% - 3,121,152 of 22,656,664,047 bytes
I could download 21GB files faster in 2006 with dial-up. Is Zenodo lacking CDN infrastructure? Why not use an S3 or GCS bucket?
I think this issue should be reopened, given that Zenodo exhibits abnormally slow and unstable downloads. Or is there another issue tracking Zenodo download performance?