Zenodo: Student project: Optimisation of large file download/upload performance over HTTP

Created on 12 Feb 2016  ·  6 Comments  ·  Source: zenodo/zenodo

Use case

  • A) A researcher wants to upload his/her 260GB research dataset to Zenodo, and has only his/her browser to do the job. Variation: the researcher knows a bit of Python and can write a script that uploads the file to Zenodo via the API.
  • B) A researcher wants to download a 260GB research dataset from Zenodo.

We're talking about large files here (minimum 100 MB).

Goal
Optimise performance of upload and/or download of data to/from Zenodo.

Upload
Move data to Zenodo as fast as possible, using 1) JavaScript (browser) and/or 2) Python (API).

_Client side_
JavaScript: chunk the file (many existing libraries support this, e.g. PLUpload) and upload the chunks in parallel (we haven't seen this anywhere yet; JavaScript web workers could be used).

Python: same idea as JavaScript (chunk and parallelise), plus look into HTTP pipelining and the TCP layer as well.
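As a rough illustration of the Python client idea, the sketch below chunks a file and uploads the chunks from a small thread pool. The endpoint, token and Content-Range convention are placeholders for illustration only, not the actual Zenodo/invenio-files-rest API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

CHUNK_SIZE = 64 * 1024 * 1024                     # 64 MB per chunk
UPLOAD_URL = "https://example.org/api/upload"     # placeholder endpoint, not Zenodo's
TOKEN = "..."                                     # placeholder access token


def upload_chunk(path, offset, total_size):
    """Read one chunk from disk and PUT it with a Content-Range header."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(CHUNK_SIZE)
    end = offset + len(data) - 1
    resp = requests.put(
        UPLOAD_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Range": f"bytes {offset}-{end}/{total_size}",
        },
    )
    resp.raise_for_status()
    return offset


def parallel_upload(path, workers=4):
    """Upload all chunks of `path` concurrently from a thread pool."""
    total_size = os.path.getsize(path)
    offsets = range(0, total_size, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for done in pool.map(lambda off: upload_chunk(path, off, total_size), offsets):
            print(f"chunk at offset {done} uploaded")


if __name__ == "__main__":
    parallel_upload("dataset.tar")
```

The number of workers is the knob to experiment with: more parallel chunks can hide per-request latency, but only up to whatever the server and the network path can sustain.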

Issues to take into account:

  • file integrity: how do we make sure that the file on the server is exactly what the user has? Checksum the file, but if you upload in parallel your checksumming approach has to cope with out-of-order chunks (see the sketch after this list).
  • latency
  • browser support for features being used.
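
On the integrity point, one possible approach (an illustration only, not Zenodo's actual scheme) is to compute both a whole-file digest and per-chunk digests in a single sequential pass before uploading:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # must match the upload chunk size


def file_checksums(path):
    """Return (whole-file MD5, list of per-chunk MD5s) in one pass over the file."""
    whole = hashlib.md5()
    chunks = []
    with open(path, "rb") as fh:
        while True:
            data = fh.read(CHUNK_SIZE)
            if not data:
                break
            whole.update(data)
            chunks.append(hashlib.md5(data).hexdigest())
    return whole.hexdigest(), chunks
```

The per-chunk digests let the client verify or retry individual chunks even though they were uploaded out of order, while the whole-file digest can be compared against the checksum reported for the finished file.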

_Server side_

  • Process model (blocking/non-blocking IO): we're using Gunicorn to run our Python processes, and its worker classes have different event models.
  • Performance analysis on the server side (e.g. can the server pipe the files straight to file storage, or does it need to keep each chunk in memory? See the sketch after this list).
  • How can we increase the number of concurrent connections?
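
To illustrate the memory question, here is a minimal sketch, assuming a plain Flask view and a local filesystem target (invenio-files-rest will look different), of streaming an incoming chunk straight to storage instead of buffering it:

```python
from flask import Flask, request

app = Flask(__name__)
READ_SIZE = 1024 * 1024  # copy the request body in 1 MB pieces


@app.route("/upload/<name>", methods=["PUT"])
def receive(name):
    # Stream the raw request body to disk so a whole chunk never has to sit
    # in the worker's memory at once. `name` is not sanitised here; a real
    # handler must validate it.
    with open(f"/tmp/{name}", "wb") as out:
        while True:
            piece = request.stream.read(READ_SIZE)
            if not piece:
                break
            out.write(piece)
    return "", 204
```

Whether this actually avoids buffering also depends on the WSGI stack: Gunicorn's default sync workers tie up one worker per in-flight request, while its asynchronous worker classes behave differently, which is exactly the process-model question above.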

Download
Example 1: 260GB dataset in 1000 files. How can a researcher download the files easily without having to click 1000 links?
Example 2: 260GB dataset in 4 files. What happens if network connectivity is lost (i.e. resumable downloads)?

_Client side_
It may be possible to write a JavaScript app that helps download the 1000 files. The same app could help with resumable downloads. Again, the JavaScript web worker model could be used to download a file in chunks.
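
For example 1, a small client can list a record's files via the REST API and fetch them one by one. The sketch below is in Python for brevity (the paragraph above imagines a JavaScript app) and assumes the record JSON exposes a files array with per-file download links; the exact field names may differ between API versions:

```python
import requests

RECORD_ID = "1261813"  # any public record id

record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}").json()
for entry in record.get("files", []):          # assumed field names
    name = entry["key"]
    url = entry["links"]["self"]
    print(f"downloading {name} ...")
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(name, "wb") as out:
            # stream to disk in 1 MB pieces instead of loading the file in memory
            for piece in resp.iter_content(chunk_size=1024 * 1024):
                out.write(piece)
```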

_Server side_
Implement support for HTTP range requests. The same concerns as for upload apply (process model, number of concurrent downloads): e.g. slow clients that take a long time to download a file can tie up connections.
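
Once range requests are honoured, a client can resume an interrupted download by requesting only the missing bytes. A minimal sketch with the Python requests library (URL and destination are placeholders):

```python
import os

import requests


def resume_download(url, dest):
    """Continue a partial download of `url` into `dest` using an HTTP Range request."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={have}-"} if have else {}
    with requests.get(url, headers=headers, stream=True) as resp:
        if have and resp.status_code != 206:
            # 206 Partial Content means the server honoured the Range header
            raise RuntimeError("server ignored the Range header; cannot resume")
        resp.raise_for_status()
        with open(dest, "ab") as out:
            for piece in resp.iter_content(chunk_size=1024 * 1024):
                out.write(piece)
```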

Out-of-scope (but should be discussed): Other methods to optimise file delivery like Content Delivery Networks.


The Python part will primarily go into inveniosoftware/invenio-files-rest.
The JavaScript parts will live in separate repositories.


All 6 comments

Sorry to resurrect this old issue in a somewhat off-topic direction ... I would very much like for the Zenodo HTTP server to support HTTP range requests, which are mentioned in the original comment. As far as I can tell, they're not currently honored. Is this on the roadmap at the moment?

Is there a better way to download big files than using a common internet browser? Big files over HTTP tend to fail within hours, and the download can't be resumed.

In my case, I'm trying to download a 50 GB dataset.
The download speed is around 500 Kbps and the connection fails at some point during the 12 hours that the download takes.

I've been trying to download it every day for months now (I need the dataset for my master's thesis).
Any suggestions?

I couldn't even download a 2.2 GB dataset after 5 tries; a download manager couldn't help either.

@Vichoko, did you manage to solve it? If yes, how?

I too have been very frustrated trying to download a dataset that includes two large files (12 GB and 37 GB) for days and days.

I found https://zenodo.org/record/1261813 (https://gitlab.com/dvolgyes/zenodo_get) and it did help a lot; I managed to download the whole record on the first try.
At first sight I don't see anything magic about it, so I guess the trick must be in some of the internals of its Python wget implementation.

Link: https://zenodo.org/api/files/cb4ca1fa-1db1-40f9-8f39-0e9d3b2af7ae/musdb18hq.zip   size: 21607.1 MB
  0% [                                             ]     3121152 / 22656664047

I could download 21GB files faster in 2006 with dial-up. Is Zenodo lacking CDN infrastructure? Why not use an S3 or GCS bucket?

I think this issue should be reopened, given that Zenodo exhibits abnormally slow and unstable downloads. Or is there another issue to track Zenodo downloads?
