werkzeug.formparser is really slow with large binary uploads

Created on 3 Mar 2016 · 25 Comments · Source: pallets/werkzeug

When I perform a multipart/form-data upload of any large binary file in Flask, those uploads are very easily CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.

A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in werkzeug.formparser.MultiPartParser.parse_parts(). The reason is that the method parse_lines() yields _a lot_ of very small chunks, sometimes even just single bytes:

# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
    yield _cont, buf
    buf = b''

So parse_parts() goes through a lot of small iterations (more than 2 million for a 100 MB file), processing single "lines" and always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead, slowing down the whole process and making it CPU bound very quickly.

A quick test shows that a speed-up is very easily possible by first collecting the data in a bytearray in parse_lines() and only yielding that data back into parse_parts() when self.buffer_size is exceeded. Something like this:

buf = b''
collect = bytearray()
for line in iterator:
    if not line:
        self.fail('unexpected end of stream')

    if line[:2] == b'--':
        terminator = line.rstrip()
        if terminator in (next_part, last_part):
            # yield remaining collected data
            if collect:
                yield _cont, collect
            break

    if transfer_encoding is not None:
        if transfer_encoding == 'base64':
            transfer_encoding = 'base64_codec'
        try:
            line = codecs.decode(line, transfer_encoding)
        except Exception:
            self.fail('could not decode transfer encoded chunk')

    # we have something in the buffer from the last iteration.
    # this is usually a newline delimiter.
    if buf:
        collect += buf
        buf = b''

    # If the line ends with windows CRLF we write everything except
    # the last two bytes.  In all other cases however we write
    # everything except the last byte.  If it was a newline, that's
    # fine, otherwise it does not matter because we will write it
    # the next iteration.  this ensures we do not write the
    # final newline into the stream.  That way we do not have to
    # truncate the stream.  However we do have to make sure that
    # if something else than a newline is in there we write it
    # out.
    if line[-2:] == b'\r\n':
        buf = b'\r\n'
        cutoff = -2
    else:
        buf = line[-1:]
        cutoff = -1

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        yield _cont, collect
        collect.clear()

This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine; that's almost a 4x increase in performance. All tests were done on Windows (64-bit Python 3.4); I'm not sure if it's as much of a problem on Linux.

It's still mostly CPU bound, so I'm sure there is even more potential for optimization. I think I'll look into it when I find a bit more time.

bug

Most helpful comment

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

All 25 comments

I also have the same problem: when I upload an ISO file (200 MB), the first call to request.form takes about 7 seconds.

Two things seem interesting for further optimization: experimenting with Cython, and experimenting with interpreting the content-size headers for smarter MIME message parsing

(no need to scan for lines if you know the content-length of a sub-message)

Just a quick note: if you stream the file directly in the request body (i.e. no multipart/form-data), you completely bypass the form parser and read the file directly from request.stream.
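
For illustration, a minimal Flask sketch of that approach (the endpoint name, target path and chunk size are mine, not from the thread):

from flask import Flask, request

app = Flask(__name__)

@app.route('/upload-raw', methods=['POST'])
def upload_raw():
    # The client sends the raw bytes (e.g. Content-Type: application/octet-stream),
    # so the form parser never runs and we can copy request.stream to disk
    # in large chunks.
    with open('/tmp/upload.bin', 'wb') as f:
        while True:
            chunk = request.stream.read(64 * 1024)
            if not chunk:
                break
            f.write(chunk)
    return 'ok'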

I have the same issue with slow upload speeds for multipart uploads when using jQuery-File-Upload's chunked upload method. When using small chunks (~10 MB), the transfer speed jumps between 0 and 12 MB/s even though the network and server are fully capable of speeds over 50 MB/s. The slowdown is caused by the CPU-bound multipart parsing, which takes about the same time as the actual upload. Sadly, using streaming uploads to bypass the multipart parsing is not really an option, as I must support iOS devices that can't do streaming in the background.

The patch provided by @sekrause looks nice but doesn't work in Python 2.7.

@carbn: I was able to get the patch to work in Python 2.7 by changing the last line to collect = bytearray(). This just creates a new bytearray instead of clearing the existing one.

@cuibonobo: That's the first thing I changed, but I still had another error. I can't check the working patch at the moment, but IIRC the yields had to be changed from yield _cont, collect to yield _cont, str(collect). This allowed the code to be tested, and the patch yielded about a 30% increase in multipart processing speed. It's a nice speedup, but the performance is still pretty bad.
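
For reference, a sketch of what the Python 2.7-compatible tail of the loop might look like, based on the two changes described above (not a tested patch; the same str() conversion would apply to the other yield as well):

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        # str() turns the bytearray into a plain byte string on Python 2,
        # and assigning a fresh bytearray replaces the missing clear().
        yield _cont, str(collect)
        collect = bytearray()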

A little further investigation shows that werkzeug.wsgi.make_line_iter is itself too much of a bottleneck for optimizations in parse_lines() to help much. Look at this Python 3 test script:

import io
import time
from werkzeug.wsgi import make_line_iter

filename = 'test.bin' # Large binary file
lines = 0

# load a large binary file into memory
with open(filename, 'rb') as f:
    data = f.read()
    stream = io.BytesIO(data)
    filesize = len(data) / 2**20 # MB

start = time.perf_counter()
for _ in make_line_iter(stream):
    lines += 1
stop = time.perf_counter()
delta = stop - start

print('File size: %.2f MB' % filesize)
print('Time: %.1f seconds' % delta)
print('Read speed: %.2f MB/s' % (filesize / delta))
print('Number of lines yielded by make_line_iter: %d' % lines)

For a 923 MB video file with Python 3.5, the output looks something like this on my laptop:

File size: 926.89 MB
Time: 20.6 seconds
Read speed: 44.97 MB/s
Number of lines yielded by make_line_iter: 7562905

So even if you apply my optimization above and polish it to perfection, you'll still be limited to ~45 MB/s for large binary uploads, simply because make_line_iter can't give you the data fast enough and your boundary-checking loop will run 7.5 million iterations for 923 MB of data.

I guess the only great optimization will be to completely replace parse_lines() with something else. A possible solution that comes to mind is to read a reasonably large chunk of the stream into memory and then use string.find() (bytes.find() in Python 3) to check whether the boundary is in the chunk. In Python, find() is a highly optimized string search implemented in C, so that should give you good performance. You would just have to take care of the case where the boundary falls right between two chunks.
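
As a rough illustration of that idea (names and chunk size are made up, and real multipart parsing also has to handle part headers, the closing boundary and CRLF edge cases), something along these lines:

import io

def split_at_boundary(stream, boundary, chunk_size=64 * 1024):
    # Yield the data that precedes the multipart boundary, scanning each
    # chunk with bytes.find() instead of iterating line by line.
    delimiter = b'\r\n--' + boundary
    tail = b''  # keeps enough bytes to catch a boundary split across chunks
    while True:
        data = stream.read(chunk_size)
        if not data:
            if tail:
                yield tail
            return
        buf = tail + data
        pos = buf.find(delimiter)
        if pos != -1:
            yield buf[:pos]
            return
        # everything except the last len(delimiter) - 1 bytes cannot contain
        # the start of the delimiter, so it is safe to emit
        keep = len(delimiter) - 1
        yield buf[:-keep]
        tail = buf[-keep:]

# usage sketch
stream = io.BytesIO(b'file contents here\r\n--BOUNDARY--\r\n')
body = b''.join(split_at_boundary(stream, b'BOUNDARY'))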

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data
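
For anyone landing here, a minimal Flask sketch of how that parser is wired up (the target path and chunk size are placeholders; see the project's README for the authoritative API):

from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    # Parse the multipart body incrementally as it arrives instead of
    # letting werkzeug buffer and parse it after the fact.
    parser = StreamingFormDataParser(headers=request.headers)
    parser.register('file', FileTarget('/tmp/upload.bin'))
    while True:
        chunk = request.stream.read(64 * 1024)
        if not chunk:
            break
        parser.data_received(chunk)
    return 'ok'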

I guess the only great optimization will be to completely replace parse_lines()

+1 for this.

I am writing a bridge to stream users' uploads directly to S3 without any intermediate temp files, possibly with backpressure, and I find the werkzeug and flask situation frustrating. You can't move data directly between two pipes.

@lambdaq I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

@lambdaq Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

The only problem we had is that the werkzeug form parser eagerly checks the content length against the allowed max content length before knowing whether it should actually parse the request body.

This prevents you from setting a max content length for normal form data while still allowing very large file uploads.

We fixed it by reordering the checks in the function a bit. Not sure if it makes sense to provide this upstream, as some apps might rely on the existing behaviour.
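
The actual change was inside werkzeug, but the idea can be sketched at the application level roughly like this (the limit value and mimetype list are illustrative, not the fix described above):

from flask import Flask, request, abort

app = Flask(__name__)

# Leave app.config['MAX_CONTENT_LENGTH'] unset so huge streamed uploads pass,
# and enforce a limit only for bodies that the form parser would consume.
FORM_BODY_LIMIT = 16 * 1024 * 1024

@app.before_request
def limit_form_bodies():
    if request.mimetype in ('multipart/form-data',
                            'application/x-www-form-urlencoded'):
        if (request.content_length or 0) > FORM_BODY_LIMIT:
            abort(413)  # Request Entity Too Large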

Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

Unfortunately not. It's just normal form uploads with multipart.

I'd be happy to review a patch changing the behavior.

I tried to hack werkzeug.wsgi.make_line_iter or parse_lines() using generators' send(), so we could signal _iter_basic_lines() to emit whole chunks instead of lines. It turns out to be not so easy.

Basically, the rabbit hole starts with 'itertools.chain' object has no attribute 'send'... 😂
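
A tiny sketch of what that roadblock looks like (made-up generator, just to show why wrapping in itertools.chain breaks send()):

import itertools

def producer():
    mode = 'lines'
    while True:
        # the consumer could switch this generator into "chunk" mode via send()
        mode = (yield b'data') or mode

gen = producer()
next(gen)
gen.send('chunks')  # fine: a plain generator supports send()

wrapped = itertools.chain([b'prologue'], producer())
next(wrapped)
# wrapped.send('chunks')  # AttributeError: 'itertools.chain' object has no attribute 'send'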

I wonder how much this code could be sped up using native speedups written in C (or Cython etc.). I think it's important to handle semi-large files (a few hundred MB, but not huge as in many GB) more efficiently without having to change how the app uses them (i.e. streaming them directly instead of buffering). For many applications that would be overkill and isn't absolutely necessary (actually, even the current somewhat slow performance is probably OK for them), but making things faster is always nice!

Another possible solution is to offload the multipart parsing job to nginx:

https://www.nginx.com/resources/wiki/modules/upload/

Both repos look dead.

So is there no known solution to this?

There's a workaround 👆

Under uwsgi, we use its built-in chunked_read() function and parse the stream on our own as it comes in. It works 99% of the time, but it has a bug that I have yet to track down. See my earlier comment for an out-of-the-box streaming form parser. Under Python 2 it was slow, so we rolled our own and it is fast. :)
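
Roughly, that approach looks something like this under uWSGI (a sketch only; feed_parser is a placeholder for whatever incremental parser you use):

import uwsgi  # only importable inside a uWSGI worker

def application(env, start_response):
    # Read the request body chunk by chunk as it arrives and feed it to an
    # incremental parser instead of buffering the whole upload first.
    while True:
        chunk = uwsgi.chunked_read()
        if not chunk:
            break
        feed_parser(chunk)  # hypothetical: your own streaming parser
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok']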

Quoting from above:

I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

I don't really have time to work on this right now. If this is something that you are spending time on, please consider contributing a patch. Contributions are very welcome.

@sdizazzo

but it has a bug that I have yet to track down

are you talking about streaming-form-data? if so, I'd love to know what the bug is.

Our problem was that the slow form processing prevented concurrent request handling, which caused Nomad to think the process was hung and kill it.

My fix was to add a sleep(0) in werkzeug/formparser.py:MultiPartParser.parse_lines():

            for i, line in enumerate(iterator):
                if not line:
                    self.fail('unexpected end of stream')

                # give other greenlets a chance to run every 100 lines
                if i % 100 == 0:
                    time.sleep(0)

Search for "unexpected end of stream" if you want to apply this patch (the snippet uses time.sleep, so make sure time is imported).

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

Seconded.
This speeds up file uploads to my Flask app by more than a factor of 10.

@siddhantgoel
Thanks a lot for your fix with streaming-form-data. I can finally upload gigabyte-sized files at good speed and without memory filling up!

See #1788 which discusses rewriting the parser to be sans-io. Based on the feedback here, I think that would address this issue too.
