Requests: Streaming gzipped responses

Created on 31 Jul 2014  ·  10 Comments  ·  Source: psf/requests

I need to process big XML responses as a stream. The uncompressed responses can be multiple hundred megabytes in size, so loading them entirely into memory before handing them to the XML parser is not an option.

I'm using lxml to parse and I just hand the response.raw to its iterparse() function, as described somewhere in the requests docs. This works fine for uncompressed responses.

Unfortunately, the API I'm calling isn't particularly good. So it will sometimes return Content-Encoding: gzip even if I explicitly ask for uncompressed data. Also, the compression ratio on these extremely repetitive and verbose XML files is really good (10x+), so I'd really like to make use of compressed responses.

Is this possible with requests? I couldn't find it in the documentation. Researching deeper into urllib3, its HTTPResponse.read() method seems to support a decode_content parameter. If not set, urllib3 falls back to what's set in the constructor. When requests calls the constructor in requests.adapters.HTTPAdapter.send(), it explicitly sets decode_content to False.

Is there a reason why requests does that?

Strangely, iter_content() actually sets decode_content=True while reading. Why there? It all appears a bit arbitrary; I don't really understand the motivation for doing it one way in one place and another way in the other.
I can't really use iter_content() myself, of course, because I need a file-like object for lxml.

I previously wrote my own file-like object that I can hook in between requests and lxml, but of course buffering is hard and I feel like smarter people than me have written this before, so I'd prefer to not have to roll my own.
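For reference, the kind of wrapper meant here is roughly the following (a minimal sketch, not the author's actual code; the class name is made up, and the chunk source would be response.iter_content() in practice, which already decompresses):

```python
import io

class IterStream(io.RawIOBase):
    """Read-only file-like adapter over an iterable of byte chunks,
    e.g. response.iter_content(chunk_size=8192) from requests.
    Illustrative sketch of the buffering wrapper described above."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Top up the internal buffer until we can satisfy the request
        # or the chunk iterator is exhausted.
        while len(self._buf) < len(b):
            try:
                self._buf += next(self._chunks)
            except StopIteration:
                break
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n
```

lxml.etree.iterparse(IterStream(r.iter_content(8192))) would then see plain XML regardless of the transfer encoding, but as noted, getting the buffering right in such a wrapper is the fiddly part.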

What's your advice on how to handle this? Should requests be changed to default to decode_content=True in urllib3?

Labels: Contributor Friendly, Documentation, Planned

All 10 comments

No, it should not default to setting that, for a wide variety of reasons. What you should do is use functools.partial to replace the read method on the response (or wrap it some other way) so that you do something like:

response.raw.read = functools.partial(response.raw.read, decode_content=True)

and then pass response.raw to your parser.
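Spelled out, that pattern looks roughly like this (a sketch: the helper name and URL are made up, and the stdlib ElementTree iterparse stands in for lxml's, which accepts the same file-like object):

```python
import functools
import xml.etree.ElementTree as ET  # lxml.etree.iterparse works the same way

def decoded_raw(response):
    # Force every read() on the underlying urllib3 response to pass
    # decode_content=True, so gzip/deflate bodies come back decompressed.
    response.raw.read = functools.partial(response.raw.read, decode_content=True)
    return response.raw

# Hypothetical usage:
# r = requests.get("https://example.com/huge.xml", stream=True)
# for event, elem in ET.iterparse(decoded_raw(r)):
#     ...  # process the element
#     elem.clear()  # keep memory flat while streaming
```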

@sigmavirus24 Thanks, that's definitely an elegant solution to the problem I outlined above!

I would recommend adding that to requests' documentation, e.g. in the FAQ: http://docs.python-requests.org/en/latest/community/faq/#encoded-data
Currently, the statement "Requests automatically decompresses gzip-encoded responses" is not correct for the stream=True case and can lead to surprises.

As for my problem, as you've read on the urllib3 issue, the urllib3 implementation of the gzip decompression has its own little quirks I have to work around in my code, but that is no longer a problem for requests.

but that is no longer a problem for requests.

As in you feel this can be closed?

@sigmavirus24 I believe it should be documented, as the current documentation is incorrect.

But if you disagree with that, yes, close away!

The documentation could be clearer. To me (and this is entirely because I'm a core developer) the first paragraph speaks to the 90% of users who will never touch the raw response, while the second paragraph contradicts the first in saying "but if you need to access the raw data, it's there for you". Like I said, that's apparent to me, but I can see how that could be made clearer. I'll work on that tonight.

For me, it's more that I would have interpreted "raw data" as "raw payload", i.e. a decompressed stream. I just have to read it in whatever chunks I need. As opposed to .content, which is a decompressed blob (also the payload, but in a different form).

The actual decompression feels like a concern of the HTTP library to me—an implementation detail of HTTP if you will, one that I would expect requests to abstract away. Whether I read the payload from requests as a stream or as a prefetched blob of data wouldn't make a difference. Either way, requests would abstract the implementation detail 'compression'.

(This assumption was also at the core of my original request to default decode_content to True. Of course now that I see what a leaky abstraction this is, I'm no longer suggesting that.)

But yeah, I absolutely agree that 99% of your users will never be affected by this detail.

Feel free to close this issue.

So this actually leads to something that's been rattling around in my head for a while and which I haven't proposed yet because it would be a significant API change.

I don't like the fact that we suggest people use r.raw because it's an object which we don't document and it's an object provided by urllib3 (which we've claimed in the past is more of an implementation detail). With that in mind, I've been toying with the idea of providing methods on a Response object which just proxy to the urllib3 methods (read would just proxy to raw.read, etc.). This gives us extra flexibility around urllib3 and allows us to handle (on behalf of the users) an API change in urllib3 (which historically has almost never been a problem, so there isn't any urgency in that).

With that said, we already have enough methods on a Response object in my opinion and growing our API isn't ideal. The best API is the API from which there's nothing left to remove. So I'm continuously on the fence about this.
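The proposal, as described, would look roughly like this (purely illustrative; this API was never part of requests, and the class name is invented):

```python
class ProxiedResponse:
    """Illustrative sketch only: a Response-like object exposing read()
    that proxies to the underlying urllib3 response, so callers never
    touch .raw directly. Not an actual requests API."""

    def __init__(self, raw):
        self.raw = raw  # the urllib3 HTTPResponse

    def read(self, amt=None, decode_content=True):
        # Forward straight to urllib3, defaulting to decompressed output;
        # a future urllib3 API change would be absorbed here.
        return self.raw.read(amt, decode_content=decode_content)
```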


This assumption was also at the core of my original request to default decode_content to True. Of course now that I see what a leaky abstraction this is, I'm no longer suggesting that.

For others who find this and may not be certain why this is true, allow me to explain.

There are several users of requests who turn off automatic decompression to validate the length of a response, or to do other important things with it. One consumer of the former kind is OpenStack. Many of the OpenStack clients validate the Content-Length header sent to the client against the actual length of the body received. To them, handling decompression themselves is a fair trade-off for being certain they're receiving and handling a valid response.

Another consumer is Betamax (or really any tool that (re)constructs Response objects) because when it is handling the full process of making a totally valid response, it needs the content to be in the compressed format.

I'm sure there are others that neither @Lukasa nor I know about that also rely heavily on this behaviour.

Hit the same issue today, and ended up making the same assumption, as there is no other way to stream decompressed responses at the moment.

Rather than multiple new methods on Response, why not a single new attribute, e.g. response.stream, which would play the same role of proxying to .raw? It would also nicely mirror the stream=True parameter, and would not affect users who need the current .raw behaviour.

I've done this in the past

r = requests.get('url', stream=True)
r.raw.decode_content = True
...
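The effect of that workaround can be demonstrated offline by constructing a gzip-encoded urllib3 response by hand, the way response-replaying tools do (a sketch; the payload and headers are illustrative, and this assumes your installed urllib3 accepts a file-like body, which both 1.x and 2.x do):

```python
import gzip
import io
import shutil

import urllib3

# Build a gzip-encoded response locally so the sketch runs without a
# network; with requests you would get an equivalent object as r.raw
# from requests.get(url, stream=True).
payload = b"<root>" + b"<item>spam</item>" * 200 + b"</root>"
raw = urllib3.HTTPResponse(
    body=io.BytesIO(gzip.compress(payload)),
    headers={"Content-Encoding": "gzip"},
    status=200,
    preload_content=False,
)

# The workaround above: flip decode_content so that every subsequent
# raw.read() hands back decompressed bytes.
raw.decode_content = True

# raw now behaves as a file-like object over the *decompressed* body:
out = io.BytesIO()
shutil.copyfileobj(raw, out)
assert out.getvalue() == payload
```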

Note that the workaround by @sigmavirus24 breaks the semantics of the tell method, which will return incorrect offsets.

I ran into this when streaming a response as a resumable upload into the Google Cloud Storage API, which uses tell() to figure out the number of bytes that were just read (here).
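The tell() mismatch is easy to reproduce offline with a hand-built urllib3 response (a sketch; all values are illustrative): tell() counts compressed bytes consumed from the wire, while read() with decode_content=True returns decompressed bytes, so the two disagree.

```python
import gzip
import io

import urllib3

payload = b"x" * 10000  # highly compressible body
raw = urllib3.HTTPResponse(
    body=io.BytesIO(gzip.compress(payload)),
    headers={"Content-Encoding": "gzip"},
    status=200,
    preload_content=False,
)
raw.decode_content = True

body = raw.read()
assert body == payload
# tell() reports compressed bytes read from the underlying stream, not
# the decompressed bytes handed to the caller, so offsets disagree:
assert raw.tell() != len(body)
```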
