Requests: no way to read uncompressed content as file-like object

Created on 29 Feb 2012  ·  44 Comments  ·  Source: psf/requests

According to the documentation, there are three ways to read the content of the response: .text, .content and .raw. The first two consider the transfer encoding and decompress the stream automatically when producing their in-memory result. However, especially for the case that the result is large, there is currently no simple way to get at the decompressed result in the form of a file-like object, e.g. to pass it straight into an XML or Json parser.

From the point of view of a library that aims to make HTTP requests user friendly, why should a user have to care about something as low-level as the compression type of the stream that was internally negotiated between the web server and the library? After all, it's the library's "fault" if it defaults to accepting such a stream. In this light, the .raw stream is a bit too raw for my taste.

Maybe a fourth property like .stream might provide a better abstraction level?
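A minimal sketch of the pain point, assuming a server that honours gzip (the URL, the stream=True flag and the .stream attribute are illustrative, not existing API):

import json
import requests

r = requests.get('https://api.example.com/data.json', stream=True)

# Works, but materialises the whole decoded body in memory first:
data = json.loads(r.text)

# Not safe: .raw hands the parser the still-compressed transfer stream
# whenever the server actually chose gzip/deflate:
# data = json.load(r.raw)

# What this report asks for, a decompressing file-like object:
# data = json.load(r.stream)  # hypothetical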

All 44 comments

Response.iter_content

Erm, no, that's an iterator. I was asking for a file-like object, i.e. something that document processors can read from directly.

It would be pretty simple to make a file-like object with iter_content

Thanks for the quick reply, BTW.

I agree. Still, it would be even easier for requests to provide this functionality. My point is that .raw is the wrong level of abstraction for most use cases that want to read from the stream, because it exposes transfer level details.

Personally, I don't see a major use case for iterating line by line or even chunk by chunk over the result of an HTTP request, but I see several major use cases for parsing from it as a file-like object, specifically response formats that require a document parser, such as HTML, XML, Json etc.

Note also that it's much easier to write an iterator that wraps a file-like object, than a file-like object that wraps an iterator.
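For comparison, here is a sketch of the easy direction: turning a file-like object into an iterator is a short generator, whereas the reverse, as the wrapper classes below illustrate, needs buffering and length bookkeeping.

def iter_chunks(fileobj, chunk_size=8192):
    # Wrap any object with a read() method in a chunk iterator.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk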

I came up with the following code. It handles all necessary cases, but I find it rather complex. That's why I said I wanted something like this as part of the library. Users shouldn't have to figure this out themselves.

I think that the code inside of requests' models.py uses the wrong abstraction here. It should decompress the raw stream _before_ it starts with its iteration machinery, not during iteration. Going from a file-like to an iterator just to go back to a file-like is just plain stupid. A single API transformation is more than enough and most users won't care about content iterators anyway.

import zlib

class FileLikeDecompressor(object):
    """
    File-like object that wraps and decompresses an HTTP stream transparently.
    """
    def __init__(self, stream, mode='gzip'):
        self.stream = stream
        # 16 + MAX_WBITS makes zlib expect a gzip header/trailer,
        # -MAX_WBITS means a raw deflate stream without zlib header.
        zlib_mode = 16 + zlib.MAX_WBITS if mode == 'gzip' else -zlib.MAX_WBITS
        self.dec = zlib.decompressobj(zlib_mode)
        self.data = ''

    def read(self, n=None):
        if self.dec is None:
            return ''  # all done
        if n is None:
            # read and decompress everything that is left in one go
            data = self.data + self.dec.decompress(self.stream.read()) + self.dec.flush()
            self.data = self.dec = None
            return data
        while len(self.data) < n:
            new_data = self.stream.read(n)
            self.data += self.dec.decompress(new_data)
            if not new_data:
                # underlying stream is exhausted; flush any remaining output
                self.data += self.dec.flush()
                self.dec = None
                break
        if self.data:
            data, self.data = self.data[:n], self.data[n:]
            return data
        return ''

def decompressed(response):
    """
    Return a file-like object that represents the uncompressed HTTP response data.
    For compressed HTTP responses, wraps the stream in a FileLikeDecompressor.
    """
    stream = response.raw
    mode = response.headers.get('content-encoding')
    if mode in ('gzip', 'deflate'):
        return FileLikeDecompressor(stream, mode)
    return stream
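As an illustration of how the helper might be used, one could feed the stream straight into a parser (lxml's etree.parse accepts any object with a read() method; the URL and the stream flag are placeholders):

from lxml import etree
import requests

response = requests.get('https://example.com/feed.xml', stream=True)
tree = etree.parse(decompressed(response))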

Why don't you build the file-like object from iter_content, as proposed? This could look like:

class FileLikeFromIter(object):
    """File-like object that wraps a content iterator (e.g. Response.iter_content())."""
    def __init__(self, content_iter):
        self.iter = content_iter
        self.data = ''

    def __iter__(self):
        return self.iter

    def read(self, n=None):
        if n is None:
            # drain the iterator completely
            data = self.data + ''.join(self.iter)
            self.data = ''
            return data
        else:
            # buffer chunks until at least n bytes are available (or the iterator ends)
            while len(self.data) < n:
                try:
                    self.data += next(self.iter)
                except StopIteration:
                    break
            result, self.data = self.data[:n], self.data[n:]
            return result
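A usage sketch for the wrapper above (URL and chunk size are illustrative):

import requests

r = requests.get('https://example.com/big.json', stream=True)
f = FileLikeFromIter(r.iter_content(chunk_size=8192))
head = f.read(1024)  # first kilobyte of the decoded body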

You may want to read my comment again, specifically the paragraph that precedes the code I posted.

Yes, but this solution is still cleaner (and IMO easier) than doing the decompression in a second place, because decompression is already built into requests.

But I agree with you in general, an r.file (or something like this) has many more use cases than r.raw. So I would like to see this included in requests, too. @kennethreitz

"response.stream" sounds like a good name to me.

This is what response.raw is for :)

That was also what I intuitively thought when I saw it. But then I realised that response.raw is broken because it exposes internal details of the underlying transport layer that users shouldn't have to care about.

The only method they should need is raw.read?

Well, yes - except that raw.read() behaves differently depending on the internal negotiations between client and server. It sometimes returns the expected data and sometimes it returns bare compressed bytes.

Basically, response.raw is a nice-to-have feature that most users would happily ignore and some power-users could find helpful, whereas a compression-independent response.stream is a feature that most streaming users would want.

+1

+1

Is this design bug going to be fixed?

Not sure how correct or efficient this is, but the following works for me:

>>> import lxml.etree  # a parser that scorns encoding
>>> unicode_response_string = response.text
>>> lxml.etree.XML(bytes(bytearray(unicode_response_string, encoding='utf-8')))  # provided unicode() means utf-8
<Element html at 0x105364870>

@kernc: That's a bizarre thing to be doing. response.content is already a bytestring, so what you're doing here is decoding the content with whatever the hell codec Python chooses, then re-encoding it as utf-8.

This is _not_ a bug, and it is quite definitely not the bug you suggested. If you really need a file-like object, I recommend StringIO and BytesIO.
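For the record, a minimal sketch of that workaround, assuming the whole payload fits comfortably in memory (the URL is a placeholder):

import io
import json
import requests

response = requests.get('https://api.example.com/data.json')
fileobj = io.BytesIO(response.content)  # already-decoded bytes, fully buffered
data = json.load(fileobj)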

@Lukasa is correct. content should always be a bytestring (in Python 3 it's an explicit bytestring; in Python 2 str == bytes). The only item that is not a bytestring is text.

@kennethreitz any news on this? This is a pretty serious design bug and it's best to sort it out early. The more code gets written to work around it, the more costly it becomes for everyone.

This is no design bug, it is just a feature request. And as requests has a feature freeze I assume this won't make it in requests anytime soon (if at all)...

I don't think redeclaring a long-standing design bug a "missing feature" makes it go away all that easily. I heard that the author is thinking about making "requests" part of the Python stdlib. That would be a good opportunity to fix this.

I heard that the author is thinking about making "requests" part of the Python stdlib.

Not really: http://docs.python-requests.org/en/latest/dev/philosophy/#standard-library

This is not a bug, it's a feature request. Requests isn't doing anything wrong, it's simply not doing something that is optional. That's the very definition of a feature.

Additionally, preparing for the stdlib is exactly why Requests is in feature freeze. Once Requests is in the stdlib it becomes very hard to make timely bug fixes. As a result, if adding the new feature adds bugs or regresses behaviour, the version in stdlib can't be fixed until the next minor release. That would be bad.

Marc Schlaich, 19.03.2013 08:41:

I heard that the author is thinking about making "requests" part of the Python stdlib.

Not really: http://docs.python-requests.org/en/latest/dev/philosophy/#standard-library

I read it here:

http://python-notes.boredomandlaziness.org/en/latest/conferences/pyconus2013/20130313-language-summit.html

Stefan

I already explained why this is a design bug and not a feature request: the existing API uses the wrong abstraction and leaks negotiation details of the connection into user space that are at the mercy of the remote site, and thus, that the user should not have to care about. That renders the current raw stream reading support hard to use. Essentially, this is a request for fixing a feature that's broken, not a request for a new feature.

Let me sum this up cleanly. The bug is that any real world usage of the raw stream reading feature will have to reimplement a part of the library, specifically the entire conditional stream decompression part, because the feature is useless without it, as soon as the client permits compression. We are talking about code here that is already there, in "requests" - it's just used in the wrong spot. It should be used below the raw reading level, not above it, because the client cannot control if the server honours the accept header or not. Compression should be a transparent negotiation detail of the connection, not something that hurts any user who enables the relevant header.

I cannot think of any use case where the client would be interested in the compressed stream, especially if it cannot predict whether the stream will really be compressed or not, as the server can happily ignore the client's wish. It's a pure negotiation detail. That's why raw stream reading uses the wrong abstraction by preferring the extremely unlikely use case over the most common one.

I can. For instance, what if you were downloading a large text-based file and wanted to keep it compressed? I could follow-up this change with a new 'design bug' entitled No way to save originally-compressed data to disk.

That idea is intentionally trite and stupid, but I'm trying to illustrate a point, which is this: Requests is not obliged to offer everyone exactly the interaction mechanism they desire. In fact, doing so would run directly counter to the main goal Requests has, which is simplicity of API. There is a long, long, _long_ list of proposed changes to Requests that were objected to because they complicate the API, even though they added useful functionality. Requests does not aim to replace urllib2 for all use cases, it aims to simplify the most common cases.

In this case, Requests assumes that most users don't want file-like objects, and therefore proposes the following interactions:

  • Response.text and Response.content: You want all the data in one go.
  • Response.iter_lines() and Response.iter_content(): You don't want all the data in one go.
  • Response.raw: You aren't happy with the other two options so do it yourself.

These were chosen because they overwhelmingly represent the common uses of Requests. You have said "most users won't care about content iterators anyway" and "response.stream is a feature most streaming users would want". Experience on this project leads me to disagree: a great many people use the content iterators, and not many desperately want file-like objects.

One final point: if compression should be a transparent negotiation detail of the connection, then you should raise the appropriate bug against urllib3, which handles our connection logic.

I am sorry that you feel like Requests is inappropriate for your use case.

I get your point that response.raw is broken in the current implementation and even partially agree with that (you should at least be able to get compression details without parsing the headers).

However, your proposal is still a feature request...

@Lukasa
I can't really see how filing the bug against urllib3 would fix the API of requests, at least not all by itself.

And I agree that your "use case" is contrived. As I said, the client cannot positively control compression on the server side (it can disable it, but not reliably enable it), so relying on it in order to save a compressed file to disk is, well, not so interesting.

@schlamar
I agree that it can be read as such. I assure you that I'm fine with anything that solves this problem. If opening a new ticket is required in order to get there, so be it.

If opening a new ticket is required in order to get there, so be it.

I still think that Kenneth will reject this due to the feature freeze.

I'm fine with anything that solves this problem

  1. Wrap iter_content as file-like object or
  2. Parse the headers and decompress response.raw if appropriate

Both solutions are in comments above, the latter one posted by you. Why is it such an issue that this won't be in requests directly?

Let's be 100% clear here: there is basically no chance this will get into the Requests while it's in feature freeze. Nothing is broken, the API is just not perfect for your needs. Because nothing is broken, the only thing that matters is whether Kenneth wants it. Requests is not a democracy, it's one man one vote. Kenneth is the man, he has the vote. Kenneth closed this issue 8 months ago, so it seems pretty clear he doesn't want it.

I can't really see how filing the bug against urllib3 would fix the API of requests, at least not all by itself.

Patching urllib3 to always return the uncompressed file object should solve this by itself (not that I'm saying this is a good idea).

Oh, here is solution number 3 (untested):

import functools
response.raw.read = functools.partial(response.raw.read, decode_content=True)

See https://github.com/shazow/urllib3/blob/master/urllib3/response.py#L112

Interesting - I didn't know that existed by now. That makes it much easier to wrap the feature, sure.

Although, does that actually work? I.e. are the decompressors stateful and incremental? The second call to read(123) will not return the valid start of a gzip file anymore, for example.

Although, does that actually work? I.e. are the decompressors stateful and incremental?

Oh, doesn't seem so. I didn't read the docstring.

However, here is my proposal:

  1. Patch urllib3 so that HTTPResponse.read works with amt and decode_content concurrently.
  2. Make HTTPResponse._decode_content a public member (so you can do response.raw.decode_content = True instead of patching the read method).
  3. Drop decompression in requests completely by using decode_content=True in iter_content

@Lukasa I think this won't violate the feature freeze, right?

@schlamar: In principle, sure. As long as the API remains unchanged, internal changes _should_ be ok, and I'd be +1 on this one. However, bear in mind that I'm not the BDFL, =)

stream_decompress in requests is broken anyway: #1249

+1
