Requests: Response.text returns improperly decoded text (requests 1.2.3, python 2.7)

Created on 15 Sep 2013 · 24Comments · Source: psf/requests

If http server returns Content-type: text/* without encoding, Response.text always decode it as 'ISO-8859-1' text.

It may be valid in RFC2616/3.7.1, but this is wrong in real life in 2013.

I made example page with chinese text:
http://lavr.github.io/python-emails/tests/requests/some-utf8-text.html
All browsers renders this page properly.
But reguests.get returns invalid text.

And here is simple test with that url:
https://gist.github.com/lavr/6572927

Source

lavr

Most helpful comment

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

kennethreitz42 on 15 Sep 2013

👍27 ❤4

All 24 comments

Thanks for this @lavr!

This is a deliberate design decision for Requests. We're following the spec unless we find ourselves in a position where the specification diverges so wildly from real world behaviour that it becomes a problem (e.g. GET after 302 response to a POST).

If the upstream server knows what the correct encoding is, it should signal it. Otherwise, we're going to follow what the spec says. =)

If you think the spec default is a bad one, I highly encourage you to get involved with the RFC process for HTTP/2.0 in order to get this default changed. =)

Lukasa on 15 Sep 2013

What @Lukasa said + the fact that if the encoding retrieved from the headers is non-existent we rely on charade to guess at the encoding. With so few characters, charade will not return anything definitive because it uses statistical data to guess at what the right encoding is.

Frankly, the year makes no difference and does not change specification either.

If you know what encoding you're expecting you can also do the decoding yourself like so:

text = str(r.content, '<ENCODING>', errors='replace')

There is nothing wrong with requests as far as I'm concerned and this is not a bug in charade either. Since @Lukasa seems to agree with me, I'm closing this.

sigmavirus24 on 15 Sep 2013

❤1

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

kennethreitz42 on 15 Sep 2013

👍27 ❤4

@kennethreitz that's disappointing. Why are we making that easy for people? =P

sigmavirus24 on 15 Sep 2013

Absolutely :)

kennethreitz42 on 15 Sep 2013

Mostly for Japanese websites. They all lie about their encoding.

kennethreitz42 on 15 Sep 2013

@sigmavirus24
please note, that utils.get_encoding_from_headers always returns 'ISO-8859-1', and charade has no chance to be called.
so bug is: we expect that charade is used to guess encoding, but it is not.

lavr on 15 Sep 2013

A patch above fixes a bug, but still follows RFC.
Please, consider to review it.

lavr on 15 Sep 2013

@lavr Sorry, we didn't make this very clear. We do _not_ expect charade to be called in this case. The RFC is very clear: if you don't specify a charset, and the MIME type is text/*, the encoding must be assumed to be ISO-8859-1. That means "don't guess". =)

Lukasa on 16 Sep 2013

@lavr: just set r.encoding to None, and it'll work as you expect (I think).

kennethreitz42 on 16 Sep 2013

Or do r.encoding = r.apparent_encoding.

Lukasa on 16 Sep 2013

👍2

Even better.

kennethreitz42 on 16 Sep 2013

On r.encoding = None and r.encoding = r.apparent_encoding we lost server charset information.
Totally ignoring server header is not good solution, I think.

Right solution is something like this:

r = requests.get(...)
params = cgi.parse_header(r.headers.get('content-type'))[0]
server_encoding = ('charset' in params) and params['charset'].strip("'\"") or None
r.encoding = server_encoding or r.apparent_encoding
text = r.text

Looks weird :(

lavr on 16 Sep 2013

Or do this:

r = requests.get(...)

if r.encoding is None or r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding

Lukasa on 16 Sep 2013

I don't think so :)

Condition r.encoding is None has no sense, because r.encoding can never be None for content-type=text/*.

r.encoding == 'ISO-8859-1'... what does it mean ? Server sent charset='ISO-8859-1' or server sent no charset? If first, I shouldn't guess charset.

lavr on 16 Sep 2013

@lavr I was covering the non-text bases. You can rule out the charset possibility by using this condition instead:

r.encoding == 'ISO-8859-1' and not 'ISO-8859-1' in r.headers.get('Content-Type', '')

Lukasa on 16 Sep 2013

@Lukasa
Well, I can use this hack.
And everybody in Eastern Europe and Asia can use it.

But what if we fix it in requests ? ;)
What if requests can honestly set enconding=None on response without charset ?

lavr on 16 Sep 2013

As we've discussed many times, Requests is following the HTTP specification to the letter. The current behaviour is not wrong. =)

Lukasa on 16 Sep 2013

The fact that it is not helpful for your use case is a whole other story. =)

Lukasa on 16 Sep 2013

Alright, that's enough discussion on this. Thanks for the feedback.

kennethreitz42 on 16 Sep 2013

Updated HTTP 1.1 obsoletes ISO-8859-1 default charset: http://tools.ietf.org/html/rfc7231#appendix-B

lavr on 8 Jun 2014

We're already tracking this in #2086. =)

Lukasa on 8 Jun 2014

To whom it may concern, here is a compatibility patch

create file requests_patch.py with following code and import it, then the problem should be solved.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import chardet

def monkey_patch():
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = chardet.detect(_content)['encoding']

            if self.encoding:
                _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                self.encoding = 'utf8'

        return _content
    requests.models.Response.content = property(content)

monkey_patch()

passos on 19 May 2017

👍2

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.
>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'
Then, proceed normally.

Thanks for this! Any idea on how to do it in one line?