Requests: Response.text returns improperly decoded text (requests 1.2.3, python 2.7)

Created on 15 Sep 2013  ·  24Comments  ·  Source: psf/requests

If http server returns Content-type: text/* without encoding, Response.text always decode it as 'ISO-8859-1' text.

It may be valid in RFC2616/3.7.1, but this is wrong in real life in 2013.

I made example page with chinese text:
http://lavr.github.io/python-emails/tests/requests/some-utf8-text.html
All browsers renders this page properly.
But reguests.get returns invalid text.

And here is simple test with that url:
https://gist.github.com/lavr/6572927

Most helpful comment

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

All 24 comments

Thanks for this @lavr!

This is a deliberate design decision for Requests. We're following the spec unless we find ourselves in a position where the specification diverges so wildly from real world behaviour that it becomes a problem (e.g. GET after 302 response to a POST).

If the upstream server knows what the correct encoding is, it should signal it. Otherwise, we're going to follow what the spec says. =)

If you think the spec default is a bad one, I highly encourage you to get involved with the RFC process for HTTP/2.0 in order to get this default changed. =)

What @Lukasa said + the fact that if the encoding retrieved from the headers is non-existent we rely on charade to guess at the encoding. With so few characters, charade will not return anything definitive because it uses statistical data to guess at what the right encoding is.

Frankly, the year makes no difference and does not change specification either.

If you know what encoding you're expecting you can also do the decoding yourself like so:

text = str(r.content, '<ENCODING>', errors='replace')

There is nothing wrong with requests as far as I'm concerned and this is not a bug in charade either. Since @Lukasa seems to agree with me, I'm closing this.

@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

@kennethreitz that's disappointing. Why are we making that easy for people? =P

Absolutely :)

Mostly for Japanese websites. They all lie about their encoding.

@sigmavirus24
please note, that utils.get_encoding_from_headers always returns 'ISO-8859-1', and charade has no chance to be called.
so bug is: we expect that charade is used to guess encoding, but it is not.

A patch above fixes a bug, but still follows RFC.
Please, consider to review it.

@lavr Sorry, we didn't make this very clear. We do _not_ expect charade to be called in this case. The RFC is very clear: if you don't specify a charset, and the MIME type is text/*, the encoding must be assumed to be ISO-8859-1. That means "don't guess". =)

@lavr: just set r.encoding to None, and it'll work as you expect (I think).

Or do r.encoding = r.apparent_encoding.

Even better.

On r.encoding = None and r.encoding = r.apparent_encoding we lost server charset information.
Totally ignoring server header is not good solution, I think.

Right solution is something like this:

r = requests.get(...)
params = cgi.parse_header(r.headers.get('content-type'))[0]
server_encoding = ('charset' in params) and params['charset'].strip("'\"") or None
r.encoding = server_encoding or r.apparent_encoding
text = r.text

Looks weird :(

Or do this:

r = requests.get(...)

if r.encoding is None or r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding

I don't think so :)

Condition r.encoding is None has no sense, because r.encoding can never be None for content-type=text/*.

r.encoding == 'ISO-8859-1'... what does it mean ? Server sent charset='ISO-8859-1' or server sent no charset? If first, I shouldn't guess charset.

@lavr I was covering the non-text bases. You can rule out the charset possibility by using this condition instead:

r.encoding == 'ISO-8859-1' and not 'ISO-8859-1' in r.headers.get('Content-Type', '')

@Lukasa
Well, I can use this hack.
And everybody in Eastern Europe and Asia can use it.

But what if we fix it in requests ? ;)
What if requests can honestly set enconding=None on response without charset ?

As we've discussed many times, Requests is following the HTTP specification to the letter. The current behaviour is not wrong. =)

The fact that it is not helpful for your use case is a whole other story. =)

Alright, that's enough discussion on this. Thanks for the feedback.

Updated HTTP 1.1 obsoletes ISO-8859-1 default charset: http://tools.ietf.org/html/rfc7231#appendix-B

We're already tracking this in #2086. =)

To whom it may concern, here is a compatibility patch

create file requests_patch.py with following code and import it, then the problem should be solved.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import chardet

def monkey_patch():
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = chardet.detect(_content)['encoding']

            if self.encoding:
                _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                self.encoding = 'utf8'

        return _content
    requests.models.Response.content = property(content)

monkey_patch()


@lavr (/cc @sigmavirus24), even easier than that, you can simply provide the encoding yourself.

>>> r = requests.get('http://irresponsible-server/')
>>> r.encoding = 'utf-8'

Then, proceed normally.

Thanks for this! Any idea on how to do it in one line?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

cnicodeme picture cnicodeme  ·  3Comments

ghtyrant picture ghtyrant  ·  3Comments

avinassh picture avinassh  ·  4Comments

brainwane picture brainwane  ·  3Comments

ReimarBauer picture ReimarBauer  ·  4Comments