Pandas: read_csv(filename_with_asian_locale) failed in python 3.6 for windows

Created on 5 Jun 2017 · 3Comments · Source: pandas-dev/pandas

Code:

Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
>>> pd.__version__
'0.20.1'
>>> import platform
>>> platform.platform()
'Windows-7-6.1.7601-SP1'
>>> import pandas as pd
>>> df = pd.read_csv(r'c:\tmp\中文.csv')
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-0cd6317422e5>", line 1, in <module>
    df = pd.read_csv(r'c:\tmp\中文.csv')
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 762, in __init__
    self._make_engine(self.engine)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 966, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1582, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)
  File "pandas\_libs\parsers.pyx", line 712, in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)
OSError: Initializing from file failed

Problem description

python 3.6 changed sys.getfilesystemencoding() to return "utf-8" instead of "mbcs"
see PEP 529.

How to fix

Here is the problem: parsers.pyx

if isinstance(source, basestring):
     if not isinstance(source, bytes):
         source = source.encode(sys.getfilesystemencoding() or 'utf-8')

the source parameter is our filename, and will be encoded to 'utf-8', not legacy 'mbcs' in python 3.6
and finally passed to open() in io.c:new_file_source
thus interpreted as a mbcs string, so, the "File not found" exception is not suprised
maybe this should be the responsiblity of cython for python 3.6 to handle these things by using unicode version of windows API,
but for now, we just replace sys.getfilesystemencoding() to "mbcs"

Duplicate IO CSV Unicode

Source