Pandas: OSError when reading file with accents in file path

Created on 9 Jan 2017  ·  27 Comments  ·  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

test.txt and test_é.txt are the same file; only the name changes:

pd.read_csv('test.txt')
Out[3]: 
   1 1 1
0  1 1 1
1  1 1 1

pd.read_csv('test_é.txt')
Traceback (most recent call last):

  File "<ipython-input-4-fd67679d1d17>", line 1, in <module>
    pd.read_csv('test_é.txt')

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 389, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 730, in __init__
    self._make_engine(self.engine)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 1390, in __init__
    self._reader = _parser.TextReader(src, **kwds)

  File "pandas\parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4184)

  File "pandas\parser.pyx", line 669, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8471)

OSError: Initializing from file failed

Problem description

Pandas raises an OSError when trying to read a file with accents in the file path.

The problem is new (since I upgraded to Python 3.6 and Pandas 0.19.2).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.9.3
boto: None
pandas_datareader: None

Bug IO CSV Unicode Windows

All 27 comments

Just my pennies worth. Quickly tried it out on Mac OSX and Ubuntu with no
problems. See below.

Could this be an environment/platform problem? I noticed that the LOCALE is
set to None.None. Unfortunately I do not have a windows machine to try this
example on. Admittedly this would not explain why you've seen this after the
upgrade to python3.6 and pandas 0.19.2.

Note: I just set up a virtualenv with python3.6 and installed pandas 0.19.2 using pip.

>>> import pandas as pd
>>> pd.read_csv('test_é.txt')
   a  b  c
0  1  2  3
1  4  5  6

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

I believe 3.6 switches the file system encoding on Windows to UTF-8 (from the legacy ANSI code page, "mbcs"; see PEP 529). Apart from that, we don't have testing enabled yet on Windows for 3.6 (as some of the required packages are just now becoming available).

@JGoutin

I just added build support on AppVeyor (Windows) for 3.6, so it would be great if you could push up your tests to see whether it works.

I ran into the same problem: the program stops at pd.read_csv(file_path). It started after I upgraded Python to 3.6 (I'm not sure exactly which version I had installed before, maybe 3.5).

@jreback what is the next step towards a fix here?
You have mentioned a PR that got 'blown away' - what does it mean?

While I do not use Windows, I could try to help (I just got a VM to debug a piece of my code that apparently does not work on Windows).

BTW, a workaround: pass a file handle instead of a name
pd.read_csv(open('test_é.txt', 'r'))
(there are several workarounds in related issues, but I have not seen this one)
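A minimal sketch of the file-handle workaround, using a context manager so the handle is closed after reading. The helper name read_csv_via_handle and the UTF-8 default for the file contents are assumptions for illustration, not part of pandas. The idea is that Python itself opens the non-ASCII path, so pandas never has to resolve the file name.

```python
import pandas as pd


def read_csv_via_handle(path, encoding="utf-8", **kwargs):
    """Open the file in Python and hand the open handle to pandas.

    `path` may contain non-ASCII characters; `encoding` describes the
    file *contents*, not the file name (hypothetical helper).
    """
    with open(path, "r", encoding=encoding, newline="") as handle:
        # pandas reads from the handle, so it never opens the path itself.
        return pd.read_csv(handle, **kwargs)
```

Usage would look like read_csv_via_handle('test_é.txt'); extra keyword arguments such as sep are forwarded to pd.read_csv.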

@tpietruszka see comments on the PR: https://github.com/pandas-dev/pandas/pull/15092 (it got removed from a private fork, was pretty much there).

You basically need to encode the paths differently on Python 3.6 (vs. other Python versions) on Windows. In other words, PEP 529 needs to be implemented: https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8
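To see which filesystem encoding a given interpreter is using, something like the following sketch may help when reporting results in this thread. The function name describe_fs_encoding is made up for illustration; the sys calls are standard library. Per PEP 529, Python 3.6+ on Windows is expected to report utf-8 unless the legacy mode has been enabled, while earlier versions report mbcs.

```python
import sys


def describe_fs_encoding():
    """Collect the interpreter settings relevant to path encoding."""
    return {
        "python": sys.version_info[:3],
        "platform": sys.platform,
        # Encoding used to convert str paths to bytes for the OS.
        "fs_encoding": sys.getfilesystemencoding(),
        # Error handler used together with that encoding (Python 3.6+).
        "fs_errors": sys.getfilesystemencodeerrors(),
    }


info = describe_fs_encoding()
for key, value in info.items():
    print(f"{key}: {value}")
```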

My old code (does not run):

import pandas as pd
import os
file_path='./dict/字典.csv'
df_name = pd.read_csv(file_path,sep=',' )

New code (successful):

import pandas as pd
import os
file_path='./dict/dict.csv'
df_name = pd.read_csv(file_path,sep=',' )

I think this bug is a file name problem.
After I changed the file name from Chinese to English, the code runs.

If anyone comes here like me because he/she hit the same problem, here is a solution until pandas is fixed to work with PEP 529 (basically, any non-ASCII characters in your path or file name will result in errors):

Insert the following two lines at the beginning of your code to revert to the old way of handling paths on Windows:

import sys
sys._enablelegacywindowsfsencoding()
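A slightly more defensive variant of the two lines above, as a sketch: sys._enablelegacywindowsfsencoding() only exists on Windows builds of Python 3.6+, and as a private (underscore) function it may change between Python versions, so the call is guarded here. The wrapper name enable_legacy_windows_fs_encoding is hypothetical.

```python
import sys


def enable_legacy_windows_fs_encoding():
    """Switch back to the pre-3.6 path encoding where that is possible.

    Returns True if the switch was made, False otherwise
    (non-Windows platform, or the function is not available).
    """
    switch = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if sys.platform != "win32" or switch is None:
        return False
    switch()
    return True


# Call this before any file path is handled, e.g. at the top of the script.
changed = enable_legacy_windows_fs_encoding()
print(changed, sys.getfilesystemencoding())
```

On Linux and macOS this does nothing and returns False, so the same script can run unchanged on all platforms.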

I used the solution above and it works. Thanks very much @fotisj!
However, I'm still confused about why DataFrame.to_csv() doesn't hit the same problem. In other words, for a Unicode file path, writing is fine while reading isn't.

import os
import pandas as pd
path = os.path.join(r'E:\语料', 'sina.csv')  # raw string keeps the backslash literal
pd.read_csv(open(path, 'r', encoding='utf8'))

This works.

Can someone with an affected system check if changing this line

https://github.com/pandas-dev/pandas/blob/e8620abc12a4c468a75adb8607fd8e0eb1c472e7/pandas/io/common.py#L209

to

 return _expand_user(os.fsencode(filepath_or_buffer)), None, compression

fixes it?

No, it does not.
Results in: OSError: Expected file path name or file-like object, got <class 'bytes'> type
(on Windows 10)

    OSError                                   Traceback (most recent call last)
    <ipython-input-2-e8247998d6d4> in <module>()
      1 
----> 2 df = pd.read_csv(r'D:\mydata\Dropbox\uni\progrs\test öäau\n\teu.csv', sep='\t')

C:\conda\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

C:\conda\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    447 
    448     # Create the parser.
--> 449     parser = TextFileReader(filepath_or_buffer, **kwds)
    450 
    451     if chunksize or iterator:

C:\conda\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    816             self.options['has_index_names'] = kwds['has_index_names']
    817 
--> 818         self._make_engine(self.engine)
    819 
    820     def close(self):

C:\conda\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1047     def _make_engine(self, engine='c'):
   1048         if engine == 'c':
-> 1049             self._engine = CParserWrapper(self.f, **self.options)
   1050         else:
   1051             if engine == 'python':

C:\conda\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1693         kwds['allow_leading_cols'] = self.index_col is not False
   1694 
-> 1695         self._reader = parsers.TextReader(src, **kwds)
   1696 
   1697         # XXX

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

OSError: Expected file path name or file-like object, got <class 'bytes'> type

Oh, sorry. Does fsdecode work there?


No. Using fsdecode produces the same error we originally had (error_msg.txt: https://github.com/pandas-dev/pandas/files/1691837/error_msg.txt)
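A small sketch of why neither suggestion changed the outcome, using only the standard library: os.fsencode turns a str path into bytes (which read_csv rejected with "got <class 'bytes'> type" in the traceback above), while os.fsdecode returns a str path unchanged, so pandas receives exactly the same argument as before. The file name is just the example from this issue.

```python
import os

path = "test_é.txt"

# fsencode: str -> bytes, encoded with the filesystem encoding.
as_bytes = os.fsencode(path)
print(type(as_bytes).__name__)

# fsdecode: a str input is returned as-is, so this is a no-op here.
as_str = os.fsdecode(path)
print(as_str == path)

# Round trip: decoding the encoded bytes gives the original name back.
print(os.fsdecode(as_bytes) == path)
```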

Ok thanks for trying.


Talked with Steve Dower today, and he suspects this may be the problematic line: https://github.com/pandas-dev/pandas/blob/e8f206d8192b409bc39da1ba1b2c5bcd8b65cc9f/pandas/_libs/src/parser/io.c#L30

IIUC, the Windows filesystem API is expecting those bytes to be in the MBCS, but we're using utf-8.
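To illustrate the suspected mismatch, here is a sketch that compares the UTF-8 bytes of the file name with its bytes in an ANSI code page. The mbcs codec only exists on Windows, so cp1252 (the Western European code page) is used as a stand-in; that choice is an assumption, and the actual code page depends on the system locale.

```python
name = "test_é.txt"

# Bytes Python 3.6+ produces for the path (UTF-8, per PEP 529).
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)   # b'test_\xc3\xa9.txt'

# Bytes a code-page-based API would expect (cp1252 assumed here).
ansi_bytes = name.encode("cp1252")
print(ansi_bytes)   # b'test_\xe9.txt'

# What a code-page-based API "sees" when handed the UTF-8 bytes:
misread = utf8_bytes.decode("cp1252")
print(misread)      # test_Ã©.txt
```

The misread name refers to a file that does not exist, which would be consistent with the "Initializing from file failed" error in the original report.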

A user-level workaround is to explicitly encode your filename as mbcs before passing the bytestring to pandas. https://www.python.org/dev/peps/pep-0529/#explicitly-using-mbcs

pd.read_csv(filename.encode('mbcs'))

Is anyone able to test out that workaround?

This just needs a small change in the parser code to fix it (there was a PR doing this, but it was deleted).

@TomAugspurger that does not work. read_csv expects a str and not a bytes value. It fails with

OSError: Expected file path name or file-like object, got <class 'bytes'> type

Thanks for checking.

Just pinging this - I have the same issue, I'm using a workaround but it would be great if that was not required.

This needs a community patch.

I am encountering this issue. I want to try to contribute a patch. Any pointers on how to start fixing this?

I think none of the maintainers have access to a system that can reproduce this.

Perhaps some of the others in this issue can help put together a solution.

Hi, I have this problem on pandas 1.0.3 now, and the sys._enablelegacywindowsfsencoding() workaround has stopped working. I have ą and ź in the file path.
I also get this error on pandas 0.25.3, but 0.23.4 seems to work fine with the workaround (I didn't check other versions). I'd be happy to provide any additional information.

If your file is stored in a folder that has the same name as the file, move the file out of that folder.
Don't store the file in a folder with the same name.
Then it works.

@pranjulknit If I understand correctly, you suggest moving the file to a folder without these problematic characters in the path. This is not always possible. If you are suggesting that folder names and file names should be different, that is not the issue described here; I never had problems with that.

Actually, I have this problem while reading a CSV file from a Jupyter notebook.
