Pandas: OSError when reading file with accents in file path

Created on 9 Jan 2017 · 27 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

test.txt and test_é.txt are the same file; only the name differs:

pd.read_csv('test.txt')
Out[3]: 
   1 1 1
0  1 1 1
1  1 1 1

pd.read_csv('test_é.txt')
Traceback (most recent call last):

  File "<ipython-input-4-fd67679d1d17>", line 1, in <module>
    pd.read_csv('test_é.txt')

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 389, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 730, in __init__
    self._make_engine(self.engine)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 1390, in __init__
    self._reader = _parser.TextReader(src, **kwds)

  File "pandas\parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4184)

  File "pandas\parser.pyx", line 669, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8471)

OSError: Initializing from file failed

Problem description

Pandas raises an OSError when trying to read a file whose path contains accented characters.

The problem is new: it appeared after I upgraded to Python 3.6 and Pandas 0.19.2.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.9.3
boto: None
pandas_datareader: None

Bug IO CSV Unicode Windows

Source

JGoutin

👍1

Most helpful comment

If anyone else lands here after hitting the same problem, here is a workaround until pandas is fixed to work with PEP 529 (basically, any non-ASCII characters in your path or filename will result in errors):

Insert the following two lines at the beginning of your code to revert to the old way of handling paths on Windows:

import sys
sys._enablelegacywindowsfsencoding()

fotisj on 14 Jan 2018

👍24 ❤4
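
A slightly more defensive form of the two lines above, as a sketch only: `sys._enablelegacywindowsfsencoding()` exists only in Windows builds of Python 3.6 and later, so an unguarded call raises AttributeError on other platforms. The helper name `use_legacy_windows_fs_encoding` is hypothetical; it is not part of pandas or the standard library.

```python
import sys


def use_legacy_windows_fs_encoding():
    """Hypothetical helper: apply the workaround only where it exists.

    Returns True if the legacy (pre-PEP 529) encoding was enabled.
    """
    enable = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if sys.platform == "win32" and enable is not None:
        enable()
        return True
    return False


# Call this before any file is opened, ideally as the first thing in the script.
enabled = use_legacy_windows_fs_encoding()
print("legacy encoding enabled:", enabled)
print("filesystem encoding:", sys.getfilesystemencoding())
```

Per PEP 529, setting the environment variable PYTHONLEGACYWINDOWSFSENCODING before starting Python should have the same effect without touching the code.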

All 27 comments

Just my pennies worth. Quickly tried it out on Mac OSX and Ubuntu with no
problems. See below.

Could this be an environment/platform problem? I noticed that the LOCALE is
set to None.None. Unfortunately I do not have a windows machine to try this
example on. Admittedly this would not explain why you've seen this after the
upgrade to python3.6 and pandas 0.19.2.

Note: I just set up a virtualenv with python3.6 and installed pandas 0.19.2 using pip.

>>> import pandas as pd
>>> pd.read_csv('test_é.txt')
   a  b  c
0  1  2  3
1  4  5  6

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

m-charlton on 9 Jan 2017

I believe 3.6 switches the file system encoding on Windows to UTF-8 (from the ANSI code page, "mbcs"); see PEP 529. Apart from that, we don't have testing enabled yet on Windows for 3.6 (some of the required packages are only now becoming available).

jreback on 9 Jan 2017
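
To see what that switch means in practice, the sketch below prints the encoding Python uses for file names and compares the bytes produced for the reporter's file name under UTF-8 and under cp1252. Here cp1252 is only an assumed stand-in for the Windows ANSI code page on a Western European system; the real code page depends on the machine.

```python
import sys

name = "test_é.txt"

# On Windows, Python 3.6+ reports 'utf-8' here (PEP 529); earlier versions report 'mbcs'.
print(sys.getfilesystemencoding())

utf8_bytes = name.encode("utf-8")
ansi_bytes = name.encode("cp1252")  # assumption: cp1252 stands in for the ANSI code page

print(utf8_bytes)   # b'test_\xc3\xa9.txt'
print(ansi_bytes)   # b'test_\xe9.txt'
print(utf8_bytes == ansi_bytes)  # False: only pure-ASCII names encode identically
```

Any code that passes these bytes to a C-level file API has to agree with that API on which of the two encodings is in use.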

@JGoutin

I just added build support on AppVeyor (Windows) for 3.6, so it would be great if you could push up your tests to see whether it works.

jreback on 9 Jan 2017

I also ran into this problem: the program stopped at pd.read_csv(file_path). As in the original report, it started after I upgraded Python to 3.6 (I'm not sure exactly which version I had installed before, maybe 3.5).

z94624 on 16 Jul 2017

👍4

@jreback what is the next step towards a fix here?
You have mentioned a PR that got 'blown away' - what does it mean?

While I do not use Windows, I could try to help (I just got a VM to debug a piece of my code that apparently does not work on Windows).

BTW, a workaround: pass a file handle instead of a name
pd.read_csv(open('test_é.txt', 'r'))
(there are several workarounds in related issues, but I have not seen this one)

tpietruszka on 23 Aug 2017

👍11
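
A minimal sketch of the file-handle workaround above, using a context manager so the handle is closed after reading. The helper name `read_csv_via_handle` and the default `encoding="utf-8"` are assumptions made for this example; the demo writes its own throwaway file rather than relying on an existing test_é.txt.

```python
import os
import tempfile

import pandas as pd


def read_csv_via_handle(path, encoding="utf-8", **kwargs):
    """Hypothetical helper: let Python open the file, then hand the handle to pandas."""
    # Python's own open() deals with the Unicode path; pandas only sees a file object.
    with open(path, "r", encoding=encoding) as handle:
        return pd.read_csv(handle, **kwargs)


# Demo with a temporary file whose name contains an accent.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "test_é.txt")
    with open(csv_path, "w", encoding="utf-8") as out:
        out.write("a,b,c\n1,2,3\n4,5,6\n")
    df = read_csv_via_handle(csv_path)

print(df)
```

Because pandas never receives the path itself, the failing code path in the C parser is not involved. The trade-off is that the caller has to choose the text encoding explicitly.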

@tpietruszka see the comments on the PR: https://github.com/pandas-dev/pandas/pull/15092 (it got removed from a private fork; it was pretty much there).

You basically need to encode the paths differently on py3.6 (vs. other Pythons) on Windows, i.e. implement: https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8

jreback on 24 Aug 2017

My old code (fails):

import pandas as pd
import os
file_path='./dict/字典.csv'
df_name = pd.read_csv(file_path,sep=',' )

New code (successful):

import pandas as pd
import os
file_path='./dict/dict.csv'
df_name = pd.read_csv(file_path,sep=',' )

I think this bug is a filename problem.
After I changed the filename from Chinese to English, it runs.

dondon2475848 on 29 Aug 2017

👍5 😄1

I used the solution above and it works. Thanks very much, @fotisj!
However, I'm still confused about why DataFrame.to_csv() doesn't hit the same problem. In other words, for a Unicode file path, writing works while reading doesn't.

ColdHumour on 21 Jan 2018

import os
import pandas as pd
path = os.path.join(r'E:\语料', 'sina.csv')
pd.read_csv(open(path, 'r', encoding='utf8'))

This works.

GitOffice on 3 Feb 2018

❤1 👍1

Can someone with an affected system check whether changing this line

https://github.com/pandas-dev/pandas/blob/e8620abc12a4c468a75adb8607fd8e0eb1c472e7/pandas/io/common.py#L209

to

 return _expand_user(os.fsencode(filepath_or_buffer)), None, compression

fixes it?

TomAugspurger on 3 Feb 2018

No, it does not.
Results in: OSError: Expected file path name or file-like object, got <class 'bytes'> type
(on Windows 10)

    OSError                                   Traceback (most recent call last)
    <ipython-input-2-e8247998d6d4> in <module>()
      1 
----> 2 df = pd.read_csv(r'D:\mydata\Dropbox\uni\progrs\test öäau\n\teu.csv', sep='\t')

C:\conda\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

C:\conda\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    447 
    448     # Create the parser.
--> 449     parser = TextFileReader(filepath_or_buffer, **kwds)
    450 
    451     if chunksize or iterator:

C:\conda\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    816             self.options['has_index_names'] = kwds['has_index_names']
    817 
--> 818         self._make_engine(self.engine)
    819 
    820     def close(self):

C:\conda\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1047     def _make_engine(self, engine='c'):
   1048         if engine == 'c':
-> 1049             self._engine = CParserWrapper(self.f, **self.options)
   1050         else:
   1051             if engine == 'python':

C:\conda\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1693         kwds['allow_leading_cols'] = self.index_col is not False
   1694 
-> 1695         self._reader = parsers.TextReader(src, **kwds)
   1696 
   1697         # XXX

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

OSError: Expected file path name or file-like object, got <class 'bytes'> type

fotisj on 3 Feb 2018

Oh, sorry. Does fsdecode work there?

TomAugspurger on 3 Feb 2018

No. Using fsdecode produces the same error we originally had (error_msg.txt: https://github.com/pandas-dev/pandas/files/1691837/error_msg.txt)

fotisj on 3 Feb 2018
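
Both results are consistent with what `os.fsencode` and `os.fsdecode` return: the first yields `bytes`, which `read_csv` rejected in the traceback above, and the second yields the same `str` that was already failing. A small check, using the reporter's file name purely as an example:

```python
import os

path = "test_é.txt"

encoded = os.fsencode(path)     # str -> bytes, using the filesystem encoding
decoded = os.fsdecode(encoded)  # bytes -> str, the inverse operation

print(type(encoded).__name__)   # bytes
print(type(decoded).__name__)   # str
print(decoded == path)          # True: the round trip gives back the original path
```

Neither call changes which bytes eventually reach the C parser, which may be why neither affected the underlying failure.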

Ok thanks for trying.

TomAugspurger on 3 Feb 2018

Talked with Steve Dower today, and he suspects this may be the problematic line: https://github.com/pandas-dev/pandas/blob/e8f206d8192b409bc39da1ba1b2c5bcd8b65cc9f/pandas/_libs/src/parser/io.c#L30

IIUC, the Windows filesystem API is expecting those bytes to be in the MBCS, but we're using utf-8.

A user-level workaround is to explicitly encode your filename as mbcs before passing the bytestring to pandas. https://www.python.org/dev/peps/pep-0529/#explicitly-using-mbcs

pd.read_csv(filename.encode('mbcs'))

Is anyone able to test that workaround?

TomAugspurger on 9 Apr 2018
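
The mismatch described above can be illustrated without touching the file system. In this sketch, cp1252 stands in for the ANSI code page (an assumption; the "mbcs" codec itself is only available on Windows), and `representable` is a hypothetical helper, not a pandas function.

```python
def representable(name, code_page="cp1252"):
    """Hypothetical helper: can `name` be encoded in the given code page at all?"""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


utf8_bytes = "test_é.txt".encode("utf-8")

# An API that expects ANSI bytes but receives UTF-8 bytes sees a different name.
print(utf8_bytes.decode("cp1252"))   # test_Ã©.txt

print(representable("test_é.txt"))   # True
print(representable("字典.csv"))      # False: no cp1252 mapping for these characters
```

So an approach based on encoding to the ANSI code page could only cover names that the local code page can represent; the Chinese file name reported earlier in this thread would not be covered on a cp1252 system.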

This just needs a small change in the parser code to fix it. There was a PR doing this, but it was deleted.

jreback on 9 Apr 2018

@TomAugspurger that does not work. read_csv expects a str and not a bytes value. It fails with

OSError: Expected file path name or file-like object, got <class 'bytes'> type

jdferreira on 20 Apr 2018

Thanks for checking.

TomAugspurger on 21 Apr 2018

Just pinging this - I have the same issue, I'm using a workaround but it would be great if that was not required.

mmagnuski on 21 Nov 2018

this needs a community patch

jreback on 21 Nov 2018

I am encountering this issue. I want to try to contribute a patch. Any pointers on how to start fixing this?

kchawla-pi on 10 Dec 2018

I think none of the maintainers have access to a system that can reproduce this.

Perhaps some of the others in this issue can help put together a solution.

TomAugspurger on 10 Dec 2018

Hi, I have this problem on pandas 1.0.3 now, and the sys._enablelegacywindowsfsencoding() workaround has stopped working. I have ą and ź in the file path.
I also get this error on pandas 0.25.3, but 0.23.4 seems to work fine with the workaround (I didn't check other versions). I'd be happy to provide any additional information.

mmagnuski on 26 Apr 2020

If your file is stored in a folder that has the same name as the file, move the file out of that folder.
Don't store the file in a folder with the same name.
Then it works.

pranjulknit on 19 Oct 2020

@pranjulknit If I understand correctly, you suggest moving the file to a folder without these problematic characters in the path. This is not always possible. If you are suggesting that folder names and file names should be different, that is not the issue described here; I never had problems with that.

mmagnuski on 19 Oct 2020

Actually, I have this problem when reading a CSV file from a Jupyter notebook.

pranjulknit on 19 Oct 2020
