Pandas: read_csv C-engine CParserError: Error tokenizing data

Created on 22 Sep 2015 · 15 Comments · Source: pandas-dev/pandas

Hi,

I have encountered a dataset where the C-engine read_csv has problems. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')

I get the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you read the CSV using the Python engine instead, no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

This suggests that the issue is with read_csv and not to_csv. The versions I am using are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
Labels: Bug, IO CSV, Needs Info


All 15 comments

Your second-to-last line includes a stray '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
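
On Python 3, where the 'rU' mode is deprecated, an equivalent workaround (a rough sketch, with 'test.csv' as a placeholder name) is to normalize the line endings yourself before handing the text to the parser:

import io
import pandas as pd

# Read the raw text without newline translation, then normalize every
# line ending to '\n' -- this is what universal-newline mode ('rU') does.
with open('test.csv', encoding='utf-8', newline='') as f:
    text = f.read().replace('\r\n', '\n').replace('\r', '\n')

df = pd.read_csv(io.StringIO(text), engine='c')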

I'm encountering this error as well. Using the method suggested by @chris-b1 causes the following error:

Traceback (most recent call last):
  File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
    encoding='utf-8', engine='c')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
  File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

+1

I have also found this issue when reading a large CSV file with the default engine. If I use engine='python' then it works fine.

I missed @alfonsomhc's answer because it just looked like a comment.

You need

df = pd.read_csv('test.csv', engine='python')

Had the same issue when trying to read a folder, not a CSV file.

Has anyone investigated this issue? It's killing performance when using read_csv in a Keras generator.

The original data provided is no longer available, so the issue is not reproducible. Closing, as it's not clear what the issue is, but @dgrahn or anyone else, if you can provide a reproducible example we can reopen.

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's the code that triggers the exception.

import pandas

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

Here's the file: debug.txt

Here's the exception from Windows 10, using Anaconda.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

And the same on Red Hat.

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

@dgrahn I have downloaded debug.txt and I get the following when running pd.read_csv('debug.txt', header=None) on a Mac:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

This is different from the "Buffer overflow caught" error originally described.

I have inspected the debug.txt file: the first two lines have 204 columns, but the 3rd line has 2504. Without explicit column names, the parser infers the expected field count from the first line, so a wider row makes the file unparsable and explains why an error is thrown.
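
For reference, a quick way to check a file for ragged rows like this (a rough sketch that assumes plain comma-separated lines with no quoted fields):

from collections import Counter

# Tally how many fields each line has; more than one distinct width
# means the file is ragged.
with open('debug.txt') as f:
    widths = Counter(line.rstrip('\n').count(',') + 1 for line in f)

print(widths.most_common())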

Is this expected? GitHub could be doing some implicit conversion in the background between newline types ("\r\n" and "\n") that is messing up the uploaded example.
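
A rough way to check that hypothesis is to count the raw line terminators in the downloaded file:

# Counts of Windows-style ('\r\n') vs Unix-style ('\n') line endings.
with open('debug.txt', 'rb') as f:
    data = f.read()

crlf = data.count(b'\r\n')
print(crlf, data.count(b'\n') - crlf)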

@joshlk Did you use the names=range(2504) option as described in the comment above?

@dgrahn good point.

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

It's worth noting that pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is unlikely to be related to the original bug, but it produces the same symptom.
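
To summarize the two behaviors (a sketch against the same debug.csv; the engine='python' variant follows the workaround suggested earlier in the thread):

import pandas as pd

# Whole-file read with explicit names: parses fine.
df = pd.read_csv('debug.csv', names=range(2504))

# The same chunked read raises "Buffer overflow caught" on the default
# C engine, while the python engine handles it.
for chunk in pd.read_csv('debug.csv', chunksize=1000, names=range(2504),
                         engine='python'):
    pass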

@joshlk I could open a separate issue if that would be preferred.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')

Solved my problem.

engine='python'

I tried this approach and was able to load large data files. But when I checked the dimensions of the dataframe, I saw that the number of rows had increased. What could be the logical reason for that?
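
One possible explanation, given the stray '\r' discussed earlier in this thread, is that the two engines split rows differently on embedded carriage returns. A quick check (with 'test.csv' standing in for your file, and assuming the C engine can read it at all):

import pandas as pd

# Compare row counts between the two engines; a mismatch suggests some
# line endings are treated as row breaks by one engine but not the other.
df_py = pd.read_csv('test.csv', encoding='utf-8', engine='python')
df_c = pd.read_csv('test.csv', encoding='utf-8', engine='c')
print(len(df_py), len(df_c))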
