Pandas: read_csv "CParserError: Error tokenizing data" with variable number of fields

Created on 31 Oct 2015 · 17 Comments · Source: pandas-dev/pandas

I am having trouble with read_csv (Pandas 0.17.0) when trying to read a 380+ MB csv file. The file starts with 54 fields per line, but some lines have 53 fields instead of 54. Running the code below gives me the following error:

from datetime import datetime

import pandas as pd

# Parse the seven space-joined date/time fields (year through hundredths).
parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42),
                 parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
                 date_parser=parser,
                 header=None)

Error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

If I pass the error_bad_lines=False keyword, problematic lines are displayed similar to the example below:

Skipping line 1683401: expected 53 fields, saw 54

However, I get the following error this time (and the DataFrame does not get loaded):

CParserError: Too many columns specified: expected 54 and found 53

If I pass the engine='python' keyword, I do not get any errors, but it takes a really long time to parse the data. Please note that 53 and 54 are switched in the error messages depending on whether error_bad_lines=False is used.
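
For reference, here is a minimal sketch of what the date parser in the question does with one record, using only the standard library. It assumes (as the format string in the question implies) that the seven date fields reach the parser joined by single spaces; the field values are taken from the first sample line posted further down the thread.

```python
from datetime import datetime

# First seven fields of the first data line: YR, MO, DAY, HR, MIN, SEC, HUND.
fields = ['08', '8', '7', '5', '0', '12', '54']

# Join with spaces so the string matches the format used in the question.
stamp = datetime.strptime(' '.join(fields), '%y %m %d %H %M %S %f')
print(stamp)
```

One thing to watch: %f pads short values with zeros on the right, so a HUND value written as '5' would be read as 0.5 s, not 0.05 s. If the file does not zero-pad hundredths (not confirmed by the samples in this thread), padding the field to two digits first would be safer.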

IO CSV Usage Question

ekinsenturk

All 17 comments

These errors are all correct. You are constraining what the parser is doing by passing usecols and names. Don't do this and see if you can parse it.

It is very hard to diagnose something like this without a sample of the file that reproduces it.

Also show pd.show_versions().

jreback on 31 Oct 2015

With the original data file:

pd.read_csv(filename) with no other keywords seems to parse the data with no errors. pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Totally agreed that it is very hard to diagnose without the sample data. I tried reproducing the error with a csv file of a few lines (some with 53 fields, some with 54), and pd.read_csv fills the gaps with NaNs as expected. I repeated this while passing usecols and header=None, and it still works. It seems the original file has some kind of issue that raises all the errors.

pd.show_versions() output as follows:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

ekinsenturk on 31 Oct 2015

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

This is expected, as the number of columns is inferred from the first line. If you pass names, it will use those to determine the number of columns instead.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

jreback on 31 Oct 2015

👍2
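
A minimal sketch of the point above, assuming a recent pandas: if the parser is given one name per possible field, rows with fewer fields are padded with NaN instead of raising. The helper name read_ragged_csv and the sample text are made up for illustration; for a real file you would open the path twice instead of wrapping a string.

```python
import csv
import io

import pandas as pd


def read_ragged_csv(text):
    """Read CSV text whose rows have differing field counts (hypothetical helper)."""
    # First pass: find the widest row with the standard-library csv reader.
    max_fields = max(len(row) for row in csv.reader(io.StringIO(text)))
    # Second pass: one name per possible field, so the column count no longer
    # depends on how many fields the first line happens to have.
    return pd.read_csv(io.StringIO(text), header=None, names=range(max_fields))


sample = "1,2,3\n4,5,6,7\n8,9\n"
df = read_ragged_csv(sample)
print(df)
```

The extra pass over the file costs time on a 380 MB input, but it keeps the fast C engine for the actual parse.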

If engine='python' is used, curiously, it loads the DataFrame without any hiccups though. I used the following snippet to extract the first 3 lines in the file and 3 of the offending lines (got the line numbers from the error message).

from csv import reader
N = int(input('What line do you need? > '))
with open(filename) as f:
    print(next((x for i, x in enumerate(reader(f)) if i == N), None))

Lines 1-3:

['08', '8', '7', '5', '0', '12', '54', '0', '11', '1', '58', '9', '68', '48.2', '0.756', '11.6', '17.5', '13.3', '4.3', '11.3', '32.2', '6.4', '4.1', '5.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '15', '80', '0', '11', '1', '62', '9', '69', '77.8', '3.267', '11.2', '17.7', '14.8', '4.2', '15.2', '29.1', '18.4', '10.0', '18.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '21', '52', '0', '11', '1', '61', '11', '51', '29.4', '0.076', '4.1', '13.8', '8.3', '21.5', '5.3', '3.1', '5.7', '3.0', '6.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

Offending lines:

['09', '9', '15', '22', '46', '9', '51', '0', '11', '1', '57', '9', '70', '36.3', '0.242', '11.8', '16.2', '6.4', '4.1', '5.8', '31.3', '5.5', '3.9', '6.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '25', '31', '0', '11', '1', '70', '9', '73', '67.8', '2.196', '10.4', '17.0', '13.4', '4.4', '12.2', '31.8', '15.6', '4.2', '16.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '28', '41', '0', '11', '1', '70', '5', '22', '7.4', '0.003', '4.0', '13.1', '3.4', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

As you suggested, I will try to read the file, then modify the DataFrame (rename columns, delete unnecessary ones etc.) or simply use the python engine (long processing time).

ekinsenturk on 31 Oct 2015

👍2 🎉1
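
When the error message only names one offending line, it can help to tally field counts over the whole file first. The following is a rough sketch using only the standard library; field_count_report is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def field_count_report(lines):
    """Return (how many rows have each field count, first row number per count)."""
    counts = Counter()
    first_seen = {}
    for row_no, row in enumerate(csv.reader(lines), start=1):
        n = len(row)
        counts[n] += 1
        # Remember only the first row on which each field count appears.
        first_seen.setdefault(n, row_no)
    return counts, first_seen


sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
counts, first_seen = field_count_report(sample)
print(counts)
print(first_seen)
```

For a file on disk, pass the object returned by open(filename, newline='') instead of the StringIO. Row numbers here count parsed records, which can differ from physical line numbers if quoted fields contain newlines.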

After further investigation, the following sequence of commands works (I lose the first line of the data because header=None is not passed, but at least it loads):

df = pd.read_csv(filename,
                 usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
              'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
              'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
              'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
              'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
              'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following does NOT work:

df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work:

df = pd.read_csv(filename,
                 header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

ekinsenturk on 31 Oct 2015

👍9
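
One possible way to keep the first data line while still following the "read it in, then rename" advice is to read with header=None and placeholder names wide enough for every row, then slice and rename afterwards. This is only a sketch on a made-up five-field miniature of the file; the width 5 and the three column names stand in for 54 and the 42 names used above.

```python
import io

import pandas as pd

# Hypothetical miniature: at most 5 fields per row, only the first 3 are wanted.
sample = "08,8,7,x,y\n08,8,8,x\n09,9,15,x,y\n"
wanted = ['YR', 'MO', 'DAY']

# Placeholder integer names cover the widest row, so nothing is inferred
# from the first line and no row is consumed as a header.
raw = pd.read_csv(io.StringIO(sample), header=None, names=range(5))

# Keep the leading columns and give them their real names.
df = raw.iloc[:, :len(wanted)].copy()
df.columns = wanted
print(df)
```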

closing as usage question.

jreback on 1 Nov 2015

dic_df=_create_init_dic("C:/Users/swati/Downloads/VQA-Med 2018 Dataset/c5e905f7-6eb0-4a98-b284-da0729a1caf3_VQAMed2018Train/VQAMed2018Train/VQAMed2018Train-QA.csv")
ParserError: Error tokenizing data. C error: Expected 1 fields in line 33, saw 3

swaeety on 16 Jan 2019

Try this:

  df = pd.read_csv(filename,header=None,error_bad_lines=False)

antonyj453 on 11 Mar 2019

👍16
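
Note for readers on newer pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines takes its place. A small sketch of the equivalent call, on made-up sample data:

```python
import io

import pandas as pd

# The second line has one field more than the first, so it counts as "bad".
sample = "1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines='skip' drops such lines; 'warn' drops them and also reports them.
df = pd.read_csv(io.StringIO(sample), header=None, on_bad_lines='skip')
print(df)
```

Keep in mind that skipped lines are dropped from the result, so this trades the error for missing rows.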

Tried it and worked for a similar error thank you!

rahlouni on 5 Dec 2019

Try adding quoting=3 (that is, csv.QUOTE_NONE) when reading.

peter-wang-wsl on 31 Dec 2019
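
A sketch of why this can help, assuming the mismatch comes from a stray quote character in the data (the sample below is made up): with quoting disabled, a lone double quote is treated as ordinary text instead of opening a quoted field that swallows the following delimiters.

```python
import csv
import io

import pandas as pd

# The second line contains an unbalanced double quote.
sample = 'a,b,c\n1,"2,3\n4,5,6\n'

# csv.QUOTE_NONE is the named constant behind quoting=3.
df = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)
print(df)
```

The quote character stays in the data (here as part of the value in column b), so some cleanup afterwards may still be needed.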

Is there a way for pandas to just ignore the extra fields in any row that has more fields than expected?
For example, in the case of "Expected 53 fields in line 1605634, saw 54",
it would just drop field 54 in line 1605634.

amm123 on 7 Feb 2020
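
In pandas 1.4 and later, one way to do this is to pass a callable as on_bad_lines, which is only supported with engine='python'. The callable receives the fields of the offending row and returns the fields to keep. A minimal sketch on made-up data, where N_FIELDS is an assumed expected width:

```python
import io

import pandas as pd

N_FIELDS = 3  # assumed expected number of fields per row

# The last line carries one extra field.
sample = "1,2,3\n4,5,6\n7,8,9,10\n"

df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    engine='python',
    # Keep the first N_FIELDS fields of any row that is too long.
    on_bad_lines=lambda fields: fields[:N_FIELDS],
)
print(df)
```

The python engine is slower than the C engine, which matters for a file of the size discussed in this thread.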

Another case, but resolved with error_bad_lines=False. It still prints the error, but the process ends with exit code 0.

manel00 on 10 Feb 2020

I got the same error. I just added the delimiter parameter to the read_csv call and it worked.

svnsatyasai on 18 Mar 2020

👍2
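
A sketch of how a wrong delimiter produces this kind of message, on made-up semicolon-separated data: with the default comma separator every line looks like a single field until a comma shows up inside the text.

```python
import io

import pandas as pd

# Semicolon-separated data whose last line contains commas inside a value.
sample = "id;text\n1;hello\n2;a,b,c\n"

try:
    pd.read_csv(io.StringIO(sample))  # default sep=',' sees 1 field, then 3
    default_failed = False
except pd.errors.ParserError as exc:
    default_failed = True
    print(exc)

# Passing the actual delimiter gives two columns on every line.
df = pd.read_csv(io.StringIO(sample), sep=';')
print(df)
```

An error of the form "Expected 1 fields in line N, saw 3", like the one reported earlier in this thread, is often a hint to check the delimiter first.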

error_bad_lines=False
it works

monti777777 on 6 Apr 2020

👍2

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
This is expected, as the number of columns is inferred from the first line. If you pass names, it will use those to determine the number of columns instead.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

This works! I wrote the csv with R and tried to read it in Python. The first line should have the maximum number of fields of any line in the file. That fixes the bad-lines problem without losing any lines.