Pandas: read_csv "CParserError: Error tokenizing data" with variable number of fields

Created on 31 Oct 2015  ·  17Comments  ·  Source: pandas-dev/pandas

I am having trouble with read_csv (Pandas 0.17.0) when trying to read a 380+ MB csv file. The file starts with 54 fields but some lines have 53 fields instead of 54. Running the below code gives me the following error:

parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
                         names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                                'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                                'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                                'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                                'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                                'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                        usecols=range(0, 42),
                        parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
                        date_parser=parser,
                        header=None)

Error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

If I pass the error_bad_lines=False keyword, problematic lines are displayed similar to the example below:

Skipping line 1683401: expected 53 fields, saw 54

however I get the following error this time ( also the DataFrame does not get loaded):

CParserError: Too many columns specified: expected 54 and found 53

If I pass the engine='python' keyword, I do not get any errors, but it takes a really long time to parse the data. Please note that 53 and 54 are switched in the error messages depending on if error_bad_lines=False is used or not.

IO CSV Usage Question

Most helpful comment

Try this:

  df = pd.read_csv(filename,header=None,error_bad_lines=False)

All 17 comments

these errors are all correct. you are constraining what the parser is doing by passing usecols, and names. don't do this and see if you can parse it.

very hard to diagnose something like this without a sample of the file that reproduces.

also show pd.show_versions()

With the original data file:

pd.read_csv(filename) with no other keywords seems to parse the data with no errors. pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Totally agreed that it is very hard to diagnose without the sample data. I tried generating the error with a csv file with a few lines (some has 53 fields, some 54), pd.read_csv fills the gaps with NaNs as expected. I repeated by passing usecols and header=None, still works. It seems the original file has some kind of an issue that raises all the errors.

pd.show_versions() output as follows:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

is expected as the number of columns is inferred from the first line. If you pass names if will use this as a determining feature.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

If engine='python' is used, curiously, it loads the DataFrame without any hiccups though. I used the following snippet to extract the first 3 lines in the file and 3 of the offending lines (got the line numbers from the error message).

from csv import reader
N = int(input('What line do you need? > '))
with open(filename) as f:
    print(next((x for i, x in enumerate(reader(f)) if i == N), None))

Lines 1-3:

['08', '8', '7', '5', '0', '12', '54', '0', '11', '1', '58', '9', '68', '48.2', '0.756', '11.6', '17.5', '13.3', '4.3', '11.3', '32.2', '6.4', '4.1', '5.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '15', '80', '0', '11', '1', '62', '9', '69', '77.8', '3.267', '11.2', '17.7', '14.8', '4.2', '15.2', '29.1', '18.4', '10.0', '18.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '21', '52', '0', '11', '1', '61', '11', '51', '29.4', '0.076', '4.1', '13.8', '8.3', '21.5', '5.3', '3.1', '5.7', '3.0', '6.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

Offending lines:

['09', '9', '15', '22', '46', '9', '51', '0', '11', '1', '57', '9', '70', '36.3', '0.242', '11.8', '16.2', '6.4', '4.1', '5.8', '31.3', '5.5', '3.9', '6.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '25', '31', '0', '11', '1', '70', '9', '73', '67.8', '2.196', '10.4', '17.0', '13.4', '4.4', '12.2', '31.8', '15.6', '4.2', '16.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '28', '41', '0', '11', '1', '70', '5', '22', '7.4', '0.003', '4.0', '13.1', '3.4', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

As you suggested, I will try to read the file, then modify the DataFrame (rename columns, delete unnecessary ones etc.) or simply use the python engine (long processing time).

Per further investigation, following sequence of commands works (I lose the first line of the data -no header=None present-, but at least it loads):

df = pd.read_csv(filename, 
                 usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

Following does NOT work:

df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Following does NOT work:

df = pd.read_csv(filename,
                 header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

closing as usage question.

dic_df=_create_init_dic("C:/Users/swati/Downloads/VQA-Med 2018 Dataset/c5e905f7-6eb0-4a98-b284-da0729a1caf3_VQAMed2018Train/VQAMed2018Train/VQAMed2018Train-QA.csv")
ParserError: Error tokenizing data. C error: Expected 1 fields in line 33, saw 3

Try this:

  df = pd.read_csv(filename,header=None,error_bad_lines=False)

Tried it and worked for a similar error thank you!

try add quoting=3 when read

is there a way for pandas to just ignore the extra fields in any row that has more fields ?
for example in case "Expected 53 fields in line 1605634, saw 54"
it just drop field 54 in line 1605634

Another case! but resolved with "error_bad_lines=False", it still prints the error but 'exit code 0'

I got the same error

I just add the delimiter parameter in read_csv mode

and it worked

error_bad_lines=False
it works

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

is expected as the number of columns is inferred from the first line. If you pass names if will use this as a determining feature.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

This works! I write csv using R language and try to read it in python. The first line should have the maximum length for all lines. This way will fix the problem of bad lines and will not lose any lines.

If you will try to import .xlsx file using the command pd.read_csv, you will get this error.

Try using pd.read_excel instead of pd.read_csv

The easiest way to fix it is to transform your CSV file to Excel file and use pd.read_excel instead of pd.read_csv for read the data

Was this page helpful?
0 / 5 - 0 ratings