Pandas: read_csv "CParserError: Error tokenizing data" with a variable number of fields

Created on 31 Oct 2015 · 17 comments · Source: pandas-dev/pandas

I have an issue with read_csv (Pandas 0.17.0) when trying to read a 380+ MB csv file. The file starts with 54 fields, but some lines have 53 fields instead of 54. Running the code below gives me the following error.

from datetime import datetime
import pandas as pd

# The first seven columns are joined into one string and parsed as a timestamp.
parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42),
                 parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
                 date_parser=parser,
                 header=None)

Error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

If I pass the error_bad_lines=False keyword, the problematic lines are reported, similar to the example below:

Skipping line 1683401: expected 53 fields, saw 54

However, this time I get the following error (and the DataFrame is still not loaded):

CParserError: Too many columns specified: expected 54 and found 53

If I pass the engine='python' keyword, I don't get any errors, but parsing the data takes a very long time. Note that 53 and 54 are swapped in the error messages depending on whether error_bad_lines=False is used.

IO CSV Usage Question

All 17 comments

이 였λ₯˜λŠ” λͺ¨λ‘ μ •ν™•ν•©λ‹ˆλ‹€. usecols 및 names μ „λ‹¬ν•˜μ—¬ νŒŒμ„œκ°€ μˆ˜ν–‰ν•˜λŠ” μž‘μ—…μ„ μ œν•œν•©λ‹ˆλ‹€. 이 μž‘μ—…μ„ μˆ˜ν–‰ν•˜μ§€ 말고 νŒŒμ‹± ν•  수 μžˆλŠ”μ§€ ν™•μΈν•˜μ‹­μ‹œμ˜€.

μž¬ν˜„ν•˜λŠ” 파일의 μƒ˜ν”Œ μ—†μ΄λŠ” 이와 같은 것을 μ§„λ‹¨ν•˜κΈ°κ°€ 맀우 μ–΄λ ΅μŠ΅λ‹ˆλ‹€.

pd.show_versions() 도 ν‘œμ‹œ

원본 데이터 파일 μ‚¬μš© :

λ‹€λ₯Έ ν‚€μ›Œλ“œκ°€μ—†λŠ” pd.read_csv(filename) λŠ” 였λ₯˜μ—†μ΄ 데이터λ₯Ό ꡬ문 λΆ„μ„ν•˜λŠ” 것 κ°™μŠ΅λ‹ˆλ‹€. pd.read_csv(filename, header=None) μ—μ„œ λ‹€μŒ 였λ₯˜κ°€ λ°œμƒν•©λ‹ˆλ‹€.

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Totally agreed that it's very hard to diagnose without sample data. I tried to reproduce the error with a csv file of a few lines (some with 53 fields, some with 54), but pd.read_csv fills the gaps with NaN, as expected. I repeated the test passing usecols and header=None, and it still works. It seems the original file has an issue that triggers all these errors.

The output of pd.show_versions() is below:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

This is expected, as the number of columns is inferred from the first line. If you pass names, it will be used as the determining feature.

So keep trying various options. You are actually constraining it a bit too much with names and usecols. You may be better off reading the file in and then reindexing to what you need.
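
The inference rule described above can be seen on a toy input. This is a minimal sketch, not the original file: the two-line data and the column names a, b, c are made up for illustration, and pd.errors.ParserError is the exception name in recent pandas releases (the 0.17.0 traceback in this thread shows it as CParserError).

```python
import io
import pandas as pd

# Made-up toy data: the first row has 2 fields, the second has 3.
raw = "1,2\n3,4,5\n"

# Without names, the width is inferred from the first row,
# so the wider second row is rejected by the C parser.
try:
    pd.read_csv(io.StringIO(raw), header=None)
    failed = False
except pd.errors.ParserError as exc:
    failed = True
    print(exc)

# With names covering the widest row, the short row is padded with NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=["a", "b", "c"])
print(df)
```

Whether this helps with a particular file depends on knowing the widest row in advance, which is what the names list has to cover.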

Using engine='python' strangely loads the DataFrame without any hiccups. I used the following snippet to extract the first 3 lines of the file and 3 of the offending lines (I got the line numbers from the error messages).

from csv import reader
N = int(input('What line do you need? > '))
with open(filename) as f:
    print(next((x for i, x in enumerate(reader(f)) if i == N), None))

Lines 1-3:

['08', '8', '7', '5', '0', '12', '54', '0', '11', '1', '58', '9', '68', '48.2', '0.756', '11.6', '17.5', '13.3', '4.3', '11.3', '32.2', '6.4', '4.1', '5.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '15', '80', '0', '11', '1', '62', '9', '69', '77.8', '3.267', '11.2', '17.7', '14.8', '4.2', '15.2', '29.1', '18.4', '10.0', '18.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '21', '52', '0', '11', '1', '61', '11', '51', '29.4', '0.076', '4.1', '13.8', '8.3', '21.5', '5.3', '3.1', '5.7', '3.0', '6.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

Offending lines:

['09', '9', '15', '22', '46', '9', '51', '0', '11', '1', '57', '9', '70', '36.3', '0.242', '11.8', '16.2', '6.4', '4.1', '5.8', '31.3', '5.5', '3.9', '6.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '25', '31', '0', '11', '1', '70', '9', '73', '67.8', '2.196', '10.4', '17.0', '13.4', '4.4', '12.2', '31.8', '15.6', '4.2', '16.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '28', '41', '0', '11', '1', '70', '5', '22', '7.4', '0.003', '4.0', '13.1', '3.4', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
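
Inspecting lines one at a time, as above, can be generalized into a single pass that tallies how many fields each line has. The following is only a sketch using the standard csv module; the helper name field_count_report and the sample data are hypothetical, not something from this thread.

```python
import csv
import io
from collections import Counter

def field_count_report(lines, max_examples=3):
    """Return (Counter of row widths, {width: first few line numbers})."""
    widths = Counter()
    examples = {}
    reader = csv.reader(lines)
    for row in reader:
        n = len(row)
        widths[n] += 1
        seen = examples.setdefault(n, [])
        if len(seen) < max_examples:
            # line_num counts physical lines read so far (1-based).
            seen.append(reader.line_num)
    return widths, examples

# Made-up sample: three rows with 3 fields and one row with 2.
sample = io.StringIO("a,b,c\n1,2,3\n4,5\n6,7,8\n")
widths, examples = field_count_report(sample)
print(widths)
print(examples)
```

For a real file, the same function could be called as field_count_report(f) inside with open(filename, newline="") as f. A width that appears only a handful of times usually points at the lines worth looking at.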

I'll either read the file and then modify the DataFrame as you suggested (rename columns, delete unnecessary ones, etc.) or simply use the python engine (long processing time).

After further investigation, the following sequence of commands works (it loses the first line of data, because header=None is not passed, but at least it loads):

df = pd.read_csv(filename,
                 usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
              'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
              'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
              'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
              'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
              'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following DOES NOT work:

df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following DOES NOT work:

df = pd.read_csv(filename,
                 header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Closing as a usage question.

dic_df = _create_init_dic("C:/Users/swati/Downloads/VQA-Med 2018 Dataset/c5e905f7-6eb0-4a98-b284-da0729a1caf3_VQAMed2018Train/VQAMed2018Train/VQAMed2018Train-QA.csv")
ParserError: Error tokenizing data. C error: Expected 1 fields in line 33, saw 3

이 μ‹œλ„:

  df = pd.read_csv(filename,header=None,error_bad_lines=False)

그것을 μ‹œλ„ν•˜κ³  λΉ„μŠ·ν•œ 였λ₯˜λ₯Ό μœ„ν•΄ μΌν–ˆμŠ΅λ‹ˆλ‹€. κ°μ‚¬ν•©λ‹ˆλ‹€!

읽을 λ•Œ quoting=3 μΆ”κ°€ μ‹œλ„

νŒ¬λ”κ°€ 더 λ§Žμ€ ν•„λ“œκ°€μžˆλŠ” ν–‰μ˜ μΆ”κ°€ ν•„λ“œλ₯Ό λ¬΄μ‹œν•˜λŠ” 방법이 μžˆμŠ΅λ‹ˆκΉŒ?
예 : "1605634 행에 53 개의 ν•„λ“œκ°€ μžˆμ–΄μ•Όν•©λ‹ˆλ‹€. 54 개λ₯Ό λ³΄μ•˜λ‹€"
1605634 쀄에 ν•„λ“œ 54λ₯Ό λ“œλ‘­ν•©λ‹ˆλ‹€.
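
One way to approach the question above without relying on any particular pandas version is to trim the rows before pandas sees them. This is a rough sketch with the standard csv module; the helper name fit_rows, the fill value, and the sample data are assumptions made for illustration.

```python
import csv
import io

def fit_rows(lines, width, fill=""):
    """Yield each CSV row truncated or padded to exactly `width` fields."""
    for row in csv.reader(lines):
        # A negative repeat count gives an empty list, so long rows are only cut.
        yield row[:width] + [fill] * (width - len(row))

# Made-up sample: one row too long, one row too short.
sample = io.StringIO("1,2,3\n4,5,6,7\n8,9\n")
rows = list(fit_rows(sample, width=3))
print(rows)
```

The resulting rows could then be handed to pd.DataFrame(rows, columns=...); all values are still strings at that point and would need converting. Recent pandas releases also document a callable form of on_bad_lines for the Python engine, so it is worth checking the read_csv documentation for the installed version.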

또 λ‹€λ₯Έ 경우! κ·ΈλŸ¬λ‚˜ "error_bad_lines = False"둜 ν•΄κ²°λ˜μ—ˆμ§€λ§Œ μ—¬μ „νžˆ 였λ₯˜λ₯Ό μΈμ‡„ν•˜μ§€λ§Œ 'μ’…λ£Œ μ½”λ“œ 0'

같은 였λ₯˜κ°€ λ°œμƒν–ˆμŠ΅λ‹ˆλ‹€

read_csv λͺ¨λ“œμ—μ„œ κ΅¬λΆ„μž 맀개 λ³€μˆ˜λ₯Ό μΆ”κ°€ν–ˆμŠ΅λ‹ˆλ‹€.

그리고 그것은 μΌν–ˆλ‹€

error_bad_lines=False
works.

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

This is expected, as the number of columns is inferred from the first line. If you pass names, it will be used as the determining feature.

So keep trying various options. You are actually constraining it a bit too much with names and usecols. You may be better off reading the file in and then reindexing to what you need.

It works! I'm writing the csv with R and trying to read it with Python. The first line should have the maximum length of all lines. This solves the bad-lines problem without losing any lines.

I get this error when I try to import a .xlsx file with the pd.read_csv command.

Try pd.read_excel instead of pd.read_csv.

The easiest way to fix this is to convert the CSV file to an Excel file and use pd.read_excel instead of pd.read_csv to read the data.

이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰