Pandas: read_csv C-engine CParserError: Error tokenizing data

Created on 22 Sep 2015  ·  15 comments  ·  Source: pandas-dev/pandas

Hi,

I have encountered a dataset where the C-engine read_csv has problems. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

import pandas as pd

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')  # default C engine

I get the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you try to read the CSV using the Python engine, no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')
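
A minimal sketch of the comparison described above: write a frame with to_csv, then read the same file back with both engines. The helper name read_with_both_engines and the sample data are illustrative assumptions, since the original faulty_row.pkl is not reproduced here. On well-formed data like this, the two engines are expected to agree.

```python
import os
import tempfile

import pandas as pd


def read_with_both_engines(path, **kwargs):
    """Read the same CSV with the C and Python engines and return both frames."""
    c_frame = pd.read_csv(path, engine="c", **kwargs)
    py_frame = pd.read_csv(path, engine="python", **kwargs)
    return c_frame, py_frame


# Illustrative data only; this is not the faulty row from the report.
sample = pd.DataFrame({"id": [1, 2, 3], "name": ["alpha", "béta", "gamma"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    sample.to_csv(csv_path, encoding="utf8", index=False)
    c_frame, py_frame = read_with_both_engines(csv_path, encoding="utf8")

# True when both engines produced identical frames.
engines_agree = c_frame.equals(py_frame)
print(engines_agree)
```

If one engine raises while the other succeeds on the same file, that points at the reader rather than the writer, which is the reasoning used in this report.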

This suggests that the issue is with read_csv and not to_csv. The versions I am using are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2

Labels: Bug, IO CSV, Needs Info

All 15 comments

두 λ²ˆμ§Έμ—μ„œ λ§ˆμ§€λ§‰ μ€„μ—λŠ” '\r' νœ΄μ‹μ΄ ν¬ν•¨λ©λ‹ˆλ‹€. 버그라고 μƒκ°ν•˜μ§€λ§Œ ν•œ 가지 ν•΄κ²° 방법은 λ²”μš© κ°œν–‰ λͺ¨λ“œλ‘œ μ—¬λŠ” κ²ƒμž…λ‹ˆλ‹€.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
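
The 'rU' mode above is Python 2 era syntax; the 'U' flag was deprecated in Python 3 and removed in Python 3.11. In Python 3, text mode with newline=None (the default) already applies universal newline translation. A small sketch, using a made-up three-line file with a stray '\r':

```python
import os
import tempfile

# Made-up content: a header, then two records separated by a bare '\r'.
raw = b"a,b\r\n1,2\r3,4\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # newline=None translates '\r\n' and bare '\r' to '\n' on read.
    with open(path, "r", encoding="utf-8", newline=None) as handle:
        text = handle.read()

lines = text.split("\n")
print(lines)
```

A handle opened this way can be passed to pd.read_csv in place of a path, as in the call above.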

이 였λ₯˜λ„ λ°œμƒν•©λ‹ˆλ‹€. @ chris-b1μ—μ„œ μ œμ•ˆν•œ 방법을 μ‚¬μš©ν•˜λ©΄ λ‹€μŒ 였λ₯˜κ°€ λ°œμƒν•©λ‹ˆλ‹€.

Traceback (most recent call last):
  File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
    encoding='utf-8', engine='c')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
  File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

+1

I have also found this issue when reading a large CSV file with the default engine. If I use engine='python', it works fine.

I missed @alfonsomhc's answer because it just looked like a comment.

You need:

df = pd.read_csv('test.csv', engine='python')
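
For code that should prefer the faster C engine and only switch when tokenizing fails, the workaround can be wrapped in a small helper. This is a sketch under assumptions: read_csv_with_fallback is a hypothetical name, and it assumes a pandas version that exposes the exception as pandas.errors.ParserError, as in the later tracebacks in this thread.

```python
import os
import tempfile

import pandas as pd
from pandas.errors import ParserError


def read_csv_with_fallback(path, **kwargs):
    """Try the C engine first; fall back to the Python engine on ParserError.

    Returns the frame and the name of the engine that produced it.
    """
    try:
        return pd.read_csv(path, engine="c", **kwargs), "c"
    except ParserError:
        # Options that only the C engine accepts are not handled here.
        return pd.read_csv(path, engine="python", **kwargs), "python"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("a,b\n1,2\n3,4\n")
    df, engine_used = read_csv_with_fallback(path)

print(engine_used, df.shape)
```

The Python engine is generally slower, so the fallback only hides the symptom; a file that needs it is still worth inspecting.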

I had the same issue when trying to read a folder instead of a CSV file.
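
A path check before calling read_csv turns that case into a clearer error. The helper name require_file is hypothetical and uses only the standard library:

```python
import tempfile
from pathlib import Path


def require_file(path):
    """Return path as a Path if it is a regular file, else raise ValueError."""
    p = Path(path)
    if p.is_dir():
        raise ValueError(f"{p} is a directory, not a CSV file")
    if not p.is_file():
        raise ValueError(f"{p} does not exist or is not a regular file")
    return p


# Passing a directory triggers the explicit error message.
with tempfile.TemporaryDirectory() as tmp:
    try:
        require_file(tmp)
        message = ""
    except ValueError as exc:
        message = str(exc)

print(message)
```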

이 문제λ₯Ό μ‘°μ‚¬ν•œ μ‚¬λžŒμ΄ μžˆμŠ΅λ‹ˆκΉŒ? keras μƒμ„±κΈ°μ—μ„œ read_csvλ₯Ό μ‚¬μš©ν•  λ•Œ μ„±λŠ₯이 μ €ν•˜λ©λ‹ˆλ‹€.

제곡된 원본 λ°μ΄ν„°λŠ” 더 이상 μ‚¬μš©ν•  수 μ—†μœΌλ―€λ‘œ 문제λ₯Ό μž¬ν˜„ ν•  수 μ—†μŠ΅λ‹ˆλ‹€. λ¬Έμ œκ°€ 무엇인지 λͺ…ν™•ν•˜μ§€ μ•Šμ•„ λ§ˆλ¬΄λ¦¬ν•˜μ§€λ§Œ @dgrahn λ˜λŠ” λ‹€λ₯Έ μ‚¬λžŒμ΄ μž¬ν˜„ κ°€λŠ₯ν•œ 예제λ₯Ό 제곡 ν•  수 있으면 λ‹€μ‹œ μ—΄ 수 μžˆμŠ΅λ‹ˆλ‹€.

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's code that triggers the exception:

import pandas

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

File: debug.txt

Here's the exception from Windows 10, using Anaconda:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

And the same on RedHat:

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

@dgrahn I have downloaded debug.txt, and I get the following if I run pd.read_csv('debug.txt', header=None) on a Mac:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

This is different from the Buffer overflow caught error originally described.

I have inspected the debug.txt file: the first two lines have 204 columns, but the third line has 2504 columns. This would make the file unparsable and explains why an error is thrown.
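
This kind of inspection can be done with the standard csv module before involving pandas. A sketch, with field_counts as a hypothetical helper and a made-up three-record sample standing in for debug.txt:

```python
import csv
import io


def field_counts(handle):
    """Map each distinct field count to the 1-based record numbers that have it."""
    counts = {}
    for record_no, row in enumerate(csv.reader(handle), start=1):
        counts.setdefault(len(row), []).append(record_no)
    return counts


# Made-up stand-in for debug.txt: two records of 3 fields, then one of 6.
sample = io.StringIO("1,2,3\n4,5,6\n1,2,3,4,5,6\n")
counts = field_counts(sample)
print(counts)
```

More than one key in the result means the file is ragged, and the largest key is the width to pass via names=range(...).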

이것이 μ˜ˆμƒ λ˜λŠ”κ°€? GitHubλŠ” μ—…λ‘œλ“œ 된 예제λ₯Ό μ—‰λ§μœΌλ‘œ λ§Œλ“œλŠ” κ°œν–‰ μœ ν˜• ( "rn"및 "n")간에 λ°±κ·ΈλΌμš΄λ“œμ—μ„œ μ•”μ‹œ 적 λ³€ν™˜μ„ μˆ˜ν–‰ ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

@joshlk Did you use the names=range(2504) option, as described in the comment above?

@dgrahn Good point.

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

It is worth noting that pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is unlikely to be related to the original bug, even though it produces the same symptom.
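
The whole-file versus chunked comparison can be scripted on a small ragged file. The file content below is made up and much narrower than debug.csv, and the sketch does not claim to trigger the buffer overflow; it only shows the two call patterns side by side. On pandas versions where chunked reads of ragged rows work, both yield the same shape.

```python
import os
import tempfile

import pandas as pd

# Made-up ragged file: two short rows followed by two wide rows.
content = "1,2,3\n4,5,6\n1,2,3,4,5,6\n7,8,9,10,11,12\n"
width = 6  # number of fields in the widest row

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "debug.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(content)

    # Whole-file read: short rows are padded with NaN up to `width` columns.
    whole = pd.read_csv(path, names=range(width))

    # Chunked read of the same file with the same column names.
    chunks = list(pd.read_csv(path, names=range(width), chunksize=2))

chunked = pd.concat(chunks)
same_shape = whole.shape == chunked.shape
print(whole.shape, chunked.shape)
```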

@joshlk I could open a separate issue if that would be preferred.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')

This solved my problem.

engine = 'python'

이 μ ‘κ·Ό 방식을 μ‹œλ„ν•˜κ³  λŒ€μš©λŸ‰ 데이터 νŒŒμΌμ„ μ—…λ‘œλ“œ ν•  μˆ˜μžˆμ—ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŸ¬λ‚˜ 데이터 ν”„λ ˆμž„μ˜ 차원을 ν™•μΈν–ˆμ„ λ•Œ ν–‰ μˆ˜κ°€ μ¦κ°€ν–ˆμŒμ„ μ•Œμ•˜μŠ΅λ‹ˆλ‹€. 이λ₯Όμœ„ν•œ 논리 μ˜μ—­μ€ λ¬΄μ—‡μΌκΉŒμš”?

이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰