Pandas: read_csv C-engine CParserError: Error tokenizing data

Created on 22 Sep 2015 · 15 comments · Source: pandas-dev/pandas

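Hi,

I have a dataset that the C-engine read_csv has problems with. I am not sure of the exact cause, but I have narrowed it down to a single row, which I have pickled and uploaded to dropbox. If you obtain the pickle, try the following: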

import pandas as pd

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')  # default C engine

This raises the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you read the CSV using the python engine instead, no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

This suggests that the issue is with read_csv and not with to_csv. The versions I use are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2

Bug IO CSV Needs Info

Source

joshlk

👍16 🚀1 ❤1 🎉1

All 15 comments

The second-to-last line contains a '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')

chris-b1 on 23 Sep 2015

👍41
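
A minimal sketch of the universal-newline workaround chris-b1 describes, written for Python 3, where the 'rU' mode is deprecated and no longer accepted by recent releases. In Python 3, opening a file in text mode with newline=None performs the same translation. The file name and contents below are made up for illustration; the sketch shows only the newline translation and does not reproduce the original error.

```python
import pandas as pd

# Hypothetical sample: the second-to-last row ends in a bare '\r'.
raw = b"a,b\r\n1,2\r\n3,4\r5,6\r\n"
with open("newline_demo.csv", "wb") as fh:
    fh.write(raw)

# newline=None (the text-mode default) turns '\r', '\r\n' and '\n' into '\n'
# before pandas sees the data, which is what 'rU' did on Python 2.
with open("newline_demo.csv", newline=None, encoding="utf-8") as fh:
    df = pd.read_csv(fh, engine="c")

print(df)
```

Because the translation happens in the file object, the parser only ever sees '\n' line endings, whichever engine is selected.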

I am getting this error as well. Using the method suggested by @chris-b1 produces the following error:

Traceback (most recent call last):
  File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
    encoding='utf-8', engine='c')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
  File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

jelmelk on 21 Feb 2016

shaynekang on 21 Mar 2016

👍6

I also ran into this issue when reading a large csv file with the default engine. With engine='python' it works fine.

alfonsomhc on 18 May 2017

👍36

I missed @alfonsomhc's answer because it just looked like a comment.

You need

df = pd.read_csv('test.csv', engine='python')

justinjdickow on 10 Jan 2018

👍43 ❤10 🚀5 🎉5 😄3 👎3 👀1
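
One way to apply the engine='python' suggestion without giving up the faster C engine for files that parse cleanly is to fall back only on failure. This is a sketch, not a pandas feature: read_csv_with_fallback is a hypothetical helper name, and it assumes a pandas version that exposes pandas.errors.ParserError (the older releases in this thread raise CParserError instead). Keyword arguments supported by only one engine would need extra handling.

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Try the C engine first; retry with the Python engine on a parser error."""
    try:
        return pd.read_csv(path, engine="c", **kwargs)
    except pd.errors.ParserError:
        # The Python engine is slower but accepts some inputs the C engine rejects.
        return pd.read_csv(path, engine="python", **kwargs)


# Usage with a small made-up file.
with open("fallback_demo.csv", "w", encoding="utf-8") as fh:
    fh.write("x,y\n1,2\n3,4\n")

df = read_csv_with_fallback("fallback_demo.csv")
print(df.shape)
```

If the Python engine also fails, its exception propagates unchanged, so a genuinely malformed file is still reported.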

I had the same problem when I tried to read a folder instead of a csv file.

Vozf on 29 Sep 2018

👍3
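
For the case Vozf mentions, a path check before calling read_csv gives a clearer message than a tokenizer error. The helper name require_csv_file is hypothetical and uses only the standard library.

```python
from pathlib import Path


def require_csv_file(path):
    """Return path as a Path, or raise a clear error if it is not a regular file."""
    p = Path(path)
    if p.is_dir():
        raise IsADirectoryError(f"{p} is a directory, not a CSV file")
    if not p.is_file():
        raise FileNotFoundError(f"{p} does not exist or is not a regular file")
    return p


# Passing a directory (here the current one) raises instead of reaching the parser.
try:
    require_csv_file(".")
except IsADirectoryError as exc:
    print(exc)
```

The returned Path can be passed straight to pandas, for example pd.read_csv(require_csv_file('test.csv')).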

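Has anyone investigated this issue? Performance suffers when using read_csv in a keras generator.

dgrahn on 31 Oct 2018

The original data provided is no longer available, so the issue cannot be reproduced. Closing because it is not clear what the problem is, but it can be reopened if @dgrahn or anyone else provides a reproducible example.

WillAyd on 31 Oct 2018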

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here is the code that triggers the exception.

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

File: debug.txt

Here is the exception on Windows 10, using Anaconda.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

The same happens on RedHat.

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

dgrahn on 05 Nov 2018

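@dgrahn I downloaded debug.txt, and running pd.read_csv('debug.txt', header=None) on a Mac gives the following:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

This is different from the Buffer overflow caught error originally described.

I inspected the debug.txt file: the first two lines have 204 columns but the third line has 2504 columns. That would make the file unparsable and explains why an error is thrown.

Is this expected? GitHub could be doing some implicit conversion between newline types ("\r\n" and "\n") in the background that is messing up the uploaded example.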

joshlk on 05 Nov 2018
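
The kind of inspection joshlk describes (how many fields each line has) can be sketched with the standard-library csv module. The helper name field_counts and the demo file are made up for illustration; debug.txt itself is not reproduced here.

```python
import csv


def field_counts(path, encoding="utf-8"):
    """Map each distinct field count to the 1-based record numbers that have it."""
    counts = {}
    # newline='' lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as fh:
        for recno, row in enumerate(csv.reader(fh), start=1):
            counts.setdefault(len(row), []).append(recno)
    return counts


# Made-up ragged file: two rows of 3 fields, then one row of 5.
with open("ragged_demo.csv", "w", newline="", encoding="utf-8") as fh:
    fh.write("1,2,3\n4,5,6\n7,8,9,10,11\n")

print(field_counts("ragged_demo.csv"))
```

More than one key in the result means the rows are not all the same width, which is the condition behind an "Expected N fields in line M" error when no column names are supplied.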

@joshlk Did you use the names=range(2504) option as described in the comment above?

dgrahn on 05 Nov 2018

😄1 👍1

@dgrahn Good point.

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is unlikely to be related to the original bug, although it produces the same symptom.

joshlk on 05 Nov 2018
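
A small-scale sketch of the two calls being compared: the same ragged file read with explicit names, once whole and once in chunks. The file contents, width and chunk size are made up and far smaller than debug.csv. On the pandas versions reported in this thread the chunked form raised the buffer overflow error; whether it does on a given installation depends on the pandas version, so no outcome is asserted here.

```python
import pandas as pd

# Made-up stand-in for debug.csv: early rows are narrower than a later one.
with open("chunk_demo.csv", "w", encoding="utf-8") as fh:
    fh.write("1,2,3\n4,5,6\n7,8,9,10,11\n")

names = list(range(5))  # as wide as the widest row, like names=range(2504)

# Whole-file read: rows shorter than `names` are padded with NaN.
whole = pd.read_csv("chunk_demo.csv", names=names)

# Chunked read of the same file; the first chunk holds only narrow rows.
chunks = pd.read_csv("chunk_demo.csv", names=names, chunksize=2)
chunked = pd.concat(chunks, ignore_index=True)

print(whole.shape, chunked.shape)
```

If the two shapes agree, the chunked reader handled the change in row width between chunks; a ParserError from the second call would match the symptom described above.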

@joshlk I can open a separate issue if you prefer.

dgrahn on 05 Nov 2018

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')

solved my problem.

egenc on 17 Jun 2020

engine = 'python'

I tried this approach and was able to load large data files. But when I checked the dimensions of the dataframe, I saw that the number of rows had increased. What could be the reason for that?

dheeman00 on 11 Oct 2020
