Hi,
I have encountered a dataset that the C-engine read_csv has problems with. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:
import pandas as pd
df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')
I get the following exception:
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
If I try to read the CSV using the Python engine, no exception is thrown:
pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')
This suggests that the issue is with read_csv and not to_csv. The versions I am using are:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
Your second-to-last line includes a '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode:
pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
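For readers on Python 3, where the 'U' flag used in 'rU' was deprecated and then removed in Python 3.11, here is a minimal sketch of what universal-newline mode does, using only the standard library. The sample bytes and the temporary file are made up for illustration and are not the data from this issue. newline=None is already the default for text-mode open() and is spelled out only for clarity.

```python
import os
import tempfile

# Hypothetical sample: three records terminated by "\r", "\r\n" and "\n".
raw = b"a,b\r1,2\r\n3,4\n"

# Write the raw bytes to a temporary file so no terminator is altered on write.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline=None enables universal newlines: "\r", "\r\n" and "\n"
    # are all translated to "\n" while reading.
    with open(path, newline=None, encoding="utf-8") as handle:
        text = handle.read()
finally:
    os.remove(path)

print(repr(text))
```

An open handle like this could be passed to pd.read_csv in place of the filename, as in the call above.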
I'm encountering this error as well. Using the method suggested by @chris-b1 causes the following error:
Traceback (most recent call last):
File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
encoding='utf-8', engine='c')
File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
+1
I have also found this issue when reading a large CSV file with the default engine. If I use engine='python', it works fine.
I missed @alfonsomhc's answer because it just looked like a comment.
You need:
df = pd.read_csv('test.csv', engine='python')
I had the same issue when trying to read a folder rather than a CSV file.
Has anyone investigated this issue? It's killing performance when using read_csv in a Keras generator.
The original data provided is no longer available, so the issue is not reproducible. Closing as it's not clear what the issue is, but @dgrahn or anyone else, if you can provide a reproducible example, we can reopen.
@WillAyd Let me know if you need additional info.
Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's the code that triggers the exception:
import pandas
for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass
File: debug.txt
Here's the exception from Windows 10, using Anaconda:
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
And the same on RedHat:
$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
return self.get_chunk()
File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
@dgrahn I have downloaded debug.txt, and running pd.read_csv('debug.txt', header=None) on a Mac gives the following:
ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504
This is different from the Buffer overflow caught error originally described.
I have inspected the debug.txt file: the first two lines have 204 columns, but the third line has 2504 columns. This would make the file unparsable and explains why an error is thrown.
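A quick way to check for this kind of ragged row is to count the fields per record with the standard-library csv module. This is only a sketch: first_ragged_row is a hypothetical helper, the sample data is made up, and it assumes a plain comma-delimited file.

```python
import csv
import io


def first_ragged_row(lines):
    """Return (record_number, n_fields, expected) for the first record whose
    field count differs from the first record, or None if all widths match."""
    expected = None
    for number, row in enumerate(csv.reader(lines), start=1):
        if expected is None:
            expected = len(row)  # width of the first record sets the baseline
        elif len(row) != expected:
            return number, len(row), expected
    return None


# Made-up sample: two records with 3 fields, then one with 5.
sample = io.StringIO("1,2,3\n4,5,6\n7,8,9,10,11\n")
print(first_ragged_row(sample))  # (3, 5, 3)
```

For a real file, the same helper could be given a handle opened with open('debug.txt', newline='').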
Is this expected? GitHub could be doing some implicit conversion in the background between newline types ("\r\n" and "\n") that is messing up the uploaded example.
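One way to test the newline-conversion theory would be to count the line terminators in the raw bytes of the local file and of the downloaded copy, then compare the two. A minimal sketch, where count_line_endings is a hypothetical helper and the sample bytes are invented:

```python
def count_line_endings(data: bytes) -> dict:
    """Count "\r\n" pairs, bare "\r" and bare "\n" in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one "\r" and one "\n", so subtract the pairs
    # to get the terminators that stand alone.
    bare_cr = data.count(b"\r") - crlf
    bare_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": bare_cr, "LF": bare_lf}


# Invented sample mixing all three terminator styles.
sample = b"a,b\r\n1,2\n3,4\r5,6\n"
print(count_line_endings(sample))  # {'CRLF': 1, 'CR': 1, 'LF': 2}
```

For a real file, read it in binary mode, for example open('debug.txt', 'rb').read(), so that Python does not translate the terminators before they are counted.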
@joshlk Did you use the names=range(2504) option as described in the comment above?
@dgrahn Good point.
OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).
Note that pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is unlikely to be related to the original bug, although it produces the same symptom.
@joshlk I can open a separate issue if that would be preferred.
pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')
Solved my problem.
I tried the engine='python' approach and was able to load large data files. However, when I checked the dimensions of the DataFrame, I saw that the number of rows had increased. What is the logical reason for that?