I'm having trouble with read_csv (Pandas 0.17.0) when trying to read a 380+ MB csv file. The file starts with 54 fields, but some lines have 53 fields instead of 54. Running the code below produces the following error:
parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
usecols=range(0, 42),
parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
date_parser=parser,
header=None)
Error:
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
Passing the error_bad_lines=False keyword shows the offending lines being skipped, like:
Skipping line 1683401: expected 53 fields, saw 54
However, this time the following error occurs (and no DataFrame is loaded):
CParserError: Too many columns specified: expected 54 and found 53
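As a side note for readers on newer versions: error_bad_lines=False was deprecated and removed in modern pandas (1.3+), where on_bad_lines replaces it. A minimal sketch with made-up in-memory data (the values are purely illustrative):

```python
import io
import pandas as pd

# hypothetical ragged data: the middle row has one field too many
data = "1,2,3\n4,5,6,7\n8,9,10\n"

# pandas >= 1.3 replaced error_bad_lines=False with on_bad_lines;
# "skip" silently drops the offending rows instead of raising
df = pd.read_csv(io.StringIO(data), header=None, on_bad_lines="skip")
```

Here the second row is dropped and the remaining two rows load as a 2x3 DataFrame.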
Passing the engine='python' keyword raises no error, but parsing the data takes a very long time. Note that 53 and 54 are swapped in the error messages depending on whether error_bad_lines=False is used.
All of these errors are accurate. By passing usecols and names you are dictating exactly what work the parser performs. Try not passing them and see whether it can parse. Something like this is very hard to diagnose without a sample of the file that reproduces it. The output of pd.show_versions() would also help.
Using the original data file:
With no other keywords, pd.read_csv(filename) seems to parse the data without error. pd.read_csv(filename, header=None) gives the following error:
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
I completely agree that it is very hard to diagnose without sample data. I tried to recreate the error with a csv file of just a few lines (some with 53 fields, some with 54), but pd.read_csv filled the gaps with NaN as expected. I repeated this passing usecols and header=None, and it still works. Something in the original file seems to be triggering all the errors.
The output of pd.show_versions() is:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
pd.read_csv(filename, header=None) gives the following error:
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
This is expected, since the width is inferred from the first line. If you pass names, that will be used as the deciding feature instead, so keep trying various options. The interaction of names and usecols is actually a bit tricky; it might be better to simply read the file in, then reindex afterwards with what you need.
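The "width is inferred from the first line" point can be demonstrated with a toy ragged file (hypothetical values, not the question's data):

```python
import io
import pandas as pd

# a short first row makes the longer second row a "bad line",
# reproducing the tokenizing error from the question
ragged = "1,2,3\n4,5,6,7\n"
try:
    pd.read_csv(io.StringIO(ragged), header=None)
except pd.errors.ParserError as exc:
    print(exc)  # the "Expected 3 fields ... saw 4" tokenizing error

# with the widest row first, the same data parses and the short
# row is padded with NaN
df = pd.read_csv(io.StringIO("4,5,6,7\n1,2,3\n"), header=None)
```

This matches the observation above that a hand-made sample may or may not reproduce the error, depending on which row comes first.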
Strangely, using engine='python' loads the DataFrame without complaint. I used the snippet below to extract the first 3 lines of the file and 3 of the bad lines (I got the line numbers from the error messages):
from csv import reader
N = int(input('What line do you need? > '))
with open(filename) as f:
print(next((x for i, x in enumerate(reader(f)) if i == N), None))
Lines 1-3:
['08', '8', '7', '5', '0', '12', '54', '0', '11', '1', '58', '9', '68', '48.2', '0.756', '11.6', '17.5', '13.3', '4.3', '11.3', '32.2', '6.4', '4.1', '5.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '15', '80', '0', '11', '1', '62', '9', '69', '77.8', '3.267', '11.2', '17.7', '14.8', '4.2', '15.2', '29.1', '18.4', '10.0', '18.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '21', '52', '0', '11', '1', '61', '11', '51', '29.4', '0.076', '4.1', '13.8', '8.3', '21.5', '5.3', '3.1', '5.7', '3.0', '6.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
Bad lines:
['09', '9', '15', '22', '46', '9', '51', '0', '11', '1', '57', '9', '70', '36.3', '0.242', '11.8', '16.2', '6.4', '4.1', '5.8', '31.3', '5.5', '3.9', '6.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '25', '31', '0', '11', '1', '70', '9', '73', '67.8', '2.196', '10.4', '17.0', '13.4', '4.4', '12.2', '31.8', '15.6', '4.2', '16.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '28', '41', '0', '11', '1', '70', '5', '22', '7.4', '0.003', '4.0', '13.1', '3.4', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
As you suggested, I read the file in and then tidy the DataFrame (renaming columns, removing unneeded items, etc.), using the plain python engine (with its long processing time).
On further investigation, the following command sequence works (the first line of the data is lost since header=None is not passed, but the rest loads fine):
df = pd.read_csv(filename,
usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']
The following does not work:
df = pd.read_csv(filename,
names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
usecols=range(0, 42))
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
The following does not work:
df = pd.read_csv(filename,
header=None)
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
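One workaround that follows from the width-inference behaviour (a sketch with made-up data, not something tried in the thread): pass a names list at least as long as the widest row, so ragged rows are padded with NaN instead of raising:

```python
import io
import pandas as pd

# hypothetical data: most rows have 3 fields, one has 4
data = "1,2,3\n4,5,6,7\n8,9,10\n"

# naming every possible column (here 4) lets the C parser accept
# all rows; shorter rows get NaN in the trailing columns
df = pd.read_csv(io.StringIO(data), header=None,
                 names=["f1", "f2", "f3", "extra"])
```

The unneeded trailing columns can then be dropped after the read, in line with the "read it in, then reindex" advice from the answer.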
Closing with a usage question. With:
dic_df = _create_init_dic("C:/Users/swati/Downloads/VQA-Med 2018 Dataset/c5e905f7-6eb0-4a98-b284-da0729a1caf3_VQAMed2018Train/VQAMed2018Train/VQAMed2018Train-QA.csv")
I get: ParserError: Error tokenizing data. C error: Expected 1 field in line 33, saw 3
Try this:
df = pd.read_csv(filename, header=None, error_bad_lines=False)
It worked for me on a similar error. Thanks!
Also try adding quoting=3 when reading.
Is there a way to always ignore the extra fields on lines that have more fields than the header? e.g. for "Expected 53 fields in line 1605634, saw 54", drop field 54 of line 1605634.
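There was no built-in way to do this at the time, but newer pandas (1.4 and later) accepts a callable for on_bad_lines with the python engine. A sketch with toy data (not from the thread):

```python
import io
import pandas as pd

# hypothetical data: the last row carries one extra trailing field
data = "a,b,c\n1,2,3\n4,5,6,7\n"

# the callable receives the split fields of each bad line; returning
# a slice keeps the row with the extra fields dropped (python engine only)
df = pd.read_csv(io.StringIO(data), engine="python",
                 on_bad_lines=lambda fields: fields[:3])
```

Returning None from the callable would skip the row entirely instead of truncating it.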
Another case here! I had the same error. Mine was solved with error_bad_lines=False; it still prints the errors, but finishes with exit code 0.
I added the delimiter parameter to read_csv and that worked; error_bad_lines=False had no effect.
It works! I am writing the csv with R and reading it with Python. Making the first line the longest line in the file solved the bad-lines problem without dropping any lines.
I get this error when trying to import an .xlsx file using pd.read_csv.
Try pd.read_excel instead of pd.read_csv.
The easiest way to fix this is to convert the file to an Excel file and use pd.read_excel instead of pd.read_csv to read the data.