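pandas 0.12 with Python 3: ParserError (EOF character) when reading multiple CSV files into HDF5
I want to write multiple CSV files (7.9 GB in total) to an HDF5 store so that I can process them afterwards. Each CSV file contains roughly a million rows and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the CSV files, I get the following error: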
Traceback (most recent call last):
File "filter-1.py", line 38, in <module>
to_hdf()
File "filter-1.py", line 31, in to_hdf
for chunk in reader:
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
yield self.read(self.chunksize)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
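Edit:
I managed to find the file that causes the problem. I think the parser is reading an EOF character, but I don't know how to get around it. Given the size of the combined files, checking every single character in every string seems too cumbersome. (Even then, I'm not sure how I would do it.) As far as I can tell, there are no odd characters in the CSV files that could raise the error. I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
My code is as follows: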
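The message "EOF inside string starting at line 754991" means the parser reached the end of the file while it was still inside a quoted field, so an unclosed quote on or near that line is one plausible culprit. The following is only a minimal sketch for narrowing that down without inspecting every character by hand. It assumes the default `"` quote character, UTF-8 text, and that quoted fields are not supposed to span several lines; `lines_with_odd_quotes` is a hypothetical helper name, not a pandas function.

```python
def lines_with_odd_quotes(path, quotechar='"', encoding="utf-8"):
    """Return the 1-based numbers of lines holding an odd count of quotechar.

    An odd count suggests a quote that was opened but never closed on that
    line. Doubled quotes ("") used as escapes contribute two and stay even.
    """
    suspects = []
    # errors="replace" keeps the scan going even if some bytes are not valid
    # in the assumed encoding; newline="" leaves line endings untouched.
    with open(path, encoding=encoding, errors="replace", newline="") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.count(quotechar) % 2:
                suspects.append(lineno)
    return suspects
```

Running this over each file matched by the glob pattern and looking at the reported line numbers may show whether a stray quote is involved. If fields legitimately contain line breaks inside quotes, this heuristic will report false positives.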
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob
def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
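Edit
If I open the CSV file that raises the CParserError EOF and manually delete every line after the one that causes the problem, the file is read correctly. However, everything I delete is blank lines. Oddly, once I have corrected the faulty CSV files by hand, they load into the store fine one at a time. But when I use the list of multiple files again, the "faulty" files still return the error.
Don't pass `mode='w'`; you are truncating the HDF file on every iteration. – Jeff
You could try catching the CParserError and skipping that file (until you fix it). – Jeff
Hi Jeff, how would you suggest I catch the CParserError? Checking every individual file is too cumbersome. – Matthijs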
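One way to read the "catch and skip" suggestion is sketched below. It is not Jeff's code, just an illustration under some assumptions: the exception is imported from `pandas.errors` (where recent pandas releases expose it as `ParserError`) with a fallback to `pandas.parser.CParserError`, the name shown in the traceback above; `load_files` and `handle_chunk` are hypothetical names introduced for this example.

```python
import pandas as pd

try:
    # Recent pandas releases expose the parser exception here.
    from pandas.errors import ParserError
except ImportError:
    # Assumption: older releases such as 0.12 use the name from the traceback.
    from pandas.parser import CParserError as ParserError


def load_files(paths, handle_chunk, chunksize=50000):
    """Pass each chunk of each CSV to handle_chunk; return the files skipped.

    A file is abandoned at the first parser error and recorded together with
    the error message, so the remaining files are still processed.
    """
    skipped = []
    for path in paths:
        try:
            for chunk in pd.read_csv(path, chunksize=chunksize):
                handle_chunk(chunk)
        except ParserError as exc:
            skipped.append((path, str(exc)))
    return skipped
```

With an open store, `handle_chunk` could be something like `lambda chunk: store.append('ta_data', chunk)`, which adds rows to the existing table instead of rewriting it the way `mode='w'` does. Note that chunks read before the error in a bad file will already have been handed to `handle_chunk`, so a skipped file may be partially stored; the returned list tells you which files to revisit.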