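ValueError: Importing data in chunks with pandas.csv_reader()
I have a large gzip file that I want to import into a pandas DataFrame. Unfortunately, the file has an uneven number of columns. The data has roughly the following format: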
.... Col_20: 25 Col_21: 23432 Col22: 639142
.... Col_20: 25 Col_22: 25134 Col23: 243344
.... Col_21: 75 Col_23: 79876 Col25: 634534 Col22: 5 Col24: 73453
.... Col_20: 25 Col_21: 32425 Col23: 989423
.... Col_20: 25 Col_21: 23424 Col22: 342421 Col23: 7 Col24: 13424 Col 25: 67
.... Col_20: 95 Col_21: 32121 Col25: 111231
As a test, I tried this:
import pandas as pd
filename = 'path/to/filename.gz'
# Read the tab-separated file in chunks of 100,000 rows
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)
This is the error I get back:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 795, in __next__
    return self.get_chunk()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 836, in get_chunk
    return self.read(nrows=size)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1761, in read
    alldata = self._rows_to_cols(content)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 2166, in _rows_to_cols
    raise ValueError(msg)
ValueError: Expected 18 fields in line 28, saw 22
How do you pre-allocate a fixed number of columns for pandas.read_csv()?
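Your problem is a malformed csv; it has nothing to do with pre-allocating the number of columns. You need to do some extra debugging to find the specific malformed file and line. You should post a link to the csv or a small sample that reproduces the error – EdChum
@EdChum It's not just one line - practically every line in this file is like this. Some rows may have 20 columns, the next 28, and so on. – ShanZhengYang
I can't answer a hypothetical question without seeing the actual data. Posted data should have a regular delimiter and form; if it doesn't, you need to clean the data first – EdChum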
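For what it's worth, here is a rough sketch of one way to combine the two points above: first scan the file to find the widest row (the debugging step), then pass that many column names to pd.read_csv() so that shorter rows are padded with NaN. It assumes the file really is tab-separated with no header row. The helper names max_field_count and read_ragged_chunks are made up for illustration; they are not part of pandas.

```python
import csv
import gzip

import pandas as pd


def max_field_count(path, sep="\t"):
    """First pass: return the largest number of fields found on any line."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=sep)
        return max((len(row) for row in reader), default=0)


def read_ragged_chunks(path, sep="\t", chunksize=10**5):
    """Second pass: yield DataFrame chunks wide enough for the widest row."""
    width = max_field_count(path, sep)
    # header=None plus an explicit list of names means rows with fewer
    # fields than `width` are filled with NaN instead of raising.
    yield from pd.read_csv(
        path,
        sep=sep,
        header=None,
        names=list(range(width)),
        chunksize=chunksize,
        compression="gzip",
    )
```

Usage would look like `for chunk in read_ragged_chunks('path/to/filename.gz'): print(chunk)`. This reads the file twice, which may be slow for a very large archive. It also only pads the rows; if the same position holds different fields on different lines, as in the sample, the values still have to be realigned afterwards.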