2017-02-04 140 views

CParserError when reading a CSV file into Python Spyder

I want to read a large CSV file (about 17 GB) into Python Spyder using the pandas module. Here is my code:

import pandas as pd

data = pd.read_csv('example.csv', encoding='ISO-8859-1') 

But I keep getting a CParserError error message:

Traceback (most recent call last): 

File "<ipython-input-3-3993cadd40d6>", line 1, in <module> 
data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1') 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f 
return _read(filepath_or_buffer, kwds) 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read 
return parser.read() 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read 
ret = self._engine.read(nrows) 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read 
data = self._reader.read(nrows) 

File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748) 

File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003) 

File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731) 

File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602) 

File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325) 

CParserError: Error tokenizing data. C error: out of memory 

I know there have been some discussions of this problem, but each seems very specific and the circumstances vary from case to case. Can someone help me?

I am using Python 3 on a Windows system. Thanks in advance.

Edit:

As suggested by ResMar, I tried the following code:

data = pd.DataFrame() 
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000) 
for chunk in reader: 
    data.append(chunk, ignore_index=True) 

But it returns nothing:

data.shape 
Out[12]: (0, 0) 
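A likely explanation for the empty result: `DataFrame.append` (since removed in pandas 2.0) returned a new DataFrame rather than modifying `data` in place, so the first loop discarded every chunk. A minimal sketch of the same non-mutating behavior on toy data, using `pd.concat` (the modern replacement):

```python
import pandas as pd

data = pd.DataFrame()
chunk = pd.DataFrame({'a': [1, 2, 3]})

# Like the old DataFrame.append, pd.concat returns a NEW frame and never
# mutates its inputs -- discarding the return value discards the rows.
pd.concat([data, chunk], ignore_index=True)   # result thrown away
print(data.shape)                             # still (0, 0)

data = pd.concat([data, chunk], ignore_index=True)  # keep the result
print(data.shape)                             # now (3, 1)
```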

Then I tried the following code:

data = pd.DataFrame() 
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000) 
for chunk in reader: 
    data = data.append(chunk, ignore_index=True) 

This again gives the out-of-memory error. Here is the traceback:

Traceback (most recent call last): 

File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module> 
for chunk in reader: 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__ 
return self.get_chunk() 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk 
return self.read(nrows=size) 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read 
ret = self._engine.read(nrows) 

File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read 
data = self._reader.read(nrows) 

File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748) 

File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208) 

File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731) 

File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602) 

File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325) 

CParserError: Error tokenizing data. C error: out of memory 
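This second failure is expected: even with the assignment fixed, `data = data.append(chunk)` still rebuilds the entire 17 GB table in RAM. Chunked reading only helps if each chunk is reduced (filtered, aggregated, or written elsewhere) before the next one is read. A sketch on toy in-memory data, where the `label` column and the filter are made-up stand-ins for whatever reduction fits newsall.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv = io.StringIO("label,text\nspam,aaa\nham,bbb\nspam,ccc\n")

kept = []
for chunk in pd.read_csv(csv, chunksize=2):
    # Reduce each chunk before keeping it, so only the needed rows
    # accumulate in memory rather than the full table.
    kept.append(chunk[chunk['label'] == 'spam'])

data = pd.concat(kept, ignore_index=True)
print(data.shape)   # (2, 2) -- only the reduced rows are held
```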

Answers


It seems obvious to me what the error is: the computer ran out of memory. The file itself is 17 GB, and as a rule of thumb pandas takes roughly twice that much space when reading a file, so you need around 34 GB of RAM to read this data directly.

Most computers these days have 4, 8, or 16 GB; a few have 32. Your computer runs out of memory, and when that happens C kills your process.

You can work around this by reading your data in chunks and processing each segment in turn. See the chunksize parameter of pd.read_csv for details, but you basically want something that looks like:

for chunk in pd.read_csv("...", chunksize=10000): 
    do_something(chunk) 
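For instance, if the eventual goal is a summary rather than the full table, `do_something` can aggregate each chunk and keep only the small per-chunk results. A sketch on toy data (the `category` and `value` columns are invented for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file.
csv = io.StringIO("category,value\na,1\nb,2\na,3\nb,4\na,5\n")

totals = pd.Series(dtype='float64')
for chunk in pd.read_csv(csv, chunksize=2):
    # Aggregate the chunk, then fold the small result into the running total.
    totals = totals.add(chunk.groupby('category')['value'].sum(), fill_value=0)

print(totals.to_dict())
```

Only one chunk plus the tiny running total is ever held in memory at a time.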
Thanks for your answer. I just want to read the data in as a dataframe. What code should I write for do_something? –

That is for you to determine. –

Could you take a look at my edited question? It still gives the error. –