Memory error when using a large text corpus

I have a large text file (~450 MB -> 129,000 lines and 457,000,000 characters). When I work with this file, a MemoryError is raised after a while. Here is my code:

docDict = {} 
ind = 1 

with open('somefile.txt', encoding='utf-8') as f:
    for line in f:
        data = line.split(' ')
        docDict[ind] = data
        ind += 1

I have seen this, but I am already reading the file line by line.


2016-12-21 Arman

The MemoryError is raised here because, even though you read the file line by line, you store its contents in the dictionary docDict, and therefore in memory.

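I don't know what you plan to do with this dictionary, but I suggest doing your processing right after you read each line, and then storing the result in a variable (if the processing condenses the data a lot), or directly in a file or a database.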
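As a minimal sketch of that idea, assume the goal is a word-frequency count (a hypothetical task; the question does not say what the dictionary is for). The function name `count_words` is also made up for illustration. Each line is processed as soon as it is read, and only the aggregated result stays in memory:

```python
from collections import Counter

def count_words(path, encoding="utf-8"):
    """Stream the file and keep only aggregate counts, not the lines."""
    counts = Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            # The line is processed and then discarded, so peak memory
            # depends on the vocabulary size rather than the file size.
            counts.update(line.split())
    return counts
```

Calling `count_words('somefile.txt')` would return a `Counter` whose size grows with the number of distinct words, not with the number of lines.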

Hope I helped! Bye!


2016-12-21 10:37:12

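To measure the overhead of the data structures in your code, I wrote the following test program. It assumes that your text file is N megabytes of ASCII with relatively short lines. (I had to reduce N from 450 to 150 after I ran out of physical memory.)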

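# Python 2 (uses xrange and the print statement)
import sys

MB = 1024 * 1024

line = "the quick brown fox jumps over the lazy dog"
megs = 150
nlines = (megs * MB)/len(line)

# Same structure as in the question: line number -> list of words
d = {}
for i in xrange(nlines):
    d[i] = line.split(' ')

# sys.getsizeof() is shallow: this is the dict's own table only
dict_size = sys.getsizeof(d)
# d.items() yields (key, list) tuples, so this sums the shallow size of each tuple
list_size = sum(sys.getsizeof(a) for a in d.items())
# For each (key, list) tuple, this sums the shallow sizes of the key and of the list
item_size = sum(sum(sys.getsizeof(s) for s in a) for a in d.items())

print " dict:", dict_size/float(MB), "MB"
print "lists:", list_size/float(MB), "MB"
print "items:", item_size/float(MB), "MB"
print "total:", (dict_size + list_size + item_size)/float(MB), "MB"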

The result:

dict: 192.00 MB 
lists: 251.16 MB 
items: 669.77 MB 
total: 1112.9 MB

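Watching Activity Monitor, the Python process exceeded 2 gigabytes of memory usage, so some memory is not accounted for here. Artifacts of the malloc implementation could be one explanation.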
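Because the test program iterates over `d.items()`, which yields `(key, list)` pairs, its `lists` and `items` figures do not measure the word lists and the word strings directly. The following is a sketch of a Python 3 variant that walks `d.values()` instead. The function name `measure` and the small default line count are assumptions made for illustration, the integer keys are not counted, and the numbers it prints will differ from the Python 2 figures above:

```python
import sys

MB = 1024 * 1024

def measure(nlines, line="the quick brown fox jumps over the lazy dog"):
    """Return shallow sizes in bytes: (dict, word lists, word strings)."""
    d = {i: line.split(' ') for i in range(nlines)}
    dict_size = sys.getsizeof(d)
    # d.values() yields the word lists themselves
    list_size = sum(sys.getsizeof(words) for words in d.values())
    # Every word string of every list is counted once per occurrence
    item_size = sum(sys.getsizeof(w) for words in d.values() for w in words)
    return dict_size, list_size, item_size

if __name__ == "__main__":
    sizes = measure(10000)
    for label, size in zip(("dict", "lists", "items"), sizes):
        print("{:>5}: {:.2f} MB".format(label, size / MB))
    print("total: {:.2f} MB".format(sum(sizes) / MB))
```

Raising the argument of `measure` scales the experiment up; the per-line cost it reports is what matters when extrapolating to a 450 MB file.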

I implemented the same program in C++:

#include <string> 
#include <vector> 
#include <unordered_map> 

int main() 
{ 
    int const MB = 1024 * 1024; 

    std::string const line = "the quick brown fox jumps over the lazy dog"; 
    std::vector<std::string> const split = {
        "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
    };

    int const megs = 150; 
    int const nlines = (megs * MB)/line.size(); 

    std::unordered_map<int, std::vector<std::string>> d; 
    for (int i = 0; i < nlines; ++i) {
        d[i] = split;
    }
}

Compiled with clang++ -O3, this used about 1 GB of memory. C++ has no sys.getsizeof(), so breaking down the memory usage takes more work, and I did not do it.

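Using twice the memory of the equivalent C++ is actually a pretty good result for Python, so I am removing my pre-edit comments about the CPython implementation.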

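I think your main problem is storing each line as an array of short strings. Could you store the lines as whole strings and split them as needed, rather than splitting them all at once?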
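A minimal sketch of that suggestion, assuming the whole file still fits in memory as plain strings: keep one string per line and split a line only when its words are needed. The names `load_lines` and `words_of` are hypothetical, and the list is indexed from 0, unlike the 1-based keys of `docDict` in the question:

```python
def load_lines(path, encoding="utf-8"):
    """Keep one string per line instead of a list of words per line."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]

def words_of(lines, index):
    """Split a single line on demand; the word list is temporary."""
    return lines[index].split(' ')
```

For example, `words_of(load_lines('somefile.txt'), 0)` returns the words of the first line, and the short-lived list can be garbage-collected as soon as the caller is done with it. This avoids paying the per-string and per-list overhead for every word of every line at the same time.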

What is the end goal of your program?


2016-12-21 11:08:57 japreiss

Unfortunately, every line has at least 100 and up to 1,000 words! How can I adapt my code? – Arman

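Actually, if I change the program to `line = 10 * "the quick brown fox jumps over the lazy dog"`, memory usage drops a lot, even for `items`. That is surprising: I only expected the `dict` and `lists` overhead to drop. – japreiss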

Added a C++ comparison and a new conclusion; see the edit. – japreiss
