2014-07-16

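Python MemoryError in a dictionary of millions of GeoNames locations?

I'm trying to create a dictionary of location names and information from Geonames, to be used in a program that reads a document, extracts location names, and outputs information about them. The keys are location names. Each value is a list of tuples holding the latitude and longitude, country code, feature class, and Geonames ID for that name (a list, because several locations can share the same name). Here is an example excerpt of the dictionary: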

{'xixerella': [(('42.55327', '1.48736'), 'AD', 'PPL', '3038816'), (('42.55294', '1.48764'), 'AD', 'ADMD', '3038817')], 'fonts vives': [(('42.5', '1.56667'), 'AD', 'SPNG', '3038822')], 'roc del xeig': [(('42.56667', '1.48333'), 'AD', 'RK', '3038820')], 'costa de xurius': [(('42.5', '1.48333'), 'AD', 'SLP', '3038814')]} 

The final dictionary has 9,088,105 keys. When I try to dump it to a file with pickle so that I can reference it in other programs, it throws this error:

Python(763,0xa03871a8) malloc: *** mach_vm_map(size=50331648) failed (error code=3) 
*** error: can't allocate region 
*** set a breakpoint in malloc_error_break to debug 
Traceback (most recent call last): 
    File "/Applications/Wing101.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 31, in  <module> 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump 
    Pickler(file, protocol).dump(obj) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump 
    self.save(obj) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save 
    f(self, obj) # Call unbound method with explicit self 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict 
    self._batch_setitems(obj.iteritems()) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems 
    save(v) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save 
    f(self, obj) # Call unbound method with explicit self 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list 
    self._batch_appends(iter(obj)) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends 
    save(x) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save 
    f(self, obj) # Call unbound method with explicit self 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple 
    save(element) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save 
    f(self, obj) # Call unbound method with explicit self 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 581, in save_tuple 
    self.memoize(obj) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 247, in memoize 
    self.memo[id(obj)] = memo_len, obj 
MemoryError: 

Is there a data structure I should be using instead of a dictionary? What can I do to reduce memory usage?

Here is my program:

import csv
import sys
import pickle

geodict = {}
# Alternate names to skip: empty or dash-only once non-ASCII is stripped
ignore = ["", " ", " ", " ", "-", " -", "- ", " - "]
csv.field_size_limit(sys.maxsize)
reader = csv.reader(open('allCountries-2.txt', 'rb'), delimiter='\t')
for row in reader:
    # Collect every name for this place: the ASCII name, then alternates
    loc = []
    loc.append(row[2].lower())
    if row[3] != '':
        altnames = row[3].split(',')
        for entry in altnames:
            # Drop non-ASCII characters, then lowercase
            entry = "".join(x for x in entry if ord(x) < 128)
            entry = entry.lower()
            if entry not in loc:
                if entry not in ignore:
                    loc.append(entry)
    geoid = row[0]
    latlong = (row[4], row[5])
    feature = row[7]
    country = row[8]
    # Register the record under each of its names
    for name in loc:
        if name in geodict:
            geodict[name].append((latlong, country, feature, geoid))
        else:
            geodict[name] = [(latlong, country, feature, geoid)]

with open('dict.txt', 'wb') as handle:
    pickle.dump(geodict, handle)

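In case you're not familiar with the format and content of the Geonames file: it is a 1.14 GB tab-delimited text file. row[2] is the location name in plain ASCII characters, and row[3] holds the optional alternate location names. Sometimes there are no alternate names. I strip the non-ASCII characters because the file contains some crazy accented characters and Chinese/Japanese/etc. characters that Python doesn't like. If anything else is unclear, just ask.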

Please help! Thanks!

Answer


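When a data structure is this large, you should switch to streaming pickle. It works much like regular pickle, but it loads and saves in a streaming (incremental) fashion and so uses far less memory.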
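
To illustrate the incremental idea, here is a minimal sketch that uses only the standard-library pickle module. It is not the streaming pickle package's own API, and the helper names dump_items and load_items are hypothetical. Each key/value pair is written as its own pickle record, so every dump() call starts with a fresh memo table instead of one that grows with the whole dictionary (the traceback above fails inside memoize).

```python
import pickle


def dump_items(mapping, path):
    """Write each (key, value) pair to `path` as a separate pickle record."""
    with open(path, "wb") as handle:
        for key in mapping:
            # One dump() per pair: the pickler's memo only ever covers
            # a single record, never the entire dictionary.
            pickle.dump((key, mapping[key]), handle, pickle.HIGHEST_PROTOCOL)


def load_items(path):
    """Yield (key, value) pairs one at a time until the file is exhausted."""
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:
                # pickle.load raises EOFError when no records remain
                return
```

With these helpers, dump_items(geodict, 'dict.txt') would replace the single pickle.dump call, and another program could rebuild the dictionary with dict(load_items('dict.txt')) or scan the records one by one without holding them all. Note that this only lowers the memory needed for pickling; the dictionary itself still has to fit in memory while it is being built.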