Memory efficiency of defaultdict

I am experimenting with some examples of computing PMI on a set of tweets I collected (about 50k). I found that the bottleneck of the algorithm is in defaultdict(lambda: defaultdict(int)), and I don't know why.

Here is the part I profiled; it takes a lot of memory and time:

for term, n in p_t.items(): 
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab) 
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab) 
    semantic_orientation[term] = positive_assoc - negative_assoc

Specifically, this part:

positive_assoc = sum(pmi[term][tx] for tx in positive_vocab) 
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)

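for some reason allocates a lot of memory. My assumption is that 0 is returned for values that don't exist, so the array passed to the sum function is very large.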

I solved the problem with a simple check for whether the value exists and an accumulator variable sum_pos.
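The workaround itself is not shown in the question. A minimal sketch of what it might look like follows; the helper name assoc and the toy data are assumptions made for illustration, not the asker's actual code.

```python
from collections import defaultdict

def assoc(pmi, term, vocab):
    """Sum pmi[term][tx] over vocab, touching only keys that already exist."""
    total = 0  # plays the role of sum_pos in the description above
    if term in pmi:              # a membership test does not call the factory
        row = pmi[term]
        for tx in vocab:
            if tx in row:        # skip missing pairs instead of inserting a 0
                total += row[tx]
    return total

# Toy data, made up for illustration only
pmi = defaultdict(lambda: defaultdict(int))
pmi["coffee"]["good"] = 1.5
pmi["coffee"]["bad"] = -0.5
positive_vocab = ["good", "great", "nice"]

sum_pos = assoc(pmi, "coffee", positive_vocab)
sum_pos_missing = assoc(pmi, "tea", positive_vocab)
```

Because only existing keys are read, pmi keeps the same size no matter how many terms or vocabulary words are checked.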

The whole implementation, from the blog:

import math
from collections import defaultdict

# PMI for every pair of terms that co-occur
pmi = defaultdict(lambda: defaultdict(int))
for t1 in p_t:
    for t2 in com[t1]:
        denom = p_t[t1] * p_t[t2]
        pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)

semantic_orientation = {} 
for term, n in p_t.items(): 
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab) 
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab) 
    semantic_orientation[term] = positive_assoc - negative_assoc

2015-06-28 badc0re

What is the 'defaultdict' here?

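defaultdict will call the factory function for each and every key that is missing. If you use it in a sum() where a lot of keys are missing, you will indeed create a whole load of dictionaries, which in turn grow to contain more keys, without ever using them.

Switch to using the dict.get() method here to prevent those objects from being created: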

positive_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in positive_vocab) 
negative_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in negative_vocab)

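Note that the pmi.get() call returns an empty dictionary when there is no dictionary associated with the given term, so that the chained dict.get() call keeps working and can return the default 0.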
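The difference can be checked on a small scale. The sketch below uses made-up terms and values (build_pmi and count_entries are hypothetical helpers) and counts how many entries the nested dictionary holds after each style of lookup.

```python
from collections import defaultdict

def build_pmi():
    # Toy table with a single known pair (values are made up)
    pmi = defaultdict(lambda: defaultdict(int))
    pmi["coffee"]["good"] = 1.5
    return pmi

def count_entries(pmi):
    # Number of outer keys plus number of inner keys
    return len(pmi) + sum(len(row) for row in pmi.values())

terms = ["coffee", "tea", "milk"]
vocab = ["good", "great", "nice"]

# Indexing: every missing term and every missing (term, tx) pair is inserted
indexed = build_pmi()
for term in terms:
    sum(indexed[term][tx] for tx in vocab)
entries_after_indexing = count_entries(indexed)

# dict.get(): nothing is inserted
looked_up = build_pmi()
for term in terms:
    sum(looked_up.get(term, {}).get(tx, 0) for tx in vocab)
entries_after_get = count_entries(looked_up)
```

Here entries_after_indexing is 12 (3 outer keys plus 3 inner keys each), while entries_after_get stays at 2. With indexing, the table grows toward the number of terms times the vocabulary size, which is where the memory goes.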

2015-06-28 17:21:36

Thanks. Good to know this kind of thing. – badc0re

I like Martijn's answer... but this should also work, and you may find it more readable.

positive_assoc = sum(pmi[term][tx] for tx in positive_vocab if term in pmi and tx in pmi[term])
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab if term in pmi and tx in pmi[term])
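This works because the in operator never triggers the default factory, and the and short-circuits before pmi[term] is evaluated for an unknown term. A small check, with a hypothetical factory wrapper that records its calls and made-up data:

```python
from collections import defaultdict

calls = []  # records each time the default factory runs

def factory():
    calls.append(1)
    return defaultdict(int)

pmi = defaultdict(factory)
pmi["coffee"]["good"] = 1.5   # one factory call, for "coffee"
calls.clear()                 # start counting from here

vocab = ["good", "great"]
total = sum(
    pmi[term][tx]
    for term in ["coffee", "tea"]
    for tx in vocab
    if term in pmi and tx in pmi[term]
)
```

After this runs, calls is still empty and pmi still holds only the "coffee" entry.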

2015-06-28 19:07:30
