Getting values from a defaultdict

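I am reading data from one or more emails and counting the frequency of each word. First I construct two counters, one of which is

counters.form = collections.defaultdict(dict)

The frequencies are then collected with

for word in re.findall('[a-zA-Z]\w*', data):
    counters.form[word][file_name] += 1

For each form (word), the counter stores every email in which that word appears, together with the frequency of the form in that email. For example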

form = {'a': {'email1':4, 'email2':3}, 
     'the': {'email1':2, 'email3':4}, 
     'or': {'email1':2, 'email3':1}}

How do I get the frequency of a given form in a specific email? For example, the frequency of 'a' in email2 is 3.

Source

2012-05-10 juju

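Your question is a bit confusing. Maybe you could give a small example? – happydave

Do you have to use 'defaultdict' because this is homework? 'collections.Counter' would be more appropriate. –

@gnibbler When I use collections.Counter, it tells me the object is not iterable. – juju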

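It looks like you are building what the IR (information retrieval) community calls an inverted index. In that case I agree with the overall approach you are taking, and I would also suggest using the Counter class together with the default dictionary...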

counters.form = collections.defaultdict(collections.Counter)

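counters.form will then act as a sort of index-compressed model of the world, in which a missing observation is not an error (nor is it False), just 0.

Using your form data as an example, we would populate the inverted index like this...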

#-- Build the example data into the proposed structure...
counters.form['a'].update({'email1':4, 'email2':3})
counters.form['the'].update({'email1':2, 'email3':4})
counters.form['or'].update({'email1':2, 'email3':1})
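
The same structure can also be filled directly from raw text with the loop from the question. The following is only a minimal sketch: the `emails` dict (file name to message body) is hypothetical sample data, and a plain `form` variable stands in for counters.form. With a Counter as the inner type, `+= 1` works the first time a word is seen in a file, whereas a plain inner dict would raise KeyError at that point.

```python
import collections
import re

# Hypothetical sample data (an assumption for illustration): file name -> text.
emails = {
    'email1': 'a cat or a dog',
    'email2': 'the cat saw a bird',
}

# Stand-in for counters.form: word -> Counter mapping file name to count.
form = collections.defaultdict(collections.Counter)

for file_name, data in emails.items():
    # Same pattern as in the question, written as a raw string.
    for word in re.findall(r'[a-zA-Z]\w*', data):
        form[word][file_name] += 1

print(form['a']['email1'])
print(form['a']['email2'])
```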

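Now, to get the frequency of a form in this data, we dereference it as if it were a two-dimensional array...

print(counters.form['a']['email2'])

...which should print 3, more or less the same as with the structure you are currently using. The real difference between the two approaches shows up when you have no observations. For example...

print(counters.form['noword']['some-email'])

...with your current structure (collections.defaultdict(dict)), the get of 'noword' on counters.form will "miss", and the defaultdict will automatically associate a newly constructed, empty dict with counters.form['noword']; however, when that empty dict is then queried for the key 'some-email', it has no such key, resulting in a KeyError exception for 'some-email'.

If instead we use the proposed structure (collections.defaultdict(collections.Counter)), then the get of 'noword' on counters.form will also miss, and a new collections.Counter will be associated with the key 'noword'. When that Counter is asked (in the second dereference) for 'some-email', it will respond with 0, which is (I believe) the desired behavior.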
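
The contrast between the two structures can be checked side by side. This is a small sketch using two throwaway variables, `plain` and `counted` (names chosen here for illustration), rather than the real counters.form:

```python
import collections

plain = collections.defaultdict(dict)                   # the current structure
counted = collections.defaultdict(collections.Counter)  # the proposed structure

try:
    plain['noword']['some-email']
except KeyError as exc:
    # The freshly created inner dict has no 'some-email' key.
    print('KeyError:', exc)

# The freshly created inner Counter reports 0 for an unseen key.
print(counted['noword']['some-email'])

# In both cases the outer lookup inserted an entry for 'noword' as a side effect.
print('noword' in plain, 'noword' in counted)
```

Note that the inner Counter does not store the missing key when it answers 0, so only the outer dictionary grows on such lookups.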

Some other recipes...

#-- Show distinct emails which contain 'someword' 
emails = list(counters.form['someword']) 

#-- Show tally of all observations of 'someword' 
tally = sum(counters.form['someword'].values())

Source

2012-05-10 04:55:08 parselmouth

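What if I want to write a for loop that goes through everything? Should I use 'for token in counters.form.items():'? – juju

@juju counters.form.items() will return a list of pairs of the form [('a', ...), ('the', ...), ...], which works well with list-processing functions such as map, filter, etc. – parselmouth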
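
A loop over everything could look roughly like the sketch below, which unpacks each item into the word and its per-email Counter. It assumes the example data from the answer and uses a plain `form` variable in place of counters.form; in Python 3, items() returns a view rather than a list, but it is iterated the same way.

```python
import collections

# Stand-in for counters.form, filled with the example data from the answer.
form = collections.defaultdict(collections.Counter)
form['a'].update({'email1': 4, 'email2': 3})
form['the'].update({'email1': 2, 'email3': 4})
form['or'].update({'email1': 2, 'email3': 1})

rows = []
for token, per_email in form.items():
    # per_email is the Counter mapping file name to count for this token.
    for file_name, count in per_email.items():
        rows.append((token, file_name, count))

# Sort only to make the printed order predictable.
for row in sorted(rows):
    print(row)
```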

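It might be a good idea to use the Counter class instead of a defaultdict:

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.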
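
The properties quoted above can be seen in a few lines. This is just an illustrative sketch with a made-up word list, not code from the question:

```python
import collections

# Count a small, made-up list of words.
c = collections.Counter(['a', 'the', 'a'])

print(c['a'])        # count of an element that was seen
print(c['missing'])  # unseen elements report 0 instead of raising KeyError

# Counts may drop to zero or below.
c['the'] -= 2
print(c['the'])
```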

Source

2012-05-10 03:15:42

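The approach here loses the information about where the counts came from. juju asked "How do I get the frequency of a given form in a specific email?", and a single Counter on its own (without other auxiliary index dictionaries, etc.) cannot answer that question. – parselmouth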
