How to find the most common entry in a Python dictionary of dictionaries

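I have a dictionary that is two levels deep. That is, each key in the first dictionary is a URL, and the value is another dictionary in which each key is a word and each value is the number of times that word appeared on that URL. It looks something like this: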

dic = { 
    'http://www.cs.rpi.edu/news/seminars.html': { 
     'hyper': 1, 
     'summer': 2, 
     'expert': 1, 
     'koushk': 1, 
     'semantic': 1, 
     'feedback': 1, 
     'sandia': 1, 
     'lewis': 1, 
     'global': 1, 
     'yener': 1, 
     'laura': 1, 
     'troy': 1, 
     'session': 1, 
     'greenhouse': 1, 
     'human': 1 

...and so on...

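The dictionary itself is very long: it has 25 URLs in it, and each URL has another dictionary as its value, holding every word found at that URL and the number of times it was found.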

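I want to find the word or words that appear on the most distinct URLs in the dictionary. So the output should look something like this: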

The following words appear x times on y pages: list of words

Source

2013-04-10 compscimaster

Can you provide complete sample input and output? – thegrinner 2013-04-10 18:24:48

Also, what have you tried so far? – 2013-04-10 18:40:33

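A Counter isn't quite what you want. From the output you show, it looks like you want to keep track of both the total number of occurrences and the number of pages each word occurs on.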

data = { 
    'page1': { 
     'word1': 5, 
     'word2': 10, 
     'word3': 2, 
    }, 
    'page2': { 
     'word2': 2, 
     'word3': 1, 
    } 
} 

from collections import defaultdict

class Entry(object):
    """Running totals for a single word."""
    def __init__(self):
        self.pages = 0        # number of pages the word appears on
        self.occurrences = 0  # total occurrences across all pages
    def __iadd__(self, occurrences):
        # Each += records one more page plus that page's count.
        self.pages += 1
        self.occurrences += occurrences
        return self
    def __str__(self):
        return '{} occurrences on {} pages'.format(self.occurrences, self.pages)
    def __repr__(self):
        return '<Entry {} occurrences, {} pages>'.format(self.occurrences, self.pages)

counts = defaultdict(Entry)

# values()/items() and print() work on both Python 2.7 and Python 3.
for page_words in data.values():
    for word, count in page_words.items():
        counts[word] += count

for word, entry in counts.items():
    print('{} : {}'.format(word, entry))

This produces the following output (the order of the lines may vary):

word1 : 5 occurrences on 1 pages 
word3 : 3 occurrences on 2 pages 
word2 : 12 occurrences on 2 pages

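That captures the information you want; the next step is to find the n most common words. You can do that with a heap, using heapq.nlargest. It has the handy feature of not requiring you to sort the whole list of words by number of pages and then occurrences. That can matter if you have a lot of words in total but the n in 'top n' is relatively small.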

from heapq import nlargest

def by_pages_then_occurrences(item):
    # item is a (word, Entry) pair; rank by page count, then by total occurrences.
    entry = item[1]
    return entry.pages, entry.occurrences

print(nlargest(2, counts.items(), key=by_pages_then_occurrences))
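
To get output closer to the sentence in the question, one option is to keep every word that is tied for the highest page count. The following is only a rough sketch under assumptions: the sample `data` and the helper name `most_widespread_words` are made up for illustration, and it uses two `Counter` objects rather than the `Entry` class above.

```python
from collections import Counter

# Small stand-in for the real dictionary (hypothetical sample data).
data = {
    'page1': {'word1': 5, 'word2': 10, 'word3': 2},
    'page2': {'word2': 2, 'word3': 1},
}

def most_widespread_words(pages):
    """Return (page_count, {word: total_occurrences}) for the words found on the most pages."""
    page_counts = Counter()  # word -> number of pages containing it
    totals = Counter()       # word -> total occurrences over all pages
    for words in pages.values():
        for word, count in words.items():
            page_counts[word] += 1
            totals[word] += count
    if not page_counts:
        return 0, {}
    best = max(page_counts.values())
    # Keep every word tied for the highest page count.
    winners = {w: totals[w] for w, n in page_counts.items() if n == best}
    return best, winners

best, winners = most_widespread_words(data)
for word in sorted(winners):
    print('{} appears {} times on {} pages'.format(word, winners[word], best))
```

With the sample data, `word2` and `word3` are both on 2 pages, so both are reported; `word1` is dropped because it is on only one page.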

Source

2013-04-10 19:00:06 babbageclunk

That's exactly what I wanted, thanks – compscimaster 2013-04-10 19:43:46

It seems to me that you should use a Counter for this:

from collections import Counter
# Sum the per-page Counters into one Counter of total occurrences per word.
print(sum((Counter(x) for x in dic.values()), Counter()).most_common())

Or the multi-line version:

c = Counter()
for d in dic.values():
    c += Counter(d)

print(c.most_common())

To get the words that are common to all of the subdicts:

subdicts = iter(dic.values()) 
s = set(next(subdicts)).intersection(*subdicts)

Now you can use that set to filter the resulting Counter, removing the words that don't appear in every subdict:

# Build the Counter from a dict; an iterable of (k, v) pairs would be counted as tuples.
c = Counter({k: v for k, v in c.items() if k in s})
print(c.most_common())
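
Putting these steps together on a small input may make the result easier to check. This is a minimal sketch, assuming a two-page stand-in for `dic` (the sample data and the names `totals`, `common`, and `filtered` are illustrative only). It uses `Counter.update`, which accepts a plain dict and adds its counts.

```python
from collections import Counter

# Hypothetical sample data standing in for the real `dic`.
dic = {
    'page1': {'word1': 5, 'word2': 10, 'word3': 2},
    'page2': {'word2': 2, 'word3': 1},
}

# Total occurrences of each word across all pages.
totals = Counter()
for d in dic.values():
    totals.update(d)  # update() adds the counts from a plain dict

# Words that are present on every page.
subdicts = iter(dic.values())
common = set(next(subdicts)).intersection(*subdicts)

# Keep only the words that appear on every page.
filtered = Counter({k: v for k, v in totals.items() if k in common})
print(filtered.most_common())  # [('word2', 12), ('word3', 3)]
```

Note that this keeps only words found on every page, which is stricter than "the words on the most pages" when no word appears everywhere.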

Source

2013-04-10 18:27:51 mgilson

'from collections import Counter' – Kimvais 2013-04-10 18:28:42

I wonder why you can't add a regular dictionary to a Counter... – mgilson 2013-04-10 18:32:37

For the same reason you can't use dicts and lists as dictionary keys: mutability. You'd get a TypeError: unhashable type: 'dict', wouldn't you? – Kimvais 2013-04-10 18:41:47
