Python NLTK - Counting word occurrences in the Brown corpus for the top results returned by tag

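I am trying to return the top-occurring values for a specific tag from the corpus. I can return the tags and the words themselves fine, but I cannot get the counts returned alongside the results.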

import itertools
import collections
import nltk
from nltk.corpus import brown

words = brown.words()

def findtags(tag_prefix, tagged_text):
    # Condition on the tag, so cfd[tag] is a frequency distribution over words
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # Relies on NLTK 2 under Python 2, where FreqDist.keys() is ordered by decreasing frequency
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

tagdictNNS = findtags('NNS', nltk.corpus.brown.tagged_words())

The following returns fine:

for tag in sorted(tagdictNNS): 
    print tag, tagdictNNS[tag]

I have managed to return a count for each NN-based word using this:

pluralLists = tagdictNNS.values() 
pluralList = list(itertools.chain(*pluralLists)) 
for s in pluralList: 
    sincident = words.count(s) 
    print s 
    print sincident

This returns everything.

Is there a better way to insert the number of occurrences into the dictionary tagdictNNS[tag]?

Edit 1:

pluralLists = tagdictNNS.values()[:5] 
pluralList = list(itertools.chain(*pluralLists))

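This makes the for s loop return them in order of size. It is still not the right approach.

Edit 2: Updated the dictionary so that it actually searches for NNS plurals.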

2012-11-12 AnOnion

Check out Counter in Python's collections module. http://docs.python.org/2/library/collections.html – MercuryRising
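One way to read that suggestion, as a minimal sketch rather than the commenter's own code: keep one Counter per tag and ask it for its most common entries, so each word comes back paired with its count. The function name top_words_by_tag and the tiny sample list are hypothetical, and the sample only stands in for brown.tagged_words() so the snippet runs without downloading the corpus. It is written for Python 3.

```python
from collections import Counter, defaultdict

def top_words_by_tag(tagged_text, tag_prefix, n=5):
    """Map each tag starting with tag_prefix to its n most common (word, count) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    return {tag: counter.most_common(n) for tag, counter in counts.items()}

# Hypothetical hand-made sample standing in for brown.tagged_words()
sample = [
    ("years", "NNS"), ("people", "NNS"), ("years", "NNS"),
    ("men", "NNS"), ("years", "NNS"), ("people", "NNS"),
    ("Nations", "NNS-TL"), ("States", "NNS-TL"), ("States", "NNS-TL"),
    ("year", "NN"),  # skipped: "NN" does not start with "NNS"
]

top = top_words_by_tag(sample, "NNS")
for tag in sorted(top):
    print(tag, top[tag])
```

Passing brown.tagged_words() instead of the sample should give the same shape of result for the real corpus. Note that these counts are per tag, whereas words.count(s) in the question counts every occurrence of the word regardless of its tag, and rescans the whole corpus for each word.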

I may not have understood the question, but consider your tagdictNNS:

>>> new = {}
>>> for k, v in tagdictNNS.items():
...     new[k] = len(tagdictNNS[k])
...
>>> new
{'NNS$-TL-HL': 1, 'NNS-HL': 5, 'NNS$-HL': 4, 'NNS-TL': 5, 'NNS-TL-HL': 5, 'NNS+MD': 2, 'NNS$-NC': 1, 'NNS-TL-NC': 1, 'NNS$-TL': 5, 'NNS': 5, 'NNS$': 5, 'NNS-NC': 5}

Then you can do this:

>>> from operator import itemgetter
>>> sorted(new.items(), key=itemgetter(1), reverse=True)[:2]
[('NNS-HL', 5), ('NNS-TL', 5)]
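Several tags tie at 5 above, so which two come out first depends on dictionary ordering. A small sketch of the same ranking with ties broken alphabetically, assuming Python 3 and a hypothetical tag_words mapping shaped like tagdictNNS (the words in it are made up for illustration):

```python
from operator import itemgetter

# Hypothetical tag -> word-list mapping, shaped like tagdictNNS
tag_words = {
    "NNS-TL": ["States", "Nations"],
    "NNS": ["years", "people", "men"],
    "NNS$": ["children's"],
    "NNS-HL": ["members", "pupils"],
}

# Number of stored words per tag (not the number of corpus occurrences)
sizes = {tag: len(words) for tag, words in tag_words.items()}

# Sort alphabetically first; the second sort is stable, so tags of equal
# size keep that alphabetical order
ranked = sorted(sorted(sizes.items()), key=itemgetter(1), reverse=True)
print(ranked[:2])
```

As in the answer, this ranks tags by how many words are stored for each one, which is capped by the [:5] slice in findtags, rather than by how often those words occur in the corpus.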

2012-11-15 05:09:25 verbsintransit

That looks close, but I went with the approach of adding [:5]. Given the time, though, I would use this method. – AnOnion
