2011-11-15 115 views
1

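How to count words in a document

I'd like to know the best way to count words in a document. Suppose I have my own corpus set up from "corp.txt" and I want to know how frequently "students, trust, ayre" occur in the file "corp.txt". What could I use?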

Would it be one of the following?

.... 
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist= FreqDist(full) 
>>> fdist 
<FreqDist with 34133 outcomes> 
// HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS 
"students, trust, ayre" occur in full. 

Thanks, Ray

+1

Neither of those is provided by the standard Python library. Are you sure you aren't thinking of NLTK? –

+0

Looking at your name, I'll pretend you know what "students trust ayre" means. Anyway, I'd go with 'FreqDist': 'fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())'. You can check the documentation [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi

+0

I edited the answer; please double-check it. Thanks –

Answers

3

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its value by one:

from collections import defaultdict

# `words` is assumed to be an iterable of tokens from the document
total = 0
count = defaultdict(lambda: 0)
for word in words:
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct)/float(total)))
+0

You mean 'defaultdict(int)'; 'defaultdict' needs a callable. – kindall

+0

Ah yes, thanks. –

+0

@Chris how about using 'Counter'? – alvas

2

You're almost there! You can index the FreqDist with the word you're interested in. Try the following:

print(fdist['students'])
print(fdist['trust'])
print(fdist['ayre'])

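This gives you the count, that is, the number of times each word occurs. You said "how frequently"; frequency is not the same as the number of occurrences, and you can get it like this: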

print(fdist.freq('students'))
print(fdist.freq('trust'))
print(fdist.freq('ayre'))
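
To make the difference between a count and a frequency concrete, here is a minimal sketch that uses only the standard library instead of NLTK. The sample sentence and the helper name `relative_freq` are made up for illustration; the ratio it computes (a word's count divided by the total number of tokens) is what `FreqDist.freq` is documented to return.

```python
from collections import Counter

# Hypothetical sample text; in practice the tokens would come from corp.txt.
tokens = "students trust ayre and ayre trusts students who trust students".split()

counts = Counter(tokens)
total = sum(counts.values())  # total number of tokens, analogous to fdist.N()

def relative_freq(word):
    """Share of all tokens that are `word` (0.0 if there are no tokens)."""
    return counts[word] / total if total else 0.0

for w in ("students", "trust", "ayre"):
    # Prints the raw count next to the relative frequency.
    print(w, counts[w], round(relative_freq(w), 3))
```

A word that never occurs simply has a count of 0 and a frequency of 0.0, so no special handling is needed for missing keys.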
3

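I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 3 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (in practice the variable Words would be a reference to a file or similar):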

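from collections import Counter

# `Words` stands for any iterable of tokens (in practice, read from a file)
my_counter = Counter()
for word in Words:
    # update() expects an iterable; passing a bare string would count characters
    my_counter[word] += 1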

When it finishes, the words are in the dictionary my_counter, which can then be written to disk or stored elsewhere (SQLite, for example).
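
As a rough sketch of how the pseudocode above might look against a real file, the following reads the corpus line by line, so only one line of text is held in memory at a time, and then writes the counts to disk as JSON. The file names, the `count_tokens` helper, and the whitespace tokenization are all assumptions made for the example; a real corpus would probably need a proper tokenizer.

```python
import json
import os
import tempfile
from collections import Counter

def count_tokens(path):
    """Count whitespace-separated tokens without loading the whole file."""
    counter = Counter()
    with open(path, encoding="utf-8") as fin:
        for line in fin:                  # one line in memory at a time
            counter.update(line.split())  # update() takes an iterable of tokens
    return counter

# Hypothetical demo files in a temporary directory.
workdir = tempfile.mkdtemp()
corpus_path = os.path.join(workdir, "corp.txt")
with open(corpus_path, "w", encoding="utf-8") as fout:
    fout.write("students trust ayre\nstudents trust students\n")

my_counter = count_tokens(corpus_path)

# Persist the counts; a Counter is a dict subclass, so json can serialize it.
counts_path = os.path.join(workdir, "counts.json")
with open(counts_path, "w", encoding="utf-8") as fout:
    json.dump(my_counter, fout)

# Load the counts back into a Counter to confirm the round trip.
with open(counts_path, encoding="utf-8") as fin:
    restored = Counter(json.load(fin))

print(restored.most_common(2))
```

JSON is used here only because it needs no setup; the same dictionary could be stored in SQLite or another store instead.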

0

You can read a file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist 
from nltk import word_tokenize 

# Creates a test file for reading. 
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!" 
with open('test.txt', 'w') as fout: 
    fout.write(doc) 

# Reads a file into FreqDist object. 
fdist = FreqDist() 
with open('test.txt', 'r') as fin: 
    for word in word_tokenize(fin.read()): 
        fdist[word] += 1  # fdist.inc(word) was removed in NLTK 3

print("'blah' occurred %d times" % fdist['blah'])

[out]:

'blah' occurred 3 times 

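Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to convert your tokens to lowercase: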

from collections import Counter 
from nltk import word_tokenize 

# Creates a test file for reading. 
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!" 
with open('test.txt', 'w') as fout: 
    fout.write(doc) 

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred %d times" % fdist['blah'])
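
The effect of case sensitivity on the counts can be seen in a small side-by-side sketch. To keep it free of external dependencies, it replaces `word_tokenize` with a simplified regular-expression tokenizer (`simple_tokens`, a name invented here), so punctuation handling will differ from NLTK's.

```python
import re
from collections import Counter

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"

def simple_tokens(text):
    """Simplified stand-in for word_tokenize: runs of word characters only."""
    return re.findall(r"\w+", text)

# Keys are case-sensitive, so 'blah' and 'Blah' are counted separately.
case_sensitive = Counter(simple_tokens(doc))

# Lowercasing the text first merges both spellings into one key.
case_folded = Counter(simple_tokens(doc.lower()))

print(case_sensitive["blah"], case_sensitive["Blah"], case_folded["blah"])
```

Whether to lowercase depends on the corpus: folding case merges sentence-initial words with their lowercase forms, but it also merges proper nouns with common words that share a spelling.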