2011-11-15

I'd like to know the best way to count the words in a document. I have my own corpus set up as "corp.txt", and I want to know how frequently the words "students", "trust", and "ayre" occur in that file. What can I use?

Would it be something like the following:

>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray


Neither of those comes from the standard Python library. Are you sure you aren't thinking of NLTK? –


Looking at your name, I'll assume you know what "students trust ayre" means. Anyway, I'd go with `FreqDist`: `fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())`. You can check the documentation [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi


I've edited the answer, please check it again. Thanks –

Answers


Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just add 1 to its value:

from collections import defaultdict

total = 0
count = defaultdict(int)   # missing words default to 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by the total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))

You mean `defaultdict(int)` – `defaultdict` requires a callable. – kindall
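To illustrate the point of the comment above, `defaultdict` requires a callable factory; passing a plain value raises a `TypeError`. A small stdlib demo (not from the original thread):

```python
from collections import defaultdict

# A callable factory works: int() supplies 0 for missing keys.
counts = defaultdict(int)
counts["students"] += 1
print(counts["students"])   # 1
print(counts["trust"])      # 0 (entry created on first access)

# A non-callable default is rejected immediately.
try:
    bad = defaultdict(0)
except TypeError as e:
    print("TypeError:", e)
```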


Ah yes, thanks. –


@Chris how about using `Counter`? – alvas


You're almost there! You can index the FreqDist with the word you're interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

That gives you the count, i.e. the number of occurrences, of each word. You said "how frequently" – frequency is different from the number of occurrences – and you can get it like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
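For context, `FreqDist.freq(w)` is just the raw count divided by the total number of samples, `fdist.N()`. The same arithmetic with the standard library's `Counter` (a sketch with made-up tokens, not the asker's corpus):

```python
from collections import Counter

# Made-up token list standing in for a real corpus.
tokens = ["students", "trust", "ayre", "students"]
counts = Counter(tokens)
total = sum(counts.values())   # what FreqDist calls N()

print(counts["students"])           # 2   (raw count, like fdist['students'])
print(counts["students"] / total)   # 0.5 (relative frequency, like fdist.freq('students'))
```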

I would suggest looking into collections.Counter. Especially for large amounts of text, it does the trick and is limited only by the available memory. It counted 3 billion tokens in a day and a half on a machine with 12 GB of RAM. Pseudocode (the variable words will in practice be some reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in words:
    my_counter[word] += 1   # update(word) on a string would count characters instead

Once it is done, the words are in the dictionary my_counter, which can then be written to disk or stored elsewhere (SQLite, for example).
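A minimal sketch of the SQLite option mentioned above, using the standard library's `sqlite3` (the table and column names are my own invention):

```python
import sqlite3
from collections import Counter

my_counter = Counter(["blah", "blah", "foo"])

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, count INTEGER)")
conn.executemany(
    "INSERT INTO word_counts VALUES (?, ?)",
    my_counter.items(),
)
conn.commit()

row = conn.execute(
    "SELECT count FROM word_counts WHERE word = ?", ("blah",)
).fetchone()
print(row[0])  # 2
conn.close()
```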


You can read a file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1

print("'blah' occurred", fdist['blah'], "times")

[out]:

'blah' occurred 3 times 

Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
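Beyond single-word lookups, `Counter` (and NLTK 3's `FreqDist`, which subclasses it) can also rank tokens with `most_common`. A stdlib-only sketch:

```python
from collections import Counter

# Pre-tokenized, lowercased stand-in for word_tokenize output.
tokens = "this is a blah blah foo bar black sheep sentence . blah blah !".split()
fdist = Counter(tokens)

print(fdist["blah"])          # 4
print(fdist.most_common(1))   # [('blah', 4)]
```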