You can read a file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        # NLTK 2.x also offered fdist.inc(word); that method was removed in NLTK 3.
        fdist[word] += 1

# This print form works on both Python 2 and Python 3.
print("'blah' occurred %d times" % fdist['blah'])
[out]:
'blah' occurred 3 times
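Alternatively, you can use the native Counter object from collections, which gives you the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase the tokens: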
from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object, lowercasing the text before tokenizing.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

# This print form works on both Python 2 and Python 3.
print("'blah' occurred %d times" % fdist['blah'])
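To make the case-sensitivity point concrete, here is a minimal sketch that compares counts with and without lowercasing. It uses only the standard library: `simple_tokenize` is a hypothetical regex-based stand-in for `word_tokenize` (not part of NLTK), so its tokens may differ from NLTK's on more complex text.

```python
import re
from collections import Counter

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize (an assumption for this sketch):
    # returns runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

# Keys are case-sensitive, so 'blah' and 'Blah' are counted separately.
raw_counts = Counter(simple_tokenize(doc))

# Lowercasing the text first merges both spellings under one key.
lower_counts = Counter(simple_tokenize(doc.lower()))

print("case-sensitive: blah=%d, Blah=%d" % (raw_counts['blah'], raw_counts['Blah']))
print("lowercased: blah=%d" % lower_counts['blah'])
```

With the sample sentence, the case-sensitive count for 'blah' is 3 (plus 1 for 'Blah'), while the lowercased count is 4.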
Neither of those is provided by the standard Python library. Are you sure you aren't thinking of NLTK? –
Looking at your name, I'll pretend you know what "students trust Al" means. Anyway, I would go with `FreqDist`: `fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())`. You can check the documentation [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi
I've edited the answer; please double-check it. Thanks –