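Print the 10 least frequent words in a text document using Python

I have a small Python script that prints the 10 most frequently used words in a text document (each word being two letters or more), and I need to extend it to also print the 10 least frequent words in the document. I have a script that mostly works, except that the 10 least frequent words it prints are numbers (integers and floats) when they should be words. How do I iterate over the words and exclude the numbers? Here is my full script: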
# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
    if len(word) >= 2:
        counter[word] += 1

top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
Edit: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.
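
One possible way to skip the numeric tokens is to test each stripped token against a number pattern before counting it. The sketch below is written for Python 3 (the script above is Python 2), and the names `NUMBER_RE`, `is_number`, `count_words`, `most_frequent` and `least_frequent` are made up for illustration. It assumes that "number" means plain integers and decimals, optionally signed and with comma thousands separators; other formats such as `1e5` would need a wider pattern.

```python
import re
from collections import Counter
from string import punctuation

# Assumption: a "number" is an optionally signed integer or decimal,
# possibly with comma separators (e.g. 42, 3.14, 1,000).
NUMBER_RE = re.compile(r"[+-]?\d[\d,]*(?:\.\d+)?")


def is_number(token):
    """Return True if the whole token looks like an integer or decimal."""
    return NUMBER_RE.fullmatch(token) is not None


def count_words(text, min_length=2):
    """Count lowercased words of at least min_length characters, skipping numbers."""
    tokens = (raw.strip(punctuation).lower() for raw in text.split())
    return Counter(
        token for token in tokens
        if len(token) >= min_length and not is_number(token)
    )


def most_frequent(counter, number=10):
    """Return the most common (word, count) pairs, ties broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:number]


def least_frequent(counter, number=10):
    """Return the rarest (word, count) pairs, ties broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]


if __name__ == "__main__":
    # Sample text stands in for the contents of charactermask.txt.
    sample = "The cat and the dog. The cat sat 42 times; 3.14 is pi, a number."
    counts = count_words(sample)
    for word, frequency in least_frequent(counts, 3):
        print("%s: %d" % (word, frequency))
```

To run it on the real file, read the text with `open("charactermask.txt").read()` and pass it to `count_words`. Filtering while counting, rather than after sorting, means both the most frequent and the least frequent lists come from the same number-free counter.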
This works too. Thanks for the thorough answer.