Print the 10 least frequent words in a text document using Python

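I have a small Python script that prints the 10 most frequent words in a text document (each word being 2 letters or more), and I need to extend the script so it also prints the 10 least frequent words in the document. I have a script that mostly works, except that the 10 least frequent words it prints are numbers (integers and floats) when they should be words. How do I iterate over the words and exclude the numbers? Here is my full script: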

# Most Frequent Words: 
from string import punctuation 
from collections import defaultdict 

number = 10 
words = {} 

with open("charactermask.txt") as txt_file: 
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()] 

counter = defaultdict(int) 

for word in words:
    if len(word) >= 2:
        counter[word] += 1

# Sort by descending count, then alphabetically
top_words = sorted(counter.items(),
        key=lambda item: (-item[1], item[0]))[:number]

for word, frequency in top_words:
    print("%s: %d" % (word, frequency))


# Least Frequent Words: 
# Sort by ascending count, then alphabetically
least_words = sorted(counter.items(),
        key=lambda item: (item[1], item[0]))[:number]

for word, frequency in least_words:
    print("%s: %d" % (word, frequency))

Edit: the end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.

Source

2012-09-17 Ty Bailey

You are going to need a filter. Change the regular expression to match however you want to define a "word":

import re 
alphaonly = re.compile(r"^[a-z]{2,}$")
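
To see what this pattern treats as a "word", here is a small sketch run against a few made-up tokens (the sample list is an assumption for illustration, not taken from charactermask.txt). It suggests the pattern rejects not only numbers but also single letters, contractions, and anything with non-ASCII letters:

```python
import re

# Same pattern as above: two or more lowercase ASCII letters and nothing else.
alphaonly = re.compile(r"^[a-z]{2,}$")

# Hypothetical sample tokens, already stripped of punctuation and lowercased.
samples = ["mask", "a", "42", "3.14", "don't", "x86", "café", "character"]

# Keep only the tokens the pattern accepts.
accepted = [token for token in samples if alphaonly.match(token)]
print(accepted)
```

If contractions or accented words should count as words, the character class would need to be widened accordingly.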

Now, do you want the frequency table to exclude numbers in the first place?

counter = defaultdict(int)

with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1
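
For anyone who wants to try this without the input file, here is a self-contained sketch of the same counting loop. The `count_words` helper and the in-memory sample text are assumptions made for illustration; they stand in for charactermask.txt.

```python
import io
import re
from collections import defaultdict
from string import punctuation

alphaonly = re.compile(r"^[a-z]{2,}$")

def count_words(lines):
    """Count tokens made up solely of two or more ASCII letters."""
    counter = defaultdict(int)
    for line in lines:
        for word in line.split():
            # Strip surrounding punctuation, then normalise case.
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1
    return counter

# Hypothetical stand-in for charactermask.txt.
sample = io.StringIO("The mask has 42 rows.\nThe mask is 3.14 wide, the end.\n")
counter = count_words(sample)
print(dict(counter))
```

Because numeric tokens never enter the table, both the most frequent and the least frequent listings from the question can be left as they are.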

Or do you just want to skip over the numbers when extracting the least frequent words from the table?

import sys

words_by_freq = sorted(counter.items(),
                       key=lambda item: (item[1], item[0]))

i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        # Only real words count toward the limit
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
    if i == number:
        break
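
The manual counter in that loop can also be written with a generator and `itertools.islice`. This is only an alternative sketch, not part of the original answer; the `least_frequent` helper and the sample counts are hypothetical.

```python
import re
from itertools import islice

alphaonly = re.compile(r"^[a-z]{2,}$")

def least_frequent(counter, number):
    """Return up to `number` (word, count) pairs, rarest first, skipping non-words."""
    by_freq = sorted(counter.items(), key=lambda item: (item[1], item[0]))
    # Lazily drop numeric tokens, then stop after `number` matches.
    only_words = (pair for pair in by_freq if alphaonly.match(pair[0]))
    return list(islice(only_words, number))

# Hypothetical counts that still include numeric tokens.
counts = {"the": 3, "mask": 2, "42": 1, "3.14": 1, "rows": 1, "end": 1}
result = least_frequent(counts, 3)
for word, frequency in result:
    print("{}: {}".format(word, frequency))
```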

Source

2012-09-17 02:08:32 zwol

This works too. Thanks for the thorough answer. –

You need a function letters_only() that runs a regular expression matching [0-9] and returns False if it finds any match. Something like this:

def letters_only(word): 
    return re.search(r'[0-9]', word) is None

Then, where you say for word in words, say for word in filter(letters_only, words) instead.
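
Put together with the counting loop from the question, that might look like the sketch below. The token list is made up for illustration and is assumed to be already stripped and lowercased.

```python
import re
from collections import defaultdict

def letters_only(word):
    # True when the token contains no digit at all.
    return re.search(r'[0-9]', word) is None

# Hypothetical tokens standing in for the contents of the text file.
words = ["the", "mask", "42", "3.14", "don't", "x86", "the"]

counter = defaultdict(int)
for word in filter(letters_only, words):
    if len(word) >= 2:
        counter[word] += 1

print(dict(counter))
```

Note that this only rejects tokens containing digits, so something like "don't" is still counted, whereas a strict letters-only pattern would drop it.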

Source

2012-09-17 02:01:55 syrion

Awesome, your answer works perfectly! –

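Great. I also changed my answer to the short form that wim suggested; I think the longer form is clearer, but maybe that is just a code tic I need to exorcise. :) A bit puzzled by the downvote, but so it goes. – syrion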
