2012-09-17 112 views
0

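Print the 10 least frequent words in a text document using Python

I have a small Python script that prints the 10 most frequent words in a text document (each word being 2 letters or more), and I need to extend the script to print the 10 least frequent words in the document as well. I have a script that mostly works, except that the 10 least frequent words it prints are numbers (integers and floats) when they should be words. How can I iterate over only the words and exclude the numbers? Here is my full script: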

# Most Frequent Words: 
from string import punctuation 
from collections import defaultdict 

number = 10 
words = {} 

with open("charactermask.txt") as txt_file: 
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()] 

counter = defaultdict(int) 

for word in words: 
    if len(word) >= 2: 
        counter[word] += 1

top_words = sorted(counter.iteritems(), 
        key=lambda(word, count): (-count, word))[:number] 

for word, frequency in top_words: 
    print "%s: %d" % (word, frequency) 


# Least Frequent Words: 
least_words = sorted(counter.iteritems(), 
        key=lambda (word, count): (count, word))[:number] 

for word, frequency in least_words: 
    print "%s: %d" % (word, frequency) 

Edit: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.

Answers

1

You are going to need a filter; change the regex to match however you want to define a "word":

import re 
alphaonly = re.compile(r"^[a-z]{2,}$") 
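
To make the effect of this pattern concrete, here is a small sketch in Python 3 syntax. The `is_word` helper and the sample tokens are illustrative assumptions, not part of the original answer. Because the pattern is anchored at both ends, it rejects not only numbers but also any token containing an apostrophe, hyphen, or digit.

```python
import re
from string import punctuation

# Same pattern as above: the whole token must be two or more lowercase ASCII letters.
alphaonly = re.compile(r"^[a-z]{2,}$")


def is_word(token):
    """Hypothetical helper: clean a token as the script does, then test it."""
    cleaned = token.strip(punctuation).lower()
    return alphaonly.match(cleaned) is not None


# Hypothetical sample tokens, not taken from charactermask.txt.
samples = ["Hello,", "42", "3.14", "a", "don't", "mask2"]
accepted = [t for t in samples if is_word(t)]
print(accepted)
```

If contractions such as "don't" should count as words, the character class would need to be widened, for example to `[a-z']`.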

Now, do you want the frequency table to leave out numbers in the first place?

counter = defaultdict(int) 

with open("charactermask.txt") as txt_file: 
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1

Or do you just want to skip over the numbers when extracting the least frequent words from the table?

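import sys

# Sort ascending by count, breaking ties alphabetically.
words_by_freq = sorted(counter.iteritems(),
                       key=lambda(word, count): (count, word))

# Walk from rarest to most common, printing only tokens that match the pattern.
i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
    if i == number:
        break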
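
The snippets in this answer use Python 2 constructs (`iteritems` and a tuple-unpacking lambda), which no longer exist in Python 3. As a rough sketch only, the first variant (filtering while counting) might look like this in Python 3. The `least_frequent` function, its `text` and `number` parameters, and the sample string are assumptions made for illustration. `collections.Counter` is swapped in for the `defaultdict(int)` used above.

```python
import re
from collections import Counter
from string import punctuation

alphaonly = re.compile(r"^[a-z]{2,}$")


def least_frequent(text, number=10):
    """Return up to `number` (word, count) pairs, rarest first, ties alphabetical."""
    tokens = (t.strip(punctuation).lower() for t in text.split())
    # Only tokens matching the pattern are counted, so numbers never enter the table.
    counter = Counter(t for t in tokens if alphaonly.match(t))
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]


# Hypothetical sample text standing in for the contents of charactermask.txt.
sample = "the cat and the dog and the 42 birds, 3.14 times"
for word, frequency in least_frequent(sample, number=3):
    print("{}: {}".format(word, frequency))
```

With a real file, the text would come from `open("charactermask.txt").read()` as in the question.
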
+0

This works as well. Thank you for the thorough answer. –

1

You need a function, letters_only(), that runs a regular expression matching [0-9] and returns False if any match is found. Something like this:

import re

def letters_only(word):
    return re.search(r'[0-9]', word) is None

Then, wherever you say for word in words, say for word in filter(letters_only, words) instead.
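
A quick sketch of how that filter behaves, written in Python 3 syntax. The word list is a hypothetical example of tokens after `strip(punctuation).lower()`, not data from the question. Note that this check only rejects tokens containing a digit, so a contraction such as "don't" is kept.

```python
import re


def letters_only(word):
    # True when the word contains no digit at all.
    return re.search(r"[0-9]", word) is None


# Hypothetical word list, not taken from charactermask.txt.
words = ["mask", "42", "3.14", "mask2", "don't", "it"]
kept = list(filter(letters_only, words))
print(kept)
```

In Python 3, `filter` returns a lazy iterator rather than a list, which is fine inside a `for` loop; the `list(...)` call here is only for printing.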

+0

Awesome, your answer works flawlessly! –

+0

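Great. I also changed my answer to the short form wim proposed; I think the longer form is clearer, but maybe that's just a coding tic I need to exorcise. :) A bit puzzled by the downvote, but so it goes. – syrion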
