2014-02-18 89 views
2

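Count word frequency and make a dictionary from it

I want to take every word from a text file and count the word frequencies in a dictionary.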

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...} 

I have not got very far, and I can't see how to complete it. My code so far:

import sys 

argv = sys.argv[1] 
data = open(argv) 
words = data.read() 
data.close() 
wordfreq = {} 
for i in words: 
    #there should be a counter and somehow it must fill the dict. 
+1

Start here: http://docs.python.org/2/library/collections.html#counter-objects. You will also need to split your input to get the individual words and remove any punctuation, see: http://docs.python.org/2/library/stdtypes.html#string-methods – jonrsharpe

Answers

2

If you don't want to use collections.Counter, you can write your own function:

import sys 

filename = sys.argv[1] 
fp = open(filename) 
data = fp.read() 
words = data.split() 
fp.close() 

unwanted_chars = ".,-_"  # extend with any other characters to strip
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0
    wordfreq[word] += 1

For something better, look at regular expressions.
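Before reaching for regular expressions, the hand-written list of unwanted characters could be replaced by the standard library constant string.punctuation. The following is only a sketch: the function name count_words is made up for illustration, and it assumes punctuation only needs to be removed from the start and end of each word.

```python
import string


def count_words(text):
    """Count words in text, stripping leading/trailing punctuation from each."""
    wordfreq = {}
    for raw_word in text.split():
        word = raw_word.strip(string.punctuation)
        if not word:  # the token was pure punctuation, e.g. "--"
            continue
        wordfreq[word] = wordfreq.get(word, 0) + 1
    return wordfreq


print(count_words("this is the textfile, and it is used to take words and count"))
```

Note that strip() leaves punctuation inside a word (such as the apostrophe in "don't") untouched, which is usually what you want for this kind of count.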

2
>>> from collections import Counter
>>> t = 'this is the textfile, and it is used to take words and count'
>>> dict(Counter(t.split()))
{'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Or better, remove the punctuation before counting:

>>> dict(Counter(t.replace(',', '').replace('.', '').split()))
{'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
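For a whole file rather than a single string, a Counter can also be updated line by line. This is a rough sketch, not part of the original answer: the helper name count_file_words and the sample lines are invented for illustration, and a list of strings stands in for an open file object.

```python
import re
from collections import Counter


def count_file_words(lines):
    """Count lowercase words across an iterable of lines (e.g. an open file)."""
    counts = Counter()
    for line in lines:
        # Counter.update adds one count per element of the iterable
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


sample = ["This is the text file, and it is used",
          "to take words and count. And more."]
counts = count_file_words(sample)
print(counts.most_common(2))
```

With a real file, the same function could be called as count_file_words(f) inside a with open(...) as f: block, since a file object yields its lines when iterated.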
1

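The following splits the string into a list with split(), loops over the list with a for loop, and uses Python's count() method to count the frequency of each item in the sentence. Each word i and its frequency are appended as a tuple to an initially empty list ls, which is then converted into key-value pairs with dict().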

sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []
for i in sentence:
    word_count = sentence.count(i)  # list.count() counts the occurrences of i
    ls.append((i, word_count))

dict_ = dict(ls)

print(dict_)

Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

+4

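This is rather inefficient, because it traverses the whole list again for every word instead of traversing it once. The complexity will be O(n²) instead of O(n). – Michael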
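To make the complexity remark concrete, here is a small sketch that puts the two strategies side by side. The function names freq_quadratic and freq_linear are invented for this illustration; both return the same dictionary, but the first rescans the list for every word.

```python
def freq_quadratic(words):
    # list.count() scans the whole list once per word: O(n^2) overall
    return dict((w, words.count(w)) for w in words)


def freq_linear(words):
    # one dictionary update per word: O(n) overall
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq


words = 'this is the textfile, and it is used to take words and count'.split()
print(freq_quadratic(words) == freq_linear(words))
```

For a sentence of a dozen words the difference is invisible, but for a large file the single-pass version should scale far better.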

4

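Although using Counter from the collections library, as @Michael suggested, is the better approach, I am adding this answer just to improve your code (I believe this will be the answer for a new Python learner):

From the comment in your code it looks like you want to improve your own code, and I think you are able to read the file content into words (although I usually avoid the read() function and use code of the form for line in file_descriptor:).

Since words is a string, in the loop for i in words: the loop variable i is not a word but a character. You are iterating over the characters of the string, not over the words in the string words. To understand this, look at the following code snippet: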

>>> for i in "Hi, h r u?": 
...     print(i)
... 
H 
i 
, 

h 

r 

u 
? 
>>> 

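Because iterating over the string character by character, rather than word by word, is not what you want, you should use the split method of Python's string class to iterate over the words. The str.split([sep[, maxsplit]]) method returns a list of all the words in the string, using sep as the separator (splitting on runs of whitespace if it is not specified) and optionally limiting the number of splits to maxsplit.

Look at the code examples below.

Split: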

>>> "Hi, how are you?".split() 
['Hi,', 'how', 'are', 'you?'] 

Loop with split:

>>> for i in "Hi, how are you?".split(): 
...     print(i)
... 
Hi, 
how 
are 
you? 

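This looks like what you need, except for the word Hi,: because split() splits on whitespace by default, Hi, is kept as a single string, which is obviously not what you want when counting the frequency of words in a file.

A good solution would be to use regular expressions, but to keep the answer simple I will first answer with the replace() method. The method str.replace(old, new[, count]) returns a copy of the string in which occurrences of old have been replaced by new, optionally limiting the number of replacements to count.

Now check the code examples below for what I want to suggest: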

>>> "Hi, how are you?".split() 
['Hi,', 'how', 'are', 'you?'] # it has , with Hi 
>>> "Hi, how are you?".replace(',', ' ').split() 
['Hi', 'how', 'are', 'you?'] # , replaced by space then split 

Now the loop:

>>> for word in "Hi, how are you?".replace(',', ' ').split(): 
...     print(word)
... 
Hi 
how 
are 
you? 

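How to count the frequency:

One way is to use Counter, as Michael suggested, but to follow your own approach, in which you want to start from an empty dictionary, do something like this: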

words = f.read() 
wordfreq = {} 
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1 
    #    ^^ add 1 to 0 or old value from dict 

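What am I doing? Because wordfreq is initially empty, you cannot add to wordfreq[word] the first time a word appears (it would raise a KeyError). So I used the dict method setdefault.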

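dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key] = default if the key is not already in the dict. So the first time a new word appears, setdefault stores it in the dictionary with the value 0; I then add 1 and assign the result back to the same dictionary entry.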
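To compare setdefault with two common alternatives, here is a minimal sketch. The variable names are chosen only for this illustration; dict.get and collections.defaultdict are standard Python, and all three loops are expected to produce the same counts.

```python
from collections import defaultdict

words = "this is the textfile and it is used to take words and count".split()

# 1. setdefault: inserts 0 for a missing key, then returns the stored value
freq_setdefault = {}
for word in words:
    freq_setdefault[word] = freq_setdefault.setdefault(word, 0) + 1

# 2. get: returns 0 for a missing key without inserting anything
freq_get = {}
for word in words:
    freq_get[word] = freq_get.get(word, 0) + 1

# 3. defaultdict(int): a missing key is created with int() == 0 on access
freq_default = defaultdict(int)
for word in words:
    freq_default[word] += 1

print(freq_setdefault == freq_get == dict(freq_default))
```

Which one to pick is mostly a matter of taste; get() avoids the extra write that setdefault performs, and defaultdict keeps the loop body shortest.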

I have written equivalent code using with open instead of a separate open and close.

import os

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/Desktop/file')) as f:
    words = f.read()

wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)

It runs like this:

$ cat file # file is 
this is the textfile, and it is used to take words and count 
$ python work.py # indented manually 
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 
'it': 1, 'to': 1, 'take': 1, 'words': 1, 
'the': 1, 'textfile': 1} 

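Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop to for i in re.split(r"[,\s]+", words): and it should produce the correct output.

Edit: it is better to find all alphanumeric characters, because you may have more than one kind of punctuation symbol.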

>>> re.findall(r'[\w]+', words) # manually indent output 
['this', 'is', 'the', 'textfile', 'and', 
    'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count'] 

Use it in the for loop as: for word in re.findall(r'[\w]+', words):
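The difference between the two regular expression approaches can be seen on a string with several kinds of punctuation. This is a small sketch with a made-up sample string; the patterns are the ones discussed above, with [\w]+ written in the equivalent form \w+.

```python
import re

text = "Hi, how are you? Fine... thanks!"

# re.split on commas/whitespace leaves other punctuation attached to words
split_words = re.split(r"[,\s]+", text)

# re.findall keeps only runs of word characters (letters, digits, underscore)
found_words = re.findall(r"\w+", text)

print(split_words)  # ['Hi', 'how', 'are', 'you?', 'Fine...', 'thanks!']
print(found_words)  # ['Hi', 'how', 'are', 'you', 'Fine', 'thanks']
```

One caveat worth keeping in mind: \w+ also splits contractions, so "don't" would be counted as the two words don and t.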

How I would write the code without using read():

The file is:

$ cat file 
This is the text file, and it is used to take words and count. And multiple 
Lines can be present in this file. 
It is also possible that Same words repeated in with capital letters. 

The code is:

$ cat work.py 
import re 
wordfreq = {} 
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1

print(wordfreq)

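lower() is used to convert uppercase letters to lowercase, so that the same word with different capitalization is counted once.

Output: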

$python work.py # manually strip output 
{'and': 3, 'letters': 1, 'text': 1, 'is': 3, 
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1, 
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1, 
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1, 
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2, 
'lines': 1, 'can': 1, 'the': 1} 
0
sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words
# and iterate through every word

counter_dict = {}
for word in sentence.lower().split():
    # add the word to counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1
    counter_dict[word] += 1