2013-06-19

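I need to count the frequency of every key of the dictionary "d" across all the files in the folder "individual-articles". The folder holds about 20,000 txt files named 1, 2, 3, 4, ... For example, if d[Britain] = [5, 76, 289], I must return the number of times Britain appears in the files 5.txt, 76.txt and 289.txt in "individual-articles", and I also need its frequency across all the files in that folder. How do I iterate over all the keys of a dictionary in Python?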

import collections
import sys
import os
import re
from collections import Counter
from glob import glob

# Redirect everything printed below into dictionary.txt.
sys.stdout = open('dictionary.txt', 'w')

folderpath = 'd:/individual-articles'
counter = Counter()

filepaths = glob(os.path.join(folderpath, '*.txt'))

def words_generator(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word

word_count_dict = {}
for file in filepaths:
    f = open(file, "r")
    words = words_generator(f)
    for word in words:
        if word not in word_count_dict:
            word_count_dict[word] = {"total": 0}
        if file not in word_count_dict[word]:
            word_count_dict[word][file] = 0
        word_count_dict[word][file] += 1
        word_count_dict[word]["total"] += 1

for k in word_count_dict.keys():
    for filename in word_count_dict[k]:
        if filename == 'total': continue
        counter.update(filename)

for k in word_count_dict.keys():
    for count in counter.most_common():
        print('{} {}'.format(word_count_dict[k], count))

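How can I find the frequency of Britain only in those files that are listed in the dictionary's value for that key?

I need to store these values in another dictionary, D2. For the same example, D2 must contain

(Britain, 26, 1200) (Spain, 52, 6795) (France, 45, 568)

where 26 is the frequency of the word Britain in the files 5.txt, 76.txt and 289.txt, and 1200 is the frequency of the word Britain across all files. The same applies to Spain and France.

I am using Counter here, and I think that is the flaw, because everything works so far except my final loop!

I am a Python newbie and I have tried a bit! Please help!!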

Answers

word_count_dict["Britain"] is a regular dictionary. Just loop over it:

for filename in word_count_dict["Britain"]: 
    if filename == 'total': continue 
    print("Britain appears in {} {} times".format(filename, word_count_dict["Britain"][filename])) 

Or retrieve all the keys:

word_count_dict["Britain"].keys() 

Note that the dictionary also contains a special key, total.
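
Because total sits next to the file names, any per-file view has to skip it. A minimal sketch of that filtering, using made-up sample data in the same shape as word_count_dict (the counts and file names here are illustrative only):

```python
# Hypothetical sample data in the shape built by the question's code.
word_count_dict = {
    "Britain": {"total": 3, "5.txt": 2, "76.txt": 1},
}

# Keep only the per-file entries by skipping the special "total" key.
per_file = {
    name: count
    for name, count in word_count_dict["Britain"].items()
    if name != "total"
}

# The per-file counts should add up to the stored total.
matches_total = sum(per_file.values()) == word_count_dict["Britain"]["total"]
print(sorted(per_file), matches_total)
```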

It may just be that your indentation is off, but it looks like you are not counting your file entries correctly:

if file not in word_count_dict[word]: 
    word_count_dict[word][file] = 0 
    word_count_dict[word][file] += 1    
    word_count_dict[word]["total"] += 1   

This only counts (+= 1) when file has not been seen before in that word's dictionary. Corrected, it reads:

if file not in word_count_dict[word]: 
    word_count_dict[word][file] = 0 
word_count_dict[word][file] += 1    
word_count_dict[word]["total"] += 1   
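
To check the corrected counting without touching the real folder, here is a small sketch that runs the same logic over in-memory text. The helper count_words and the sample contents are assumptions made for illustration; they are not part of the question's code.

```python
import io

def words_generator(fileobj):
    """Yield whitespace-separated words from a file-like object."""
    for line in fileobj:
        for word in line.split():
            yield word

def count_words(named_files):
    """Return {word: {"total": n, filename: n, ...}} for (name, fileobj) pairs."""
    word_count_dict = {}
    for name, fileobj in named_files:
        for word in words_generator(fileobj):
            counts = word_count_dict.setdefault(word, {"total": 0})
            # Increment on every occurrence, not only the first one per file.
            counts[name] = counts.get(name, 0) + 1
            counts["total"] += 1
    return word_count_dict

# Hypothetical stand-ins for 5.txt and 76.txt.
sample = [
    ("5.txt", io.StringIO("Britain and France\nBritain again")),
    ("76.txt", io.StringIO("Britain Spain")),
]
result = count_words(sample)
print(result["Britain"])
```

With the increments outside the if, the second "Britain" in 5.txt is counted as well; with the mis-indented version it would be missed.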

To expand this to arbitrary words, loop over the outer word_count_dict:

# .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only).
for word, counts in word_count_dict.items():
    print('Total counts for word {}: {}'.format(word, counts['total']))
    for filename, count in counts.items():
        if filename == 'total': continue
        print("{} appears in {} {} times".format(word, filename, count))
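
For the D2 part of the question, one possible sketch is below. It assumes d maps each word to a list of file numbers, as in d[Britain] = [5, 76, 289], and that the keys in word_count_dict are the full paths returned by glob. The function name build_d2 and the sample data are made up for illustration.

```python
import os

def build_d2(word_count_dict, d):
    """Return {word: (count in the files listed in d[word], count in all files)}."""
    d2 = {}
    for word, file_numbers in d.items():
        # Words that never occurred get an empty entry with total 0.
        counts = word_count_dict.get(word, {"total": 0})
        # Map each stored path to its bare file name, e.g. '.../5.txt' -> '5.txt'.
        by_name = {
            os.path.basename(path): n
            for path, n in counts.items()
            if path != "total"
        }
        in_listed = sum(by_name.get("{}.txt".format(num), 0) for num in file_numbers)
        d2[word] = (in_listed, counts["total"])
    return d2

# Hypothetical sample data; the real dictionaries come from the question's code.
word_count_dict = {
    "Britain": {
        "total": 6,
        "d:/individual-articles/5.txt": 2,
        "d:/individual-articles/76.txt": 1,
        "d:/individual-articles/9.txt": 3,
    },
    "Spain": {"total": 4, "d:/individual-articles/9.txt": 4},
}
d = {"Britain": [5, 76, 289], "Spain": [9], "France": [1]}

d2 = build_d2(word_count_dict, d)
for word in sorted(d2):
    print((word,) + d2[word])
```

On the Counter in the question: counter.update(filename) iterates over its argument, so passing a string adds one count per character of the path rather than one count for the file. The sketch above does not need a Counter at all, since word_count_dict already holds every number required.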

And suppose I have many such words, "Britain", "France", "Spain"; would this do the job: for k in word_count_dict.keys(): – radhika

@radhika: Exactly. Each k is a word, and word_count_dict[k] is itself a dictionary mapping filenames to counts.

So is this correct? for k in word_count_dict.keys(): for filename in word_count_dict[k]: if filename == 'total': continue print(k + " appears in {} {} times".format(filename, word_count_dict[k][filename])) – radhika