2014-12-06 108 views
0

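10 most common words in a string (Python)

I need to display the 10 most frequent words in a text file, from most common to least common, along with the number of times each one is used. I can't use dictionaries or the Counter function. So far I have this: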

import urllib
cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt

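Now, I think I should look up each word in the "uniques" array, see how many times it is repeated in the file, and then add that to another array that counts the instances of each word. But this is where I'm stuck. I don't know how to proceed.

Any help would be appreciated. Thank you.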

+4

This sounds like a homework question – Greg 2014-12-06 01:31:19

+1

@Greg It is. However, SO doesn't discriminate against homework, so I don't see the problem? – 2014-12-06 01:32:53

+1

What is wrong with your code? What doesn't work? What error message are you getting? Or do you just want someone to write the code for you? – 2014-12-06 01:38:48

Answers

0

Personally, I would implement collections.Counter myself. I assume you know how that object works, but if not, I'll summarize:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}

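We could of course sort that by frequency using the key keyword argument of sorted, and return the first 10 items of the resulting list. However, that doesn't help you much, since you don't have Counter implemented. I'll leave that part as an exercise and show you how to implement Counter as a function rather than an object.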

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d

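It's actually not difficult. Iterate over each element of the iterable. If the element is not in d, add it to d with a value of 1. If it is in d, increment its value.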
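To connect the counting function above with the "ten most frequent" goal, here is a minimal sketch of the sorting step using sorted and its key argument. The helper name top_n and the tie-breaking rule (alphabetical order among equal counts) are assumptions made for this illustration and are not part of the original answer. It is written to run on Python 3.

```python
def counter(iterable):
    """Count occurrences of each element, as in the function above."""
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d


def top_n(counts, n=10):
    """Return up to n (word, count) pairs, highest count first.

    top_n is a hypothetical helper name used only for this sketch.
    """
    # Sort by descending count; break ties alphabetically so the order is stable.
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:n]


text = "some words that are mostly different but are not all different not at all"
for word, count in top_n(counter(text.split()), 3):
    print(word, count)
```

Negating the count in the key gives descending order without reverse=True, which keeps the alphabetical tie-break ascending.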

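def counter(iterable):
    d = {}
    for element in iterable:
        # setdefault returns the current count (0 for an unseen element);
        # store that count plus one.
        d[element] = d.setdefault(element, 0) + 1
    return d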

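Note that for your use case you probably want to strip punctuation and possibly casefold the whole thing (so that someword is counted the same as Someword, rather than as two separate words). I'll leave that to you, but I'll point out that str.strip takes an argument specifying which characters to strip, and string.punctuation contains all the punctuation you are likely to need.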
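As a rough illustration of that advice, the sketch below strips punctuation from both ends of each token and lowercases it. The helper name normalize and the sample sentence are made up for this example. str.lower is used here; on Python 3, str.casefold is a more aggressive alternative.

```python
import string


def normalize(word):
    """Lowercase a token and strip punctuation from both ends.

    normalize is a hypothetical helper name for this sketch.
    """
    # str.strip(chars) removes any of the given characters from both ends only,
    # so an apostrophe inside a word such as "don't" is kept.
    return word.strip(string.punctuation).lower()


sample = 'Alice said, "Oh dear!" and the Rabbit\'s watch...'
cleaned = [normalize(w) for w in sample.split()]
# Drop tokens that were pure punctuation and are now empty.
cleaned = [w for w in cleaned if w]
print(cleaned)
```

Filtering out empty strings matters because a token such as "--" becomes "" after stripping and would otherwise be counted as a word.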

+0

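Thanks for your help. How would I apply this function? I'll take care of the remaining details, such as sorting and stripping; I just need this part – KevinKZ 2014-12-06 01:49:18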

+0

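@KevinKZ A file object is already an iterator over its lines. I would create a generator that splits each line on whitespace, strips each word as needed, and pass the whole thing to the 'counter' function. Something like 'words = (word.strip(string.punctuation) for line in file_obj for word in line.split())' and then 'count = counter(words)' – 2014-12-06 01:51:34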
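The pipeline described in this comment could look roughly like the sketch below. It is an illustration only: io.StringIO stands in for an open text file, the sample text is invented, and the added lower() call goes slightly beyond what the comment spells out.

```python
import io
import string


def counter(iterable):
    """Dictionary-based count, equivalent to the answer's counter function."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d


# io.StringIO stands in for an open text file here (an assumption for this
# sketch); like a real file object, iterating over it yields one line at a time.
file_obj = io.StringIO(u"The cat sat.\nThe dog sat, too!\n\nThe end.\n")

# Generator expression: walk the lines, split each on whitespace, and strip
# surrounding punctuation from every token. Nothing runs until it is consumed.
words = (word.strip(string.punctuation).lower()
         for line in file_obj
         for word in line.split())

count = counter(words)
print(count)
```

Because words is a generator, the text is processed one line at a time rather than being held in memory as a full list of words.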

+0

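You don't want to start the counter with a default value of 1, but 0. The expanded version is fine, but the one using setdefault should start at 0 – chapelo 2014-12-06 02:35:20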

2

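You're on the right track. Note that this algorithm is slow, because for every unique word it iterates over all of the words. A fast approach without hashing would involve building a trie.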

# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the list of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                   # Initialize the count to zero.
    for word in words:          # Iterate over the words.
        if word == unique:      # Is this word equal to the current unique?
            count += 1          # If so, increment the count.
    counts.append((count, unique))

counts.sort()       # Sorting the list puts the lowest counts first.
counts.reverse()    # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
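The trie mentioned at the start of this answer could be sketched as follows. This is a minimal illustration, not the answer author's code: the names TrieNode, add_word, collect and top_words are hypothetical, and each node keeps its children in a plain list of (character, node) pairs so that the code itself uses no dict.

```python
class TrieNode(object):
    """One node per character; count is the number of words ending here."""

    def __init__(self):
        self.children = []  # list of (character, TrieNode) pairs
        self.count = 0


def add_word(root, word):
    """Walk or extend the path for word, then bump the count at its end."""
    node = root
    for ch in word:
        for child_ch, child in node.children:
            if child_ch == ch:
                node = child
                break
        else:
            # No existing child for this character: create one.
            child = TrieNode()
            node.children.append((ch, child))
            node = child
    node.count += 1


def collect(node, prefix, out):
    """Append (count, word) for every word stored below node."""
    if node.count:
        out.append((node.count, prefix))
    for ch, child in node.children:
        collect(child, prefix + ch, out)


def top_words(words, n=10):
    """Return up to n (count, word) tuples, highest count first."""
    root = TrieNode()
    for word in words:
        add_word(root, word)
    out = []
    collect(root, "", out)
    out.sort(reverse=True)
    return out[:n]


print(top_words("a b a c a b".split(), 2))
```

Each insertion costs at most the word's length times the number of distinct characters, so counting no longer requires scanning every word once per unique word.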

0

from string import punctuation  # needed to strip the punctuation

import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}

for line in txtFile:
    words = line.split()
    for word in words:
        # Strip punctuation and lowercase, so "the"/"The" or "you"/"You" count as one word.
        k = word.strip(punctuation).lower()
        # You still have words like I've, you're, Alice's;
        # you could change re to are, ve to have, etc.
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k]
        # Now the tally.
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
# Sort the counter by its values, which hold the tally.
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]

0

import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # join the lines with spaces (each line keeps its own line ending)
# Remove everything that is not alphanumeric or whitespace.
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # first occurrence: add it with a value of 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # already present: increment it by 1

# Sort the dict's keys by their values, from top to bottom, and take the top 10 items.
for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    print "%s: %s - %s" % (i + 1, word, word_counter[word])

Output:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338

This method ensures that only alphanumeric characters and spaces end up in the counter. Not that it matters much.

0

You can also do this with a pandas DataFrame and get the result as an ordered table of each word and its frequency.

import pandas as pn  # the alias used in this answer; pandas is more commonly imported as pd

def count_words(words_list):
    words_df = pn.DataFrame(words_list, columns=["word"])
    # One row per distinct word, with a count column initialised to zero.
    words_df_unique = pn.DataFrame(words_df["word"].unique(), columns=["unique"])
    words_df_unique["count"] = 0
    for i, word in enumerate(words_df_unique["unique"].tolist()):
        # Number of rows in the full list that equal this word.
        words_df_unique.iloc[i, 1] = (words_df["word"] == word).sum()
    res = words_df_unique.sort_values("count", ascending=False)
    return res

+0

So you will have a DataFrame, and you can use df.head(10) to select the 10 most frequent words, or df.tail(10) to select the 10 rarest words. – 2017-04-21 17:47:26
