10 most common words in a string, Python

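I need to display the 10 most frequent words in a text file, from most to least common, along with the number of times each one is used. I can't use dictionaries or the Counter function. So far I have this: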

import urllib
cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt

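Now, I think I should look up every word in the 'uniques' array, see how many times it is repeated in the file, and then add that to another array that counts the instances of each word. But this is where I am stuck. I don't know how to proceed.

Any help would be appreciated. Thank you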


2014-12-06 KevinKZ

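This sounds like a homework problem – Greg 2014-12-06 01:31:19

@Greg It is indeed. However, SO doesn't discriminate against homework, so I don't see the problem? – 2014-12-06 01:32:53

What's wrong with your code? What isn't working? What error message do you get? Or do you just want someone to write the code for you? – 2014-12-06 01:38:48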

Personally, I would write my own implementation of collections.Counter. I assume you know how that object works, but if not, I'll summarize:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}

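We can of course sort that by frequency using the key keyword argument of sorted, and return the first 10 items of the resulting list. However, that doesn't help you much, because you don't have Counter implemented. I'll leave that part as an exercise and show you how you might implement Counter as a function rather than an object.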

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d

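It's actually not that difficult. Go through each element of the iterable. If that element is not in d, add it to d with a value of 1. If it is in d, increment that value. It can be expressed more compactly as: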

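def counter(iterable):
    d = {}
    for element in iterable:
        # setdefault returns the current tally, or 0 the first time the element is seen
        d[element] = d.setdefault(element, 0) + 1
    return d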

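Note that in your use case you probably want to strip out punctuation and possibly casefold the whole thing (so that someword is counted the same as Someword rather than as two separate words). I'll leave that to you, but I'll point out that str.strip takes an argument saying what to strip, and string.punctuation contains all the punctuation you are likely to need.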
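
As a rough sketch of the normalization step described above (it is not part of the original answer): the helper name `normalize` and the sample sentence are made up for illustration, and `lower()` is used where `casefold()` would also work on Python 3.

```python
import string

def normalize(word):
    # Strip punctuation from both ends only; inner apostrophes ("don't") are kept.
    # lower() is used here; on Python 3, str.casefold() is a more aggressive alternative.
    return word.strip(string.punctuation).lower()

text = 'Alice said, "Oh dear!" and Alice sighed.'
# Drop tokens that were nothing but punctuation (they normalize to "").
words = [w for w in (normalize(t) for t in text.split()) if w]
```

With this, "Alice" and "alice" end up as the same token before any counting happens.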


2014-12-06 01:42:09

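Thank you for your help. How would I implement this function? I'll take care of the remaining details such as sorting and stripping, I just need this part – KevinKZ 2014-12-06 01:49:18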

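@KevinKZ A file object is already an iterator over its lines. I would create a generator function that splits the lines on whitespace, strips them however necessary, and passes the whole thing to the `counter` function. Something like `words = (word.strip(however) for line in file_obj for word in line.split())` `count = counter(words)` – 2014-12-06 01:51:34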
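
The comment above could be fleshed out roughly as follows. This is only a sketch under some assumptions: `iter_words` and `top_n` are hypothetical helper names, `counter` repeats the dictionary-based idea from the answer, and an `io.StringIO` object stands in for the real file so the example is self-contained.

```python
import io
import string

def counter(iterable):
    # Same idea as the answer's counter(): map each element to its tally.
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

def iter_words(file_obj):
    # A file object iterates over its lines; split each line on whitespace,
    # strip surrounding punctuation, lowercase, and skip empty leftovers.
    for line in file_obj:
        for word in line.split():
            word = word.strip(string.punctuation).lower()
            if word:
                yield word

def top_n(counts, n=10):
    # Sort (word, count) pairs by count, highest first, and keep the first n.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Stand-in for an opened text file (assumption for the example).
file_obj = io.StringIO(u"The cat saw the dog.\nThe dog ran!\n")
result = top_n(counter(iter_words(file_obj)), 2)
```

Because `iter_words` is a generator, the words are fed to `counter` one at a time without first building a full list in memory.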

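You don't want to start the counter with a default of 1 but with 0. The expanded version is fine, but the one using setdefault should start at 0 – chapelo 2014-12-06 02:35:20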

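You're on the right track. Note that this algorithm is slow, because for each unique word it iterates over all of the words. A fast approach without hashing would involve building a trie.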

# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the list of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0  # Initialize the count to zero.
    for word in words:  # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1  # If so, increment the count.
    counts.append((count, unique))

counts.sort()  # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
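
The trie mentioned at the start of this answer is not shown in the original. One possible sketch, assuming each node is a two-item list `[count, children]` where `children` is a list of `(char, node)` pairs, so no dictionaries are involved; the function names `make_node`, `insert`, `collect` and `top_words` are all hypothetical:

```python
def make_node():
    # A node is [count, children]; children is a list of (char, node) pairs.
    return [0, []]

def insert(root, word):
    # Walk down the trie one character at a time, creating nodes as needed.
    node = root
    for ch in word:
        for child_ch, child in node[1]:
            if child_ch == ch:
                node = child
                break
        else:
            # No child for this character yet: add one and descend into it.
            child = make_node()
            node[1].append((ch, child))
            node = child
    node[0] += 1  # One more occurrence of the word ending at this node.

def collect(node, prefix, out):
    # Gather (count, word) tuples for every word stored below this node.
    if node[0] > 0:
        out.append((node[0], prefix))
    for ch, child in node[1]:
        collect(child, prefix + ch, out)

def top_words(words, n=10):
    root = make_node()
    for word in words:
        insert(root, word)
    counts = []
    collect(root, "", counts)
    counts.sort(reverse=True)  # Highest counts first.
    return counts[:n]

example = top_words("a b a c b a".split(), 2)
```

Here the cost of inserting a word depends on its length and on how many children each node has, not on the number of unique words seen so far.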


2014-12-06 01:51:46

from string import punctuation  # needed to strip the punctuation

import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}

for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # "the"/"The" or "you"/"You" are counted as one word
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k]
        # now the tally
        for k in ks:
            if k:  # skip empty strings left over from tokens that were only punctuation
                counter[k] = counter.get(k, 0) + 1
# and sort the counter by the value, which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]


2014-12-06 02:32:28 chapelo

import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # joins the lines returned by .readlines() with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or whitespace

word_counter = {}
# split on every space; note that the line ending stays attached to the last word of each line
for word in txtFile.split(" "):
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' already in word_counter, increment it by 1

for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, and takes the top 10 items
    print "%s: %s - %s" % (i + 1, word, word_counter[word])

Output:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338

This method ensures that only alphanumeric characters and spaces end up in the counter, though that doesn't matter much.


2014-12-06 03:40:09

You can also do this with pandas dataframes and get the result in table form, "word - its frequency", ordered.

import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    # one row per distinct word
    words_df_unique = pn.DataFrame(words_df["word"].unique())
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in words_df_unique["unique"].tolist():
        # number of rows in words_df that match this word
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res


2017-04-21 17:45:42

So you end up with a dataframe from which you can select the 10 most frequent words with df.head(10), or the 10 rarest words with df.tail(10). – 2017-04-21 17:47:26
