2015-10-26 40 views

I have been trying to detect word/bigram trends over a set of text snippets. So far, I remove stopwords, lowercase the tokens, compute word frequencies, and append each text's 30 most common words to a list. What I need now is the cumulative count of those word frequencies across all documents. Per-document, the frequencies look like this:

[(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1),...] 

I then flatten the lists above into one list containing all the words together with their per-document frequencies. What I need to do now is get back a sorted, merged list, i.e. one big list:

[(u'snow', 32), (u'said.', 12), (u'GoT', 10), (u'death', 8), (u'entertainment', 4)..] 

Any ideas?

Code:

from nltk import FreqDist

fdists = []
for i in texts:
    # Per-document frequency of lowercased, non-stopword tokens
    words = FreqDist(w.lower() for w in i.split() if w.lower() not in stopwords)
    fdists.append(words.most_common(30))

# Flatten the per-document top-30 lists into one list of (word, count) pairs
all_in_one = [item for sublist in fdists for item in sublist]
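The aggregation the question asks for can also be sketched with `collections.Counter`, which sums per-document counts in one pass. The `texts` and `stopwords` values below are hypothetical stand-ins for the question's data:

```python
from collections import Counter

# Hypothetical stand-ins for the question's data
texts = [u"snow fell and snow drifted", u"snow melted at dawn"]
stopwords = {u"and", u"at"}

total = Counter()
for text in texts:
    words = [w.lower() for w in text.split() if w.lower() not in stopwords]
    # Keep only each document's top 30 words, as the question does,
    # then add those counts into the running total
    total.update(dict(Counter(words).most_common(30)))

print(total.most_common(5))
```

`Counter.update` with a mapping adds counts rather than replacing them, so `total` ends up holding the cumulative frequency of every word across all documents.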

Why don't you use a dictionary? – SirParselot


From the start, to capture the occurrences of each unique word, or after the for loop? – Swan87


I tried using collections.Counter from the start, but it takes forever to execute. – Swan87

Answer


If all you want to do is sort your list, you can use

import operator

fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2

# Accumulate the counts of duplicate words across both lists
fdict = {}
for word, count in fdists:
    if word in fdict:
        fdict[word] += count
    else:
        fdict[word] = count

# Sort by count, highest first
sorted_f = sorted(fdict.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_f[:30])

[(u'said.', 6), (u'seeing', 5), (u'death', 4), (u'entertainment', 4), (u'read', 4), (u'it\u2019s', 4), (u'weiss', 4), (u'one', 4), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)] 
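The if/else bookkeeping in the loop above is exactly what `collections.Counter` does for you, since missing keys default to 0. An equivalent sketch on a small subset of the same data:

```python
from collections import Counter

fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'death', 1)]

# Counter behaves like a dict whose missing keys default to 0,
# so no membership test is needed before adding
totals = Counter()
for word, count in fdists + fdists2:
    totals[word] += count

# most_common() replaces the explicit sorted(...) call
print(totals.most_common(30))
```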

Another way you can handle the duplicates is to use pandas' groupby() function and then sort by count and word, like this

import pandas as pd

fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2

df = pd.DataFrame(data=fdists, columns=['word', 'count'])
# Collapse duplicate words by summing their counts
df = pd.DataFrame([{'word': k, 'count': v['count'].sum()} for k, v in df.groupby('word')],
                  columns=['word', 'count'])

# DataFrame.sort() was removed in pandas 0.20; sort_values() is the replacement
Sorted = df.sort_values(['count', 'word'], ascending=[False, True])
print(Sorted[:30])

      word count 
8   said.  6 
9   seeing  5 
2   death  4 
3 entertainment  4 
4   it’s  4 
5    one  4 
7   read  4 
12   weiss  4 
0   bloody  1 
1   dead,”  1
6   people  1 
10   shot  1 
11   show’s  1 
13   “it  1
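The list-comprehension rebuild of the DataFrame above can also be expressed directly as a groupby aggregation, which is shorter and avoids constructing the intermediate dicts. A sketch on a small subset of the same data:

```python
import pandas as pd

fdists = [(u'seeing', 2), (u'said.', 2), (u'seeing', 3), (u'said.', 4)]
df = pd.DataFrame(fdists, columns=['word', 'count'])

# Sum duplicate words in one step, then sort by count (desc) and word (asc)
totals = (df.groupby('word', as_index=False)['count'].sum()
            .sort_values(['count', 'word'], ascending=[False, True]))
print(totals)
```

`as_index=False` keeps `word` as an ordinary column instead of the group index, so the result has the same shape as the answer's DataFrame.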

Will sorting this way find the cumulative counts of the words? I mean, if the word death is mentioned in two documents, 5 times and 2 times respectively, will the final count for "death" be 7, or will there be 2 separate entries? Thanks – Swan87


@Swan87 I updated the pandas answer to handle duplicate records – SirParselot


@Swan87 Both answers are updated, but I personally prefer pandas. I think it looks cleaner, and if you want to do anything else with the data, a DataFrame is easier to manipulate than a list. – SirParselot