Bigrams and the rank of words

I use this code to get bigram frequencies:

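from collections import defaultdict
import nltk

text1 = 'the cat jumped over the dog in the dog house'
text = text1.split()

# Count how often each adjacent word pair (bigram) occurs
counts = defaultdict(int)
for pair in nltk.bigrams(text):
    counts[pair] += 1

# Python 2: print each bigram followed by its count
for pair, c in counts.iteritems():
    print pair, c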

The output is:

('the', 'cat') 1 
('dog', 'in') 1 
('cat', 'jumped') 1 
('jumped', 'over') 1 
('in', 'the') 1 
('over', 'the') 1 
('dog', 'house') 1 
('the', 'dog') 2

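What I need is the same listing of bigrams, but instead of each word I need the rank of the word printed. By "rank" I mean that the word with the highest frequency has rank 1, the second most frequent has rank 2, and so on. The ranks here are: 1. the 2. dog, and words with the same frequency are assigned the next ranks in descending order: 3. cat 4. jumped 5. over, etc.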

For example:

1 3 1

instead of

('the', 'cat') 1

I think that to do this I need a dictionary mapping words to their ranks, but I am stuck and don't know how to continue. What I have so far is:

from nltk import FreqDist

fd=FreqDist() 
ranks=[] 
rank=0 
for word in text: 
    fd.inc(word) 
for rank, word in enumerate(fd): 
    ranks.append(rank+1) 

word_rank = {} 
for word in text: 
    word_rank[word] = ranks 

print ranks

Asked 2012-01-19 by Julia

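Why does "('the', 'cat') 1" become "1 3 1"? Why is 'cat' 3? Shouldn't it be 2? ('cat' is the second word in your text.) – juliomalegria

By "rank" I mean that the word with the highest frequency has rank 1, the second has rank 2, etc. The ranks here are: 1. the 2. dog, and those with the same frequency are assigned ranks in descending order: 3. cat 4. jumped 5. over, etc. – Julia

If 'dog' and 'the' occurred equally often in your text, would 'dog' be ranked before 'the' because the first 'dog' appears before the first 'the'? –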

Assuming counts has already been created, the following should give you the result you want:

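# Frequency of each individual word (defaultdict comes from collections)
freq = defaultdict(int)
for word in text:
    freq[word] += 1

# Order words by descending frequency; ties go to the word that appears first in text
ranks = sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
# Map each word to its 1-based rank
ranks = dict(zip(ranks, range(1, len(ranks)+1)))

# Python 2: print the rank of each word in the bigram, then the bigram count
for (a, b), count in counts.iteritems():
    print ranks[a], ranks[b], count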

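Output:

1 3 1
2 6 1
3 4 1
4 5 1
6 1 1
5 1 1
2 7 1
1 2 2

Here are some intermediate values that may help in understanding how it works: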

>>> dict(freq) 
{'house': 1, 'jumped': 1, 'over': 1, 'dog': 2, 'cat': 1, 'in': 1, 'the': 3} 
>>> sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k))) 
['the', 'dog', 'cat', 'jumped', 'over', 'in', 'house'] 
>>> dict(zip(ranks, range(1, len(ranks)+1))) 
{'house': 7, 'jumped': 4, 'over': 5, 'dog': 2, 'cat': 3, 'in': 6, 'the': 1}
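
For readers on Python 3, the same idea can be sketched without NLTK. This is only an illustrative variant of the approach above, not the answerer's code: the names first_pos and rows are made up for this sketch, and zip(text, text[1:]) is assumed to be an acceptable stand-in for nltk.bigrams on a plain list of words.

```python
from collections import Counter

text = 'the cat jumped over the dog in the dog house'.split()

# Frequency of each word.
freq = Counter(text)

# Index of the first occurrence of each word (hypothetical helper dict),
# computed once instead of calling text.index for every key.
first_pos = {}
for i, word in enumerate(text):
    first_pos.setdefault(word, i)

# Rank 1 is the most frequent word; ties go to the word seen first.
ordered = sorted(freq, key=lambda w: (-freq[w], first_pos[w]))
ranks = {word: rank for rank, word in enumerate(ordered, start=1)}

# Count adjacent word pairs (bigrams).
counts = Counter(zip(text, text[1:]))

# One (rank of first word, rank of second word, bigram count) triple per bigram.
rows = [(ranks[a], ranks[b], n) for (a, b), n in counts.items()]
for row in rows:
    print(*row)
```

The printed lines include 1 3 1 for ('the', 'cat') and 1 2 2 for ('the', 'dog'). The order of the lines follows dictionary insertion order on Python 3.7+, so it may differ from the Python 2 output shown in the question.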

Answered 2012-01-19 19:19:02

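Awesome, thank you very much! – Julia

Follow-up question: how can I store the resulting matrix in a file? Thanks! – Julia

There are several questions about how to write to a file (http://stackoverflow.com/search?q=python+write+to+file); if you are still stuck, feel free to ask a separate question. –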
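
As a rough illustration of the follow-up, one way to store the rank triples is one space-separated line per bigram. This is a minimal Python 3 sketch under stated assumptions: the rows list holds hypothetical sample values, and the file name bigram_ranks.txt and the temporary directory are placeholders chosen only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical sample rows: (rank of first word, rank of second word, bigram count).
rows = [(1, 3, 1), (1, 2, 2), (2, 7, 1)]

# Placeholder output location; replace with a real path as needed.
out_path = os.path.join(tempfile.mkdtemp(), 'bigram_ranks.txt')

# Write one line per bigram in the "1 3 1" layout used in the question.
with open(out_path, 'w') as f:
    for row in rows:
        f.write(' '.join(str(value) for value in row) + '\n')

# Read the file back to check what was stored.
with open(out_path) as f:
    written = f.read().splitlines()
print(written)
```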

from collections import Counter

text1 = 'the cat jumped over the dog in the dog house'.split(' ')

# 1-based position of the first occurrence of each word
word_to_rank = {}
for i, word in enumerate(text1):
    if word not in word_to_rank:
        word_to_rank[word] = i + 1

word_to_frequency = Counter(text1)

# Sort key per word: higher frequency first, then earlier first occurrence
word_to_tuple = {}
for word in word_to_rank:
    word_to_tuple[word] = (-word_to_frequency[word], word_to_rank[word])

# The tuples are unique because first-occurrence positions are unique
tuple_to_word = dict(zip(word_to_tuple.values(), word_to_tuple.keys()))

sorted_by_conditions = sorted(tuple_to_word.keys())

# Final rank: 1 for the most frequent word
word_to_true_rank = {}
for i, _tuple in enumerate(sorted_by_conditions):
    word_to_true_rank[tuple_to_word[_tuple]] = i + 1

def fix(pair, c):
    # Replace each word of the bigram with its rank and keep the count
    return word_to_true_rank[pair[0]], word_to_true_rank[pair[1]], c

pair = ('the', 'cat')
c = 1
print(fix(pair, c))

pair = ('the', 'dog')
c = 2
print(fix(pair, c))


>>> 
(1, 3, 1) 
(1, 2, 2)

Answered 2012-01-19 19:25:15
