2016-07-20 44 views
-2

Suppose I have a sentence like the one below and want to find the most frequent group of 3 consecutive words in it:

text="I came from the moon. He went to the other room. She went to the drawing room." 

Here the most frequent group of 3 words is "went to the".

I know how to find the most frequent bigram or trigram, but I am stuck on this one. I want a solution that does not use the NLTK library.

+0

Is there any reason you don't want to use NLTK? – Ares

+0

Just trying it without NLTK, bro.. –

+2

Why was this question put on hold? Three people provided suitable answers, so they clearly understood what was being asked. –

Answers

1
import string

text = "I came from the moon. He went to the other room. She went to the drawing room."

# Replace every punctuation character with a space.
for character in string.punctuation:
    text = text.replace(character, " ")

# split() with no argument collapses runs of whitespace and drops empty strings.
words = text.split()

wordlist = []
frequency_dict = dict()

# Collect every window of three consecutive words (len(words) - 2 windows).
for i in range(len(words) - 2):
    wordlist.append([words[i], words[i+1], words[i+2]])

# Count how often each three-word group occurs.
for three_words in wordlist:
    frequency = wordlist.count(three_words)
    frequency_dict[", ".join(three_words)] = frequency

most_frequent = max(frequency_dict, key=frequency_dict.get)
print("%s %d" % (most_frequent, frequency_dict[most_frequent]))

Output: went, to, the 2

Unfortunately, lists are not hashable. Otherwise it would help to build a set of the three_words items.
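Tuples, unlike lists, are hashable, so one way around this is to store each three-word window as a tuple. The following is only a minimal sketch of that idea: the function name `most_common_trigram` is made up for illustration, it assumes a simple regular expression is good enough to strip punctuation, and it assumes the text contains at least three words.

```python
import re
from collections import Counter


def most_common_trigram(text):
    """Return (trigram_tuple, count) for the most frequent 3-word window."""
    # Keep only runs of letters, so "room." and "room" count as the same word.
    words = re.findall(r"[A-Za-z]+", text)
    # A tuple is hashable, so it can be a Counter (dict) key or a set member.
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return Counter(trigrams).most_common(1)[0]


text = ("I came from the moon. He went to the other room. "
        "She went to the drawing room.")
trigram, count = most_common_trigram(text)
print(" ".join(trigram), count)  # went to the 2
```

Counting with a Counter also avoids calling `list.count` once per window, which is quadratic in the number of words.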

1

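nltk makes this problem trivial, but since you don't want that dependency, I have included a simple implementation that uses only the core libraries. The code works on python2.7 and python3.x, and uses collections.Counter to count the frequencies of the n-grams. Computationally it is O(NM), where N is the number of words in the text and M is the number of n-gram sizes being counted (so if one were counting unigrams and bigrams, M = 2).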

import collections 
import re 
import sys 
import time 


# Convert a string to lowercase and split into words (w/o punctuation) 
def tokenize(string): 
    return re.findall(r'\w+', string.lower()) 


def count_ngrams(lines, min_length=2, max_length=4): 
    lengths = range(min_length, max_length + 1) 
    ngrams = {length: collections.Counter() for length in lengths} 
    queue = collections.deque(maxlen=max_length) 

    # Helper function to add n-grams at start of current queue to dict 
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # If the input had fewer than max_length words, the loop above never
    # counted anything, so count the n-grams at the start of the queue now
    if len(queue) < max_length:
        add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams 


def print_most_frequent(ngrams, num=10): 
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')


if __name__ == '__main__': 
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open(sys.argv[1]) as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))
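
To try the same counting idea on the sentence from the question without reading a file, a shorter variant is sketched below. It is not the code above: it swaps the deque for plain list slicing, and `ngram_counts` is a hypothetical helper name. It makes one pass over the words per n-gram size, which is where the O(NM) cost comes from.

```python
import collections
import re


def ngram_counts(text, min_length=2, max_length=4):
    """Count n-grams of every size from min_length to max_length."""
    words = re.findall(r"\w+", text.lower())
    counts = {}
    for n in range(min_length, max_length + 1):
        # One pass over the words for each size n; a text with fewer
        # than n words simply yields an empty Counter.
        counts[n] = collections.Counter(
            tuple(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return counts


text = ("I came from the moon. He went to the other room. "
        "She went to the drawing room.")
counts = ngram_counts(text)
print(counts[3].most_common(1))  # [(('went', 'to', 'the'), 2)]
```

The deque version is the better fit when the input is a large file read line by line, since it never holds the whole word list in memory.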
0
import re
from collections import Counter
text="I came from the moon. He went to the other room. She went to the drawing room."
fixed_text = re.sub("[^a-zA-Z ]"," ",text)
text_list = fixed_text.split()
# range(len(text_list)-2) covers every three-word window, including the last one
print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list)-2)).most_common(1))

I guess... maybe?

>>> text="I came from the moon. He went to the other room. She went to the drawing room."
>>> fixed_text = re.sub("[^a-zA-Z ]"," ",text)
>>> text_list = fixed_text.split()
>>> print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list)-2)).most_common(1))
[('went to the', 2)] 
>>> 
+0

This does not work in Python 2.7 –

+0

I had a few typos... –

+0

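Well, you can lead a horse to water, but you can't make him drink, I guess... Are you also hoping someone will do your interview for you? Or your school exam? –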
