Find the most frequent group of 3 words in a sentence


text="I came from the moon. He went to the other room. She went to the drawing room."

Here the most frequent group of 3 words is "went to the".

I know how to find the most frequent bigram or trigram, but I am stuck on this one. I want a solution that does not use the NLTK library.

Source

2016-07-20 Pankaj Sharma

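Is there any reason you don't want to use NLTK? – Ares

Just trying without nltk, bro.. –

Why was this question put on hold? Three people provided suitable answers, so they evidently understood what was being asked. –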

import string 

text="I came from the moon. He went to the other room. She went to the drawing room." 

for character in string.punctuation: 
    text = text.replace(character, " ") 

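# str.split() with no argument splits on any run of whitespace and drops
# empty strings, so repeated spaces need no separate clean-up
text = text.split()

wordlist = []
frequency_dict = dict()

# A list of n words contains n - 2 runs of three consecutive words
for i in range(len(text) - 2):
    wordlist.append([text[i], text[i+1], text[i+2]])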

for three_words in wordlist: 
    frequency= wordlist.count(three_words) 
    frequency_dict[", ".join(three_words)] = frequency 

# Look up the most frequent key once, then print it with its count
best = max(frequency_dict, key=frequency_dict.get)
print(best, frequency_dict[best])

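Output: went, to, the 2

Unfortunately, lists are not hashable; otherwise it would help to build a set of the three_words items. (Tuples are hashable, so storing each triple as a tuple would allow that.)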
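A minimal sketch of the tuple variant mentioned above, assuming the same sentence as in the question. It swaps the manual dict for collections.Counter, which is not what this answer uses; the names triples, counts, best and freq are illustrative only.

```python
import string
from collections import Counter

text = "I came from the moon. He went to the other room. She went to the drawing room."

# Replace punctuation with spaces, then split on any run of whitespace
for character in string.punctuation:
    text = text.replace(character, " ")
words = text.split()

# Tuples, unlike lists, are hashable, so they can be dict keys or set members
triples = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

# Counter tallies each distinct tuple in a single pass
counts = Counter(triples)
best, freq = counts.most_common(1)[0]
print(", ".join(best), freq)
```

Because the triples are hashable, set(triples) also works, and counting takes one pass instead of one wordlist.count() scan per triple.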

Source

2016-07-20 22:21:04

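nltk makes this problem trivial, but since you don't want that dependency, I have included a simple implementation that uses only the core libraries. The code works on python2.7 and python3.x, and uses collections.Counter to count the frequency of n-grams. Computationally it is O(NM), where N is the number of words in the text and M is the number of n-gram lengths being counted (so if one counts unigrams and bigrams, M = 2).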

import collections 
import re 
import sys 
import time 


# Convert a string to lowercase and split into words (w/o punctuation) 
def tokenize(string): 
    return re.findall(r'\w+', string.lower()) 


def count_ngrams(lines, min_length=2, max_length=4): 
    lengths = range(min_length, max_length + 1) 
    ngrams = {length: collections.Counter() for length in lengths} 
    queue = collections.deque(maxlen=max_length) 

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # If the text has fewer than max_length words, the loop above never
    # counted the n-grams that start at the first word, so count them now
    if len(queue) < max_length:
        add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams


def print_most_frequent(ngrams, num=10):
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open(sys.argv[1]) as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))
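For the single sentence in the question, the same idea (a bounded deque as a sliding window, with collections.Counter holding the tallies) can be sketched without reading a file. This is a simplified illustration for one n-gram length, not a replacement for the script above; the function name most_common_ngram is hypothetical.

```python
import collections
import re


def most_common_ngram(text, n=3):
    """Return (n-gram tuple, count) for the most frequent n-gram, or None."""
    # Lowercase and keep only runs of word characters, as tokenize() does
    words = re.findall(r'\w+', text.lower())
    # A deque with maxlen=n always holds the last n words seen
    window = collections.deque(maxlen=n)
    counts = collections.Counter()
    for word in words:
        window.append(word)
        if len(window) == n:
            counts[tuple(window)] += 1
    # Fewer than n words means there is nothing to report
    return counts.most_common(1)[0] if counts else None


text = "I came from the moon. He went to the other room. She went to the drawing room."
print(most_common_ngram(text))
```

On the question's sentence this prints (('went', 'to', 'the'), 2). The words come out lowercased because of the tokenizing step.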

Source

2016-07-20 21:58:54 manan

import re
from collections import Counter

text="I came from the moon. He went to the other room. She went to the drawing room."
fixed_text = re.sub("[^a-zA-Z ]"," ",text)
text_list = fixed_text.split()
# n words contain n - 2 three-word windows, hence len(text_list)-2
print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list)-2)).most_common(1))

I guess... maybe?

>>> import re
>>> from collections import Counter
>>> text="I came from the moon. He went to the other room. She went to the drawing room."
>>> fixed_text = re.sub("[^a-zA-Z ]"," ",text)
>>> text_list = fixed_text.split()
>>> print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list)-2)).most_common(1))
[('went to the', 2)]
>>>

Source

2016-07-20 22:00:24

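This doesn't work in Python 2.7 –

I had a few typos... –

Well, you can lead a horse to water, but you can't make him drink, I guess... Are you also hoping someone will do your interview for you? Or your school exams? –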
