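Suppose I have a sentence like the one below and want to find the most frequent group of 3 words:
text="I came from the moon. He went to the other room. She went to the drawing room."
Here the most frequent group of 3 words is "went to the".
I know how to find the most frequent bigram or trigram, but I am stuck on this. I want to find a solution without using the NLTK library.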
import string

text = "I came from the moon. He went to the other room. She went to the drawing room."

# Replace every punctuation character with a space
for character in string.punctuation:
    text = text.replace(character, " ")

# split() with no argument collapses runs of whitespace and drops empty strings
words = text.split()

# Collect every run of three consecutive words (the last one starts at len - 3)
wordlist = []
for i in range(len(words) - 2):
    wordlist.append([words[i], words[i + 1], words[i + 2]])

# Count how often each three-word group occurs
frequency_dict = dict()
for three_words in wordlist:
    frequency_dict[", ".join(three_words)] = wordlist.count(three_words)

best = max(frequency_dict, key=frequency_dict.get)
print("{} {}".format(best, frequency_dict[best]))
Output: went, to, the 2
Unfortunately, lists are not hashable. Otherwise it would help to create a set of the three_words items.
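One way around the hashability limitation, as a rough sketch rather than the answerer's own method: tuples are hashable, so each three-word group can be stored as a tuple and counted directly with collections.Counter. The `\w+` tokenization here is an assumption and may not match the punctuation handling above in every case.

```python
from collections import Counter
import re

text = "I came from the moon. He went to the other room. She went to the drawing room."

# Assumption: a "word" is any run of letters, digits, or underscores
words = re.findall(r"\w+", text)

# zip over three staggered views of the list yields each consecutive triple as a tuple
trigrams = Counter(zip(words, words[1:], words[2:]))

best, count = trigrams.most_common(1)[0]
print(" ".join(best), count)
```

This also avoids calling list.count() once per group, which makes the counting step quadratic in the number of groups.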
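nltk makes this problem trivial, but seeing as you don't want such a dependency, I have included a simple implementation that uses only the core libraries. The code works on python2.7 and python3.x, and uses collections.Counter to count the frequencies of the n-grams. Computationally, it is O(NM), where N is the number of words in the text and M is the number of n-gram lengths being counted (so if one were counting unigrams and bigrams, M = 2).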
import collections
import re
import sys
import time


# Convert a string to lowercase and split into words (w/o punctuation)
def tokenize(string):
    return re.findall(r'\w+', string.lower())


def count_ngrams(lines, min_length=2, max_length=4):
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams


def print_most_frequent(ngrams, num=10):
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open(sys.argv[1]) as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))
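For a short string that fits in memory, the same idea of counting several n-gram lengths at once can be sketched more compactly. This is only an illustration: `count_ngrams_in_memory` is a hypothetical helper name, and it trades the streaming deque for plain list slicing. The nested loop makes the roughly N-windows-times-M-lengths cost visible.

```python
from collections import Counter
import re


def count_ngrams_in_memory(text, min_length=2, max_length=4):
    """Hypothetical helper: count n-grams of every length in [min_length, max_length]."""
    words = re.findall(r"\w+", text.lower())
    counts = {}
    for n in range(min_length, max_length + 1):  # one pass per n-gram length (M passes)
        counts[n] = Counter(
            tuple(words[i:i + n])                # one window per start position (about N)
            for i in range(len(words) - n + 1)
        )
    return counts


text = "I came from the moon. He went to the other room. She went to the drawing room."
counts = count_ngrams_in_memory(text)
print(counts[3].most_common(1))
```

Applied to the sentence from the question, the 3-gram counter reports "went to the" as the only group that occurs twice.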
import re
from collections import Counter

text = "I came from the moon. He went to the other room. She went to the drawing room."
fixed_text = re.sub("[^a-zA-Z ]", " ", text)
text_list = fixed_text.split()
# Join each run of three consecutive words and count the resulting strings
print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list) - 2)).most_common(1))
I guess... maybe?
>>> import re
>>> from collections import Counter
>>> text="I came from the moon. He went to the other room. She went to the drawing room."
>>> fixed_text = re.sub("[^a-zA-Z ]"," ",text)
>>> text_list = fixed_text.split()
>>> print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list)-2)).most_common(1))
[('went to the', 2)]
>>>
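This is not working in Python 2.7. –
I had a few typos... –
Well, you can lead a horse to water, but you can't make him drink, I guess... Do you also want someone to do your interviews for you? Or your school exams? –
Is there any reason you don't want to use NLTK? – Ares
Just trying it without nltk, bro.. –
Why was this question put on hold? Three people provided suitable answers, so they clearly understood what was being asked. –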