I'll answer your question with a very simple piece of code, just for illustration. Note that bigram estimation is a bit more involved than you might think. It has to be done in a divide-and-conquer fashion, and it can be estimated with different models, the most common being the hidden Markov model, which I explain in the code below. Note that the larger your data, the better the estimate. I tested the following code on the Brown Corpus.
def bigramEstimation(file):
    '''A very basic solution for the sake of illustration.
    It can be calculated in a more sophisticated way.
    '''
    unigrams = {}  # unigrams and their counts
    bigrams = {}   # bigrams and their counts

    # 1. Read the text file and split it into a list of tokens
    with open(file, 'r') as f:
        lst = f.read().strip().split()
    print('Read', len(lst), 'tokens...')

    # 2. Generate unigram frequencies
    for token in lst:
        if token not in unigrams:
            unigrams[token] = 1
        else:
            unigrams[token] += 1
    print('Generated', len(unigrams), 'unigrams...')

    # 3. Generate bigrams with frequencies
    for i in range(len(lst) - 1):
        temp = (lst[i], lst[i + 1])  # tuples are hashable, so they work as dict keys
        if temp not in bigrams:
            bigrams[temp] = 1
        else:
            bigrams[temp] += 1
    print('Generated', len(bigrams), 'bigrams...')

    # 4. Estimate the probabilities:
    # bigramProb = Count(bigram)/Count(first_word) + Count(first_word)/total_words_in_corpus
    total_corpus = sum(unigrams.values())
    # You can add smoothed estimation if you want
    # (see the add-one smoothing sketch after this code).

    print('Calculating bigram probabilities and saving to file...')
    with open('bigrams.txt', 'w') as out:
        # Comment out the next line if you do not want a header in the file.
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob' + '\n')
        for k, v in bigrams.items():
            # e.g. first_word = 'hello' in ('hello', 'world')
            first_word = k[0]
            first_word_count = unigrams[first_word]
            bi_prob = v / first_word_count
            uni_prob = first_word_count / total_corpus
            final_prob = bi_prob + uni_prob
            # Delete whatever you don't want to print into the file
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' +
                      str(first_word_count) + '\t' + str(final_prob) + '\n')


# Call it on your corpus file
bigramEstimation('hello.txt')
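As noted in the comment inside the function, you can swap in a smoothed estimate. Below is a minimal sketch of add-one (Laplace) smoothing for the conditional bigram probability; the helper name smoothed_bigram_prob, the use of collections.Counter, and the toy sentence are my own additions for illustration, not part of the answer above.

from collections import Counter

def smoothed_bigram_prob(tokens):
    # Hypothetical helper: add-one (Laplace) smoothed P(w2 | w1).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(w1, w2):
        # (Count(w1 w2) + 1) / (Count(w1) + V)
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

    return prob

# Example usage on a toy token list:
p = smoothed_bigram_prob('hello how are you hello how'.split())
print(p('hello', 'how'))   # smoothed estimate of P(how | hello)
print(p('hello', 'you'))   # unseen bigram still gets a small non-zero probability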
I hope this helps!
See https://stackoverflow.com/questions/7591258/fast-n-gram-calculation, https://stackoverflow.com/questions/21883108/fast-optimize-n-gram-implementations-in-python, and https://stackoverflow.com/questions/40373414/counting-bigrams-real-fast-with-or-without-multiprocessing-python – alvas
I need bigram estimation... all the other answers only give the bigrams themselves. I need their probability. Example: Count(hello how) / Count(hello). Do you know how to do that? – Ash
You need an ngram language model... – alvas
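For the plain conditional estimate asked about in the comments, i.e. Count(hello how) / Count(hello), a minimal sketch using collections.Counter could look like the following; the function name bigram_mle and the toy sentence are my own illustration, not part of the thread above.

from collections import Counter

def bigram_mle(tokens):
    # Maximum-likelihood estimate P(w2 | w1) = Count(w1 w2) / Count(w1).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): count / unigrams[w1]
            for (w1, w2), count in bigrams.items()}

# Example: probability of 'how' following 'hello'
probs = bigram_mle('hello how are you hello how are they'.split())
print(probs[('hello', 'how')])  # 1.0 here, since every 'hello' is followed by 'how'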