
So, I'm very new to Python and I'm computing bigrams for this project without using any Python packages. I have to use Python 2.7. This is what I have so far. It takes a file hello and then gives an output like {'Hello','How'} 5. Now for the bigram estimate I have to divide that 5 by the count of 'Hello' (how many times 'Hello' occurs in the whole text file). I'm stuck, any help please! How do I compute the bigram estimates without using the nltk library?

f = open("hello.txt", 'r')
bigrams = []  # all bigrams collected across the whole file
for line in f:
    items = line.split()
    for i in range(len(items) - 1):
        bigrams.append((items[i], items[i+1]))
f.close()

# Count each bigram once the whole file has been read
my_dict = {bigram: bigrams.count(bigram) for bigram in bigrams}
# print(my_dict)
with open('bigram.txt', 'wt') as out:
    out.write(str(my_dict))
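The estimate the question asks about is then just each bigram count divided by the count of the bigram's first word. A minimal sketch of that step, assuming the my_dict of bigram counts built above (Python 2.7, hence the float() to avoid integer division):

unigram_counts = {}  # how often each word occurs in the whole file
for line in open("hello.txt"):
    for word in line.split():
        unigram_counts[word] = unigram_counts.get(word, 0) + 1

# Estimate P(second | first) ~ count(first, second) / count(first)
estimates = {}
for (first, second), count in my_dict.items():
    estimates[(first, second)] = float(count) / unigram_counts[first]
print(estimates)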

See https://stackoverflow.com/questions/7591258/fast-n-gram-calculation and https://stackoverflow.com/questions/21883108/fast-optimize-n-gram-implementations-in-python as well as https://stackoverflow.com/questions/40373414/counting-bigrams-real-fast-with-or-without-multiprocessing-python – alvas


I need the bigram estimates... all the other answers just give the bigrams. I need the probability from them. Example: count(Hello How) / count(Hello). Do you know how to do that? – Ash


You need an ngram language model... – alvas

Answer


I'll answer your question with a very simple piece of code, just for illustration. Note that estimating bigrams is a bit more involved than you might think. It needs to be done in a divide-and-conquer fashion. It can be estimated with different models, the most common being the Hidden Markov Model, which I explain in the code below. Note that the larger your data, the better the estimate. I tested the following code on the Brown Corpus.

def bigramEstimation(file): 
    '''A very basic solution for the sake of illustration.
    It can be calculated in a more sophisticated way.
    '''

    lst = [] # This will contain the tokens 
    unigrams = {} # for unigrams and their counts 
    bigrams = {} # for bigrams and their counts 

    # 1. Read the textfile, split it into a list 
    text = open(file, 'r').read() 
    lst = text.strip().split() 
    print 'Read ', len(lst), ' tokens...' 

    del text # No further need for text var 



    # 2. Generate unigrams frequencies 
    for l in lst: 
     if not l in unigrams: 
      unigrams[l] = 1 
     else: 
      unigrams[l] += 1 

    print 'Generated ', len(unigrams), ' unigrams...' 

    # 3. Generate bigrams with frequencies 
    for i in range(len(lst) - 1): 
     temp = (lst[i], lst[i+1]) # Tuples are easier to reuse than nested lists 
     if not temp in bigrams: 
      bigrams[temp] = 1 
     else: 
      bigrams[temp] += 1 

    print 'Generated ', len(bigrams), ' bigrams...' 

    # Now Hidden Markov Model 
    # bigramProb = (Count(bigram)/Count(first_word)) + (Count(first_word)/ total_words_in_corpus) 
    # A few things we need to keep in mind 
    total_corpus = sum(unigrams.values()) 
    # You can add smoothed estimation if you want (see the sketch after this function)


    print 'Calculating bigram probabilities and saving to file...' 

    # Comment the following 3 lines if you do not want the header in the file.
    with open("bigrams.txt", 'a') as out:
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob')
        out.write('\n')


    # Append one line per bigram; the with-statement closes the file when done
    with open("bigrams.txt", 'a') as out:
        for k, v in bigrams.iteritems():
            # first_word = 'hello' in the bigram ('hello', 'world')
            first_word = k[0]
            first_word_count = unigrams[first_word]
            # float() avoids Python 2 integer division truncating the ratios to 0
            bi_prob = float(bigrams[k]) / unigrams[first_word]
            uni_prob = float(unigrams[first_word]) / total_corpus

            final_prob = bi_prob + uni_prob
            # Delete whatever you don't want to print into the file
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' + str(first_word_count) + '\t' + str(final_prob))
            out.write('\n')




# Calling the function
bigramEstimation('hello.txt') 
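A side note on the "smoothed estimation" mentioned in the code comments: one common option is add-one (Laplace) smoothing, so that bigrams never seen in the training text do not get probability zero. This is only a sketch under that assumption, taking the same unigrams and bigrams dictionaries as parameters rather than part of the answer's function:

def laplace_bigram_prob(w1, w2, unigrams, bigrams):
    '''Add-one (Laplace) smoothed estimate of P(w2 | w1).'''
    vocab_size = len(unigrams)
    bigram_count = bigrams.get((w1, w2), 0)
    first_word_count = unigrams.get(w1, 0)
    # Every possible bigram receives a pseudo-count of 1
    return float(bigram_count + 1) / (first_word_count + vocab_size)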

I hope this helps you!


See http://cs.nyu.edu/courses/spring17/CSCI-UA.0480-009/lecture3-and-half-n-grams.pdf – alvas


Thanks for your response. But I think it's a little bit off. So if I have the text "Hello Hello How", then for the bigram P(How | Hello) it should take the count of (Hello How), which is 1, divided by the count of (Hello), which is 2. Probability 1/2. – Ash
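For reference, the maximum-likelihood estimate the comment describes can be checked on that toy string in a few lines (an illustrative sketch, not part of the thread's code):

tokens = "Hello Hello How".split()
pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
# count(Hello How) / count(Hello) = 1 / 2
print float(pairs.count(("Hello", "How"))) / tokens.count("Hello")  # 0.5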


What probability do you get for "hello hello"? – Mohammed
