2012-10-26

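Cumulative frequency, Ngrams

Quick question: if you run the code below, you get a separate list of bigram frequencies for each list (each sentence) in the corpus.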

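I want to be able to display and track a running total instead. That is, rather than what you see when you run it now, where every frequency is 1 or 2 because the span being counted is so small, it should count through the whole corpus and display those frequencies.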

After that, I basically need to generate text from those frequencies that mimics the original corpus.

#--------------------------------------------------------- 
#!/usr/bin/env python 
#Ngram Project 

#Import all of the libraries we will need for the program to function 
import nltk 
import nltk.collocations 
from collections import defaultdict 
import nltk.corpus as corpus 
from nltk.corpus import brown 

#--------------------------------------------------------- 

#create our list with the Brown corpus inside variable called "news" 
news = corpus.brown.sents(categories = 'editorial') 
#This will display the type of variable Python recognizes this as 
print "News Is Of The Variable Type : ",type(news),'\n' 

#--------------------------------------------------------- 


#This function will take in the corpus one line at a time 
#After adding a <s> to the beginning of each list item, it also replaces a final period with '</s>'
def alter_list(corpus_list): 
    #Simply check for an instance of a period, and if so, replace with '</s>' 
    if corpus_list[-1] == '.': 
     corpus_list[-1] = '</s>' 
     #strip() removes leading and trailing whitespace, IE '\n'
     corpus_list[-1].strip() 
    #Else add to the end of the list item 
    else: 
     corpus_list.append('</s>') 
    return ['<s>'] + corpus_list 

#Displays the length of the list 'news' 
print "The Length of News is : ",len(news),'\n' 
#Allows the user to choose how much of the annotated corpus they would like to see 
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n' 
user = input() 
#Takes user input to determine how many lines to display if any 
if(user >= 1): 
    print "The Corpus Annotated with <s> and </s> looks like : " 
    print "Displaying [",user,"] rows of the corpus : ", '\n' 
    for corpus_list in news[:user]: 
     print(alter_list(corpus_list),'\n') 
#Non positive number catch 
else: 
    print "Fine I Won't Show You Any... ",'\n' 

#--------------------------------------------------------- 

print '\n' 
#Again allows the user to choose the number of lists from Brown corpus to be displayed in 
# Unigram, bigram, trigram and quadgram format 
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ") 
count = 0 

#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information 
#Displayed to the user 
while(count < user2): 
    passer = news[count] 

    def ngrams(passer, n = 2, padding = True): 
     #Padding refers to the same idea demonstrated above, that is bump the first word to the second, making 
     #'None' the first item in each list so that calculations of frequencies can be made 
     pad = [] if not padding else [None]*(n-1) 
     grams = pad + passer + pad 
     return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1))) 

    #In this case, arguments are first: n-gram type (bi, tri, quad) 
    #Followed by in our case the addition of 'padding' 
    #Padding is used in every case here because we need it for calculations 
    #This function structure allows us to pull in corpus parts without the added annotations if need be 
    for size, padding in ((1,1), (2,1), (3, 1), (4, 1)): 
     print '\n%d - grams || padding = %d' % (size, padding) 
     print list(ngrams(passer, size, padding)) 

    # show frequency 
    counts = defaultdict(int) 
    for n_gram in ngrams(passer, 2, False): 
     counts[n_gram] += 1 

    print ("======================================================================================") 
    print '\nFrequencies Of Bigrams:' 
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True): 
     print c, n_gram 

    print '\nFrequencies Of Trigrams:' 
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True): 
     print c, n_gram 

    count = count + 1 

#--------------------------------------------------------- 

So what exactly is the question?

Answers

1

I'm not sure I understand the question. NLTK has a generate function, and the book that NLTK comes from is available online.

http://nltk.org/book/ch01.html

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.) 

>>> text3.generate() 
In the beginning of his brother is a hairy man , whose top may reach 
unto heaven ; and ye shall sow the land of Egypt there was no bread in 
all that he was taken out of the month , upon the earth . So shall thy 
wages be ? And they made their father ; and Isaac was old , and kissed 
him : and Laban with his cattle in the midst of the hands of Esau thy 
first born , and Phichol the chief butler unto his son Isaac , she 

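Sorry, what I meant was: take the bigrams, trigrams, and quadgrams, compute their probabilities, and then use those to manually generate text that resembles the corpus. – user1378618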

1

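The problem is that you redefine the dictionary counts for every sentence, so the ngram counts are reset to 0 each time. Define it above the while loop and the counts will accumulate over the whole Brown corpus.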
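
A minimal sketch of that change, written for Python 3 rather than the Python 2 of the question. It assumes a tiny hand-made list of tokenized sentences in place of the Brown corpus so that it runs without NLTK; the names sentences and bigrams are illustrative and not part of the original code:

```python
from collections import defaultdict

# Hypothetical stand-in for corpus.brown.sents(categories='editorial')
sentences = [
    ['the', 'cat', 'sat', '.'],
    ['the', 'cat', 'ran', '.'],
    ['the', 'dog', 'sat', '.'],
]

def bigrams(tokens):
    """Return adjacent token pairs from one sentence (no padding)."""
    return zip(tokens, tokens[1:])

# Defined once, ABOVE the loop, so the counts survive from sentence to sentence
counts = defaultdict(int)
for sentence in sentences:
    for gram in bigrams(sentence):
        counts[gram] += 1

# Most frequent bigrams first
for gram, c in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(c, gram)
```

With the dictionary created inside the loop, each sentence would start again from an empty table, which is why the original output only ever shows counts of 1 or 2.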

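Bonus advice: you should also move the definition of ngrams outside the loop; defining the same function over and over is pointless. (Apart from performance, though, it does no harm.) Better still, use nltk's ngrams function and read up on FreqDist, which is like a dictionary on steroids. It will come in handy when you get to statistical text generation.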
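
As a rough illustration of where this leads, the sketch below accumulates bigram counts with collections.Counter (in NLTK 3, FreqDist is a Counter subclass, so it offers the same interface plus extras) and then picks each next word in proportion to how often it followed the previous one. It targets Python 3 and deliberately avoids NLTK; the toy sentences and the helper names build_model and generate are assumptions made for illustration, not NLTK APIs:

```python
import random
from collections import Counter, defaultdict

# Hypothetical stand-in for the tokenized Brown sentences
sentences = [
    ['the', 'cat', 'sat', '.'],
    ['the', 'cat', 'ran', '.'],
    ['the', 'dog', 'sat', '.'],
]

def build_model(sents):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sent in sents:
        tokens = ['<s>'] + sent + ['</s>']
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def generate(model, rng, max_len=20):
    """Sample one sentence, weighting each candidate by its bigram count."""
    word, out = '<s>', []
    for _ in range(max_len):
        followers = model[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == '</s>':
            break
        out.append(word)
    return out

model = build_model(sentences)

# Conditional probability: P(cat | the) = count(the, cat) / count(the, *)
p_cat = model['the']['cat'] / sum(model['the'].values())
print(p_cat)

# Seeded generator so repeated runs give the same sample
print(' '.join(generate(model, random.Random(0))))
```

The same idea extends to trigrams and quadgrams by keying the model on the previous two or three words instead of one.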