How can I further randomize this text generator?

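I am working on a random text generator that does not use Markov chains. It currently works without too many problems and, by my standards, generates a good number of random sentences, but I want to make it more accurate so that sentences repeat as little as possible. First, here is the flow of my code: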

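1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of uppercase or lowercase.
4. Return the longest sentence that contains the word I talked about in step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new "trigger" sentence and repeat the process. Note that I then have to get the longest word in the second sentence, and so on.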
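
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TOY_SENTS` is a made-up stand-in for `gutenberg.sents()`, the helper names are hypothetical, and it is written for Python 3 rather than the Python 2 of the posted code. By default it takes the longest matching sentence, as step 4 says; passing `pick=random.choice` mirrors the random selection the posted code actually uses.

```python
import random

# Hypothetical toy corpus standing in for gutenberg.sents(): a list of token lists.
TOY_SENTS = [
    ["Scotland", "is", "a", "nice", "place", "."],
    ["The", "mountains", "of", "Scotland", "are", "beautiful", "."],
    ["Beautiful", "weather", "is", "rare", "."],
]

PUNCT = ".,;:!?\"'"


def longest_word(sentence):
    """Step 2: longest whitespace-separated word, punctuation stripped."""
    words = [w.strip(PUNCT) for w in sentence.split()]
    return max(words, key=len)  # on ties, max() keeps the first candidate


def sentences_containing(word, sents):
    """Step 3: joined sentences that contain `word`, ignoring case."""
    target = word.lower()
    return [" ".join(s) for s in sents if any(t.lower() == target for t in s)]


def generate(trigger, sents, steps=2, pick=None):
    """Steps 4-6: chain sentences, each found via the previous one's longest word."""
    pick = pick or (lambda candidates: max(candidates, key=len))
    chain = [trigger]
    for _ in range(steps):
        candidates = sentences_containing(longest_word(chain[-1]), sents)
        if not candidates:
            break
        chain.append(pick(candidates))
    return chain


chain = generate("Scotland is a nice place.", TOY_SENTS)
random_chain = generate("Scotland is a nice place.", TOY_SENTS, pick=random.choice)
```

With the deterministic default, the chain settles on the same sentence almost immediately on this toy corpus, which is the kind of repetition the question is trying to avoid.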

Here is my code:

import nltk 

from nltk.corpus import gutenberg 

from random import choice 

import smtplib #will be for send e-mail option later 

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str 

longestLength = 0 

longestString = "" 

longestLen2 = 0 

longestStr2 = "" 

listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format- 

listOfWords = gutenberg.words()# all words in gutenberg books -list format- 

while triggerSentence:#run the loop so long as there is a trigger sentence 
    sets = [] 
    sets2 = [] 
    split_str = triggerSentence.split()#split the sentence into words 

    #code to find the longest word in the trigger sentence input 
    for piece in split_str: 
     if len(piece) > longestLength: 
      longestString = piece 
      longestLength = len(piece) 





    #code to get the sentences containing the longest word, then selecting 
    #random one of these sentences that are longer than 40 characters 

    for sentence in listOfSents: 
     if sentence.count(longestString): 
      sents= " ".join(sentence) 
      if len(sents) > 40: 
       sets.append(" ".join(sentence)) 


    triggerSentence = choice(sets) 
    print triggerSentence #the first sentence that comes up after I enter input- 
    split_str = triggerSentence.split() 

    for apiece in triggerSentence: #find the longest word in this new sentence 
     if len(apiece) > longestLen2: 
      longestStr2 = piece 
      longestLen2 = len(apiece) 
    if longestStr2 == longestString: 
     second_longest = sorted(split_str, key=len)[-2]#this should return the second longest word in the sentence in case it's longest word is as same as the longest word of last sentence 
    #print second_longest #now get second longest word if first is same 
      #as longest word in previous sentence 

     for sentence in listOfSents: 
      if sentence.count(second_longest): 
       sents = " ".join(sentence) 
       if len(sents) > 40: 
        sets2.append(" ".join(sentence)) 
     triggerSentence = choice(sets2) 
    else: 
     for sentence in listOfSents: 
      if sentence.count(longestStr2): 
       sents = " ".join(sentence) 
       if len(sents) > 40: 
        sets.append(" ".join(sentence)) 
     triggerSentence = choice(sets) 


    print triggerSentence

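According to my code, once I enter a trigger sentence, I should get another sentence that contains the longest word of the trigger sentence I entered. This new sentence then becomes the trigger sentence, and its longest word is picked. This is where the problem sometimes occurs. I have observed that, despite the lines of code I added (from line 47 to the end), the algorithm can still select the longest word of the sentences that come up, instead of looking for the second-longest word.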

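For example:

Trigger string = "Scotland is a nice place."

Sentence 1 = (a random sentence with the word Scotland in it)

Now, this is where the problem can occur in my code at times. It doesn't matter whether it shows up in sentence 2 or 942 or a zillion or whatever, but I put it in sentence 2 for the sake of example:

Sentence 2 = another sentence that has the word Scotland in it, but not the second-longest word of sentence 1. According to my code, this should have been a sentence containing the second-longest word of sentence 1, not Scotland!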

How can I solve this? I have tried to optimize the code as much as I could, and any help is welcome.
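
One possible reading of the behaviour described above, based only on the posted code: the second loop iterates over `triggerSentence` (a string, so it walks characters rather than words) and assigns `piece` instead of `apiece`, so `longestStr2` ends up as whatever `piece` last held. In addition, `longestLength` is never reset between iterations, and the `else` branch appends to `sets`, which still holds the sentences found for the previous word. Below is a minimal Python 3 sketch of a helper that picks the next keyword and skips the one just used. The names `pick_keyword` and `choose_unseen` are hypothetical, not part of the original code.

```python
import random

PUNCT = ".,;:!?\"'"


def pick_keyword(sentence, previous_keyword=None):
    """Return the longest word of `sentence`, skipping `previous_keyword`.

    Falls back to the next-longest word when the longest one was already
    used for the previous lookup; returns None if nothing else is left.
    """
    words = [w.strip(PUNCT) for w in sentence.split()]
    unique = list(dict.fromkeys(w for w in words if w))  # de-duplicate, keep order
    ranked = sorted(unique, key=len, reverse=True)       # longest first, stable on ties
    skip = previous_keyword.lower() if previous_keyword else None
    for word in ranked:
        if word.lower() != skip:
            return word
    return None


def choose_unseen(candidates, seen, rng=random):
    """Pick a random candidate that was not used before; None if all were used."""
    fresh = [c for c in candidates if c not in seen]
    if not fresh:
        return None
    chosen = rng.choice(fresh)
    seen.add(chosen)
    return chosen
```

Keeping a `seen` set across iterations is one simple way to stop the same sentence from coming up twice. Computing the keyword from scratch for every new sentence avoids carrying a stale "longest length" from an earlier round.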

2010-08-29 mojave_ranger

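What's with all the line breaks at the beginning? – 2010-08-29 22:19:14

@Zonda333, line breaks? – 2010-08-29 22:29:43

@Zonda333, oh, if you mean why there are always blank lines between the lines of code at the beginning (at import nltk and so on), I did that on purpose. When copying and pasting code, the lines tend to run together, and I may have pressed Enter between lines more often than needed. – 2010-08-29 22:32:43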

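There is nothing really random about the algorithm. It should always be deterministic.

I'm not quite sure what you are trying to do here. If the goal is to generate random words, just use a dictionary and the random module. If you want to grab random sentences from Project Gutenberg, use the random module to pick a work and then pick a sentence from that work.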
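
A minimal sketch of the second suggestion, in Python 3. `TOY_WORKS` is a made-up stand-in for the corpus and `random_sentence` is a hypothetical name. With NLTK installed and the corpus downloaded, `gutenberg.fileids()` and `gutenberg.sents(fileid)` could presumably supply the titles and sentences instead.

```python
import random

# Hypothetical stand-in for a corpus: work title -> list of sentences.
TOY_WORKS = {
    "work-a": ["First sentence of A.", "Second sentence of A."],
    "work-b": ["Only sentence of B."],
}


def random_sentence(works, rng=random):
    """Pick a work at random, then a sentence at random from that work."""
    title = rng.choice(sorted(works))
    return title, rng.choice(works[title])


title, sentence = random_sentence(TOY_WORKS, random.Random(42))
```

Note that picking the work first makes sentences from short works more likely than sentences from long ones. Choosing directly from the full sentence list would weight every sentence equally.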

2010-08-30 21:34:27 aterrel
