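How can I further randomize this text generator?

I'm working on a random text generator, without using Markov chains, and it currently works without too many problems. It actually generates a good number of random sentences by my standards, but I want to make it more accurate so that as few sentences as possible are repeated. First, here is my code flow: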
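1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the Project Gutenberg database for all sentences that contain this word, regardless of uppercase or lowercase.
4. Return the longest sentence that contains the word I mentioned in step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new "trigger" sentence and repeat the process. Note that I then have to get the longest word in the second sentence, and continue like that, and so on.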
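The flow above can be sketched in a few small functions. This is only an illustration, not the posted program: it is written for Python 3, `SAMPLE_SENTS` is a made-up stand-in for `gutenberg.sents()` so that it runs without the NLTK corpus, and the names `longest_word`, `sentences_containing` and `generate` are hypothetical. Like the posted code, it picks a random matching sentence rather than the longest one.

```python
import random

# Hypothetical stand-in for gutenberg.sents(): each sentence is a list of tokens.
SAMPLE_SENTS = [
    ["Scotland", "is", "a", "nice", "place", "."],
    ["The", "hills", "of", "SCOTLAND", "are", "covered", "in", "heather", "."],
    ["Heather", "grows", "on", "open", "moorland", "."],
]


def longest_word(sentence):
    """Step 2: longest word of a sentence string, punctuation stripped."""
    words = [w.strip(".,;:!?\"'") for w in sentence.split()]
    return max((w for w in words if w), key=len, default="")


def sentences_containing(word, sents):
    """Step 3: case-insensitive search; matches are returned as strings."""
    target = word.lower()
    return [" ".join(s) for s in sents if target in (tok.lower() for tok in s)]


def generate(trigger, sents, rounds=3, rng=random):
    """Steps 4-6: chain sentences, each one becoming the next trigger."""
    output = [trigger]
    for _ in range(rounds):
        matches = sentences_containing(longest_word(trigger), sents)
        if not matches:
            break  # nothing contains the keyword, so the chain ends
        trigger = rng.choice(matches)
        output.append(trigger)
    return output


if __name__ == "__main__":
    for line in generate("Scotland is a nice place.", SAMPLE_SENTS):
        print(line)
```

Because every round recomputes the keyword from the current trigger only, no state is carried from one sentence to the next.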
Here is my code:
    import nltk
    from nltk.corpus import gutenberg
    from random import choice
    import smtplib #will be for send e-mail option later

    triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

    longestLength = 0
    longestString = ""
    longestLen2 = 0
    longestStr2 = ""

    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
    listOfWords = gutenberg.words()# all words in gutenberg books -list format-

    while triggerSentence:#run the loop so long as there is a trigger sentence
        sets = []
        sets2 = []
        split_str = triggerSentence.split()#split the sentence into words

        #code to find the longest word in the trigger sentence input
        for piece in split_str:
            if len(piece) > longestLength:
                longestString = piece
                longestLength = len(piece)

        #code to get the sentences containing the longest word, then selecting
        #random one of these sentences that are longer than 40 characters
        for sentence in listOfSents:
            if sentence.count(longestString):
                sents= " ".join(sentence)
                if len(sents) > 40:
                    sets.append(" ".join(sentence))

        triggerSentence = choice(sets)
        print triggerSentence #the first sentence that comes up after I enter input-

        split_str = triggerSentence.split()

        for apiece in triggerSentence: #find the longest word in this new sentence
            if len(apiece) > longestLen2:
                longestStr2 = piece
                longestLen2 = len(apiece)

        if longestStr2 == longestString:
            second_longest = sorted(split_str, key=len)[-2]#this should return the second longest word in the sentence in case it's longest word is as same as the longest word of last sentence
            #print second_longest #now get second longest word if first is same
            #as longest word in previous sentence

            for sentence in listOfSents:
                if sentence.count(second_longest):
                    sents = " ".join(sentence)
                    if len(sents) > 40:
                        sets2.append(" ".join(sentence))
            triggerSentence = choice(sets2)
        else:
            for sentence in listOfSents:
                if sentence.count(longestStr2):
                    sents = " ".join(sentence)
                    if len(sents) > 40:
                        sets.append(" ".join(sentence))
            triggerSentence = choice(sets)
        print triggerSentence
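According to my code, once I enter a trigger sentence, I should get another sentence that contains the longest word of the trigger sentence I entered. This new sentence then becomes the trigger sentence, and its longest word is picked. This is where the problem sometimes occurs. I have observed that, even though I added the lines of code from line 47 to the end, the algorithm can still pick the longest word of the sentences that come up instead of looking for the second-longest word.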
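Reading the posted loop, a few things look like plausible causes, although this has not been checked against the Gutenberg corpus. `for apiece in triggerSentence` iterates over a string, which yields single characters rather than words, and its body assigns `piece` (a leftover from the earlier loop) instead of `apiece`. `longestLength` and `longestLen2` are never reset, so a shorter word in a later sentence can never replace the stored one. Finally, `sets` still holds the matches for the first keyword when the `else` branch appends to it. The sketch below, in Python 3, reproduces the character-iteration effect and shows one way to pick the next keyword with fresh state; both function names are hypothetical.

```python
def longest_by_iterating_string(sentence):
    """Mimics the second loop: iterating a str yields one character at a time."""
    longest, longest_len = "", 0
    for apiece in sentence:  # apiece is a single character, never a word
        if len(apiece) > longest_len:
            longest, longest_len = apiece, len(apiece)
    return longest


def next_keyword(sentence, previous_keyword):
    """Longest word that differs from the previous keyword, else None.

    All state is local, so nothing carries over from earlier rounds.
    """
    unique = list(dict.fromkeys(sentence.split()))  # dedupe, keep word order
    ranked = sorted(unique, key=len, reverse=True)  # longest first, stable on ties
    for word in ranked:
        if word.lower() != previous_keyword.lower():
            return word
    return None


if __name__ == "__main__":
    print(longest_by_iterating_string("Scotland is a nice place."))  # prints "S"
    print(next_keyword("Scotland has many castles", "Scotland"))
```

With `next_keyword`, the second-longest word is used only when the longest one equals the previous keyword, which is the behaviour the `sorted(split_str, key=len)[-2]` line was aiming for.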
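For example:
Trigger string = "Scotland is a nice place."
Sentence 1 = -this is a random sentence with the word Scotland in it-
Now, this is where the problem in my code can occur at times. It doesn't matter whether it shows up in sentence 2 or 942 or a zillion or whatever, but I'll put it in sentence 2 for the sake of example.
Sentence 2 = another sentence that has the word Scotland in it, but not the second-longest word of sentence 1. According to my code, this should have been a sentence containing the second-longest word of sentence 1, not Scotland!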
How can I fix this? I have tried to optimize the code as much as I could, and any help is welcome.
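On the goal of cutting down repeats, one option (a sketch only, and a swapped-in technique rather than anything in the posted code) is to remember which keywords and sentences have already been used and skip them. The helper names `pick_keyword` and `pick_sentence` are hypothetical, and the code assumes Python 3.

```python
import random


def pick_keyword(sentence, used_words):
    """Longest word not yet used as a keyword (case-insensitive), or None."""
    words = [w.strip(".,;:!?\"'") for w in sentence.split()]
    fresh = [w for w in words if w and w.lower() not in used_words]
    return max(fresh, key=len, default=None)


def pick_sentence(matches, used_sentences, rng=random):
    """Random match that has not been printed before, or None if all were used."""
    fresh = [s for s in matches if s not in used_sentences]
    return rng.choice(fresh) if fresh else None


if __name__ == "__main__":
    used_words = {"scotland"}  # keywords from earlier rounds, lowercased
    print(pick_keyword("Scotland has many castles.", used_words))
```

The caller would add each chosen keyword (lowercased) to `used_words` and each printed sentence to `used_sentences`, and stop or fall back to a new trigger when either helper returns `None`.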
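What's with all the newlines at the beginning? – 2010-08-29 22:19:14
@Zonda333, newlines? – 2010-08-29 22:29:43
@Zonda333, oh, if you mean why there are always blank lines between the lines of code at the start (at import nltk and so on), I did that on purpose. When copying and pasting the code, the lines tend to run together, and I may have pressed Enter between lines a few times too many. – 2010-08-29 22:32:43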