2012-10-20 21 views
5

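How can I create paragraphs from Markov chain output?

I want to modify the script below so that it builds paragraphs out of a random number of the sentences it generates. In other words, concatenate a random number (say 1-5) of sentences before adding a newline.

The script works as is, but the output is short sentences separated by newlines. I would like to gather some of those sentences into paragraphs.

Any thoughts on best practice? Thanks.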

""" 
    from: http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python 
""" 

import random
import sys

stopword = "\n"  # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",)  # Cause a "new sentence" if found at the end of a word
sentencesep = "\n"  # String used to separate sentences


# GENERATE TABLE 
w1 = stopword 
w2 = stopword 
table = {} 

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault((w1, w2), []).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
# Mark the end of the file 
table.setdefault((w1, w2), []).append(stopword) 

# GENERATE SENTENCE OUTPUT 
maxsentences = 20 

w1 = stopword 
w2 = stopword 
sentencecount = 0 
sentence = [] 

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword:
        sys.exit()
    if newword in stopsentence:
        print("%s%s%s" % (" ".join(sentence), newword, sentencesep))
        sentence = []
        sentencecount += 1
    else:
        sentence.append(newword)
    w1, w2 = w2, newword

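Edit 01:

OK, I've cobbled together a simple "paragraph wrapper" that does a good job of gathering sentences into paragraphs, but it messes up the output of the sentence generator: I'm getting excessive repetition of the first words, for example, among other issues.

The premise is sound, though; I just need to figure out why the behavior of the sentence loop is affected by the paragraph loop. Please advise if you can see the problem: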

### 
# usage: $ python markov_sentences.py < input.txt > output.txt
# from: http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python 
### 

import random
import sys

stopword = "\n"  # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",)  # Cause a "new sentence" if found at the end of a word
paragraphsep = "\n\n"  # String used to separate paragraphs


# GENERATE TABLE 
w1 = stopword 
w2 = stopword 
table = {} 

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault((w1, w2), []).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
# Mark the end of the file 
table.setdefault((w1, w2), []).append(stopword) 

# GENERATE PARAGRAPH OUTPUT 
maxparagraphs = 10 
paragraphs = 0 # reset the outer 'while' loop counter to zero 

while paragraphs < maxparagraphs:  # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0  # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1, 5)  # random sentences per paragraph

    while sentencecount < maxsentences:  # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)])  # random word from word table
        if newword == stopword:
            sys.exit()
        elif newword in stopsentence:
            print("%s%s" % (" ".join(sentence), newword), end=" ")
            sentencecount += 1  # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print(paragraphsep)  # newline space
    paragraphs = paragraphs + 1  # increment the paragraph counter


# EOF 

Edit 02:

Added sentence = [] to the elif clause, as per the answer below. To wit:

        elif newword in stopsentence:
            print("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = []  # I have to be here to make the new sentence start as an empty list!!!
            sentencecount += 1  # increment the sentence counter

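Edit 03:

This is the final iteration of this script. Thanks to grieve for the help in sorting this out. I hope others can have some fun with it; I know I will. ;)

FYI: there is one small artifact. Each paragraph ends with an extra trailing space, which you may want to clean up if you use this script. Other than that, it is a perfect implementation of Markov chain text generation.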

### 
# usage: python markov_sentences.py < input.txt > output.txt
# from: http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python 
### 

import random
import sys

stopword = "\n"  # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",)  # Cause a "new sentence" if found at the end of a word
sentencesep = "\n"  # String used to separate sentences


# GENERATE TABLE 
w1 = stopword 
w2 = stopword 
table = {} 

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault((w1, w2), []).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
# Mark the end of the file 
table.setdefault((w1, w2), []).append(stopword) 

# GENERATE SENTENCE OUTPUT 
maxsentences = 20 

w1 = stopword 
w2 = stopword 
sentencecount = 0 
sentence = [] 
paragraphsep = "\n" 
count = random.randrange(1,5) 

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])  # random word from word table
    if newword == stopword:
        sys.exit()
    if newword in stopsentence:
        print("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1  # increment the sentence counter
        count -= 1
        if count == 0:
            count = random.randrange(1, 5)
            print(paragraphsep)  # newline space
    else:
        sentence.append(newword)
    w1, w2 = w2, newword


# EOF 
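
One possible way to avoid the trailing space mentioned above is to collect the finished sentences in a list and join them, instead of printing each one with end=" ". The sketch below is not part of the original script: group_into_paragraphs and its parameters are hypothetical names, and it assumes the sentences have already been generated as a list of strings.

```python
import random


def group_into_paragraphs(sentences, min_len=1, max_len=5, rng=random):
    """Group sentence strings into paragraphs of random length (hypothetical helper)."""
    paragraphs = []
    i = 0
    while i < len(sentences):
        size = rng.randint(min_len, max_len)  # randint includes both endpoints
        # join() puts a space only between sentences, never after the last one
        paragraphs.append(" ".join(sentences[i:i + size]))
        i += size
    return paragraphs


if __name__ == "__main__":
    demo = ["One.", "Two!", "Three?", "Four.", "Five.", "Six."]
    # Paragraphs are separated by a blank line, with no stray trailing spaces
    print("\n\n".join(group_into_paragraphs(demo)))
```

The trade-off is that nothing is printed until the sentences have been collected, which should be fine for the 20-sentence runs used here.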

Answers

3

You need to copy

sentence = []

back into the

elif newword in stopsentence:

clause.

So:

while paragraphs < maxparagraphs:  # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0  # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1, 5)  # random sentences per paragraph

    while sentencecount < maxsentences:  # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)])  # random word from word table
        if newword == stopword:
            sys.exit()
        elif newword in stopsentence:
            print("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = []  # I have to be here to make the new sentence start as an empty list!!!
            sentencecount += 1  # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print(paragraphsep)  # newline space
    paragraphs = paragraphs + 1  # increment the paragraph counter
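
A side note on the sentence count: the question asks for 1-5 sentences per paragraph, but random.randrange(1, 5) excludes its upper bound, so it only ever returns 1 through 4. If 5 is meant to be possible, random.randint(1, 5) is one option. The small check below is only an illustration; the seed is there to make the demo repeatable.

```python
import random

rng = random.Random(42)  # seeded only so the demo is repeatable

# randrange(1, 5) excludes the upper bound, like range(1, 5)
randrange_values = {rng.randrange(1, 5) for _ in range(1000)}

# randint(1, 5) includes both endpoints
randint_values = {rng.randint(1, 5) for _ in range(1000)}

print(sorted(randrange_values))  # never contains 5
print(sorted(randint_values))    # may contain 5
```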

Edit

Here is a solution that does not use the outer loop.

""" 
    from: http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python 
""" 

import random
import sys

stopword = "\n"  # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",)  # Cause a "new sentence" if found at the end of a word
sentencesep = "\n"  # String used to separate sentences


# GENERATE TABLE 
w1 = stopword 
w2 = stopword 
table = {} 

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault((w1, w2), []).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
# Mark the end of the file 
table.setdefault((w1, w2), []).append(stopword) 

# GENERATE SENTENCE OUTPUT 
maxsentences = 20 

w1 = stopword 
w2 = stopword 
sentencecount = 0 
sentence = [] 
paragraphsep = "\n\n"  # assignment, not comparison: '==' here would raise a NameError
count = random.randrange(1,5) 

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword:
        sys.exit()
    if newword in stopsentence:
        print("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1
        count -= 1
        if count == 0:
            count = random.randrange(1, 5)
            print(paragraphsep)
    else:
        sentence.append(newword)
    w1, w2 = w2, newword
+0

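Oops! Yes, I must have pulled it out at some point and forgotten to put it back. Thanks for the insight! That almost did it. It seems the sentence loop reuses the same starting word for every sentence. Any thoughts on how to mix up the first word it picks for sentence generation? –

+0

I added a standalone solution that does not need the outer loop. – grieve

+0

I don't have Python 3 installed at the moment, so you may need to adjust the syntax of the second solution. – grieve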
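
On the repeated first word raised in the comment above: as far as I can tell from the code, the outer-loop version resets w1 and w2 to stopword for every paragraph, and the (stopword, stopword) state is only ever recorded once, at the very start of the input. Its successor list therefore holds a single word, so random.choice has nothing to choose between. The sketch below rebuilds the table from a short string to show this; build_table and the sample text are made up for illustration and do not appear in the original script.

```python
def build_table(text, stopword="\n", stopsentence=(".", "!", "?")):
    """Build the same (w1, w2) -> [next words] table as the script, from a string."""
    w1 = w2 = stopword
    table = {}
    for word in text.split():
        if word[-1] in stopsentence:
            # Store the word and its terminator as two separate tokens
            table.setdefault((w1, w2), []).append(word[:-1])
            w1, w2 = w2, word[:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
    table.setdefault((w1, w2), []).append(stopword)  # mark the end of the input
    return table


table = build_table("The cat sat. A dog ran. The end.")

# The start state is visited exactly once, so it has exactly one successor
print(table[("\n", "\n")])

# Later sentence starts are reached through (last word, terminator) states
print(table[("sat", ".")])
print(table[("ran", ".")])
```

Carrying w1 and w2 over from one paragraph to the next, as the single-loop solution does, lets each new sentence start from whatever state the previous one ended in.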

1

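Do you understand this code? I bet you can find the bit that prints the sentence and change it to print several sentences together without a return. You can then add another while loop around the sentence bit to get multiple paragraphs.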

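Syntax hints (Python 2 print statement):

print 'hello'
print 'there'
# output:
# hello
# there

print 'hello',
print 'there'
# output:
# hello there

print 'hello',
print
print 'there'

The point is that a trailing comma on a print statement suppresses the newline at the end of the line, and a bare print statement prints just a newline.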
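
Since the scripts in the question call print() as a function with end=" ", here is a rough Python 3 counterpart of the hints above. It is only a sketch: the demo function and the stdout capture are there just to make the resulting text visible, and they are not needed in the real script.

```python
import io
from contextlib import redirect_stdout


def demo():
    print('hello', end=' ')  # Python 3 replacement for the trailing comma
    print('there')           # completes the line "hello there"
    print('hello', end='')   # suppress the newline without adding a space
    print()                  # a bare print() call emits just the newline
    print('there')


buffer = io.StringIO()
with redirect_stdout(buffer):
    demo()
captured = buffer.getvalue()
print(captured, end='')
```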

+0

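Yes, I follow. The trouble is that nothing I've tried with the 'print' statement has helped gather the sentences into paragraphs (unless you count stripping out all the newlines to make one giant paragraph). A 'while' loop is what I had in mind, but I wasn't quite sure how to wrap the sentence part. Everything I tried resulted in various errors, so I thought I'd ask the experts. What is the best way to tell it "generate x (say 1-5) sentences, then insert a newline, then repeat until 'maxsentences' is reached"? –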