2012-12-13

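How to generate random paragraphs using NLTK

I'm trying to build a test suite to stress test a very large implementation of a publishing management system. I was thinking of using NLTK to generate paragraphs on different topics, along with random titles for the articles.

Can NLTK do something like this? I'd like each article to be as unique as possible so I can test different layout sizes, and I'd like the same for the subjects.

P.S. I need to generate over 1 million articles, which will eventually be used to test many things (performance, search, layout, etc.).

Can anyone please advise?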

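Does it have to be NLTK? I've achieved the result you need in other ways in the past. – rikAtee

No, not at all, it can be anything. I was under the impression that only NLTK could do this, but if you have any other options, by all means. –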

Answer

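I've used this. It takes phrases from Noam Chomsky and generates random paragraphs. You can change the source text to anything you like. The more text you give it, the better, of course.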

# List of LEADINs to buy time. 
leadins = """To characterize a linguistic level L, 
     On the other hand, 
     This suggests that 
     It appears that 
     Furthermore """ 

# List of SUBJECTs chosen for maximum professorial macho. 
subjects = """ the notion of level of grammaticalness 
     a case of semigrammaticalness of a different sort 
     most of the methodological work in modern linguistics 
     a subset of English sentences interesting on quite independent grounds 
     the natural general principle that will subsume this case """ 

#List of VERBs chosen for autorecursive obfuscation. 
verbs = """can be defined in such a way as to impose 
     delimits 
     suffices to account for 
     cannot be arbitrary in 
     is not subject to """ 


# List of OBJECTs selected for profound sententiousness. 

objects = """ problems of phonemic and morphological analysis. 
     a corpus of utterance tokens upon which conformity has been defined by the paired utterance test. 
     the traditional practice of grammarians. 
     the levels of acceptability from fairly high (e.g. (99a)) to virtual gibberish (e.g. (98d)). 
     a stipulation to place the constructions into these various categories. 
     a descriptive fact. 
     a parasitic gap construction.""" 

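import random
import textwrap
from itertools import chain, islice

def chomsky(times=1, line_length=72):
    """Return up to `times` random sentences, wrapped to `line_length` columns."""
    parts = []
    for part in (leadins, subjects, verbs, objects):
        # One phrase per line; strip the indentation inside the string literals.
        phraselist = [line.strip() for line in part.splitlines()]
        random.shuffle(phraselist)
        parts.append(phraselist)
    # zip pairs one phrase of each kind; chain flattens the tuples for joining.
    output = chain(*islice(zip(*parts), 0, times))
    return textwrap.fill(' '.join(output), line_length)

print(chomsky())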

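This gives me something like:

This suggests that a case of semigrammaticalness of a different sort is
not subject to a corpus of utterance tokens upon which conformity has
been defined by the paired utterance test.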
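One thing to watch: the phrase lists are paired up position by position, so the output stops at the shortest list. With the five lead-ins above, asking for ten sentences still yields only five, and the lists as given allow just 5 × 5 × 5 × 7 = 875 distinct sentences. If you need paragraphs of widely varying length, a variant that picks each phrase independently may be a better fit. The sketch below is not part of the original recipe; `random_sentence`, `random_paragraph` and the shortened phrase lists are hypothetical names and sample data used only for illustration.

```python
import random
import textwrap

# Shortened sample lists (assumption: swap in the full lists from above).
LEADINS = ["On the other hand,", "This suggests that", "It appears that"]
SUBJECTS = [
    "the notion of level of grammaticalness",
    "a case of semigrammaticalness of a different sort",
    "most of the methodological work in modern linguistics",
]
VERBS = ["delimits", "suffices to account for", "is not subject to"]
OBJECTS = [
    "the traditional practice of grammarians.",
    "a descriptive fact.",
    "a parasitic gap construction.",
]

def random_sentence(rng):
    """Pick one phrase of each kind independently and join them."""
    groups = (LEADINS, SUBJECTS, VERBS, OBJECTS)
    return " ".join(rng.choice(group) for group in groups)

def random_paragraph(rng, sentences=5, line_length=72):
    """Build a paragraph of any length; phrases may repeat across sentences."""
    text = " ".join(random_sentence(rng) for _ in range(sentences))
    return textwrap.fill(text, line_length)

rng = random.Random(42)  # seeded so a test run can be reproduced
print(random_paragraph(rng, sentences=rng.randint(1, 12)))
```

Because every sentence is drawn on its own, the paragraph length is no longer capped by the shortest list, at the cost of occasional repeated phrases within a paragraph.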

And for the title, you can of course do:

chomsky().split('\n')[0]
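For the million-article case, it may help to produce articles lazily instead of building them all in memory. The following is only a rough sketch under my own assumptions: `generate_articles`, its parameters and the tiny `PHRASES` data are hypothetical, and the title reuses the first-wrapped-line trick shown above.

```python
import random
import textwrap

# Tiny sample data (assumption: replace with your own, much larger, lists).
PHRASES = (
    ["On the other hand,", "This suggests that", "Furthermore"],
    ["the notion of level of grammaticalness", "a descriptive fact"],
    ["delimits", "suffices to account for", "is not subject to"],
    ["the traditional practice of grammarians.", "a parasitic gap construction."],
)

def sentence(rng):
    """Join one randomly chosen phrase from each group."""
    return " ".join(rng.choice(group) for group in PHRASES)

def generate_articles(count, seed=0, max_paragraphs=6, max_sentences=8):
    """Lazily yield (title, body) pairs so the full set never sits in memory."""
    rng = random.Random(seed)  # fixed seed makes a test data set repeatable
    for _ in range(count):
        paragraphs = []
        for _ in range(rng.randint(1, max_paragraphs)):
            n_sentences = rng.randint(1, max_sentences)
            text = " ".join(sentence(rng) for _ in range(n_sentences))
            paragraphs.append(textwrap.fill(text, 72))
        # The first wrapped line of a fresh sentence serves as the title.
        title = textwrap.fill(sentence(rng), 72).split("\n")[0]
        yield title, "\n\n".join(paragraphs)

for title, body in generate_articles(3, seed=1):
    print(title)
    print(body)
    print("-" * 72)
```

Varying the number of paragraphs and sentences per article gives a spread of layout sizes. Nothing here guarantees uniqueness; duplicates simply become less likely as the phrase lists grow.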