Python NLTK Shakespeare corpus

I am trying to import sentences from the NLTK Shakespeare corpus, following this help page, but I am having trouble accessing the sentences (in order to train a word2vec model):

from nltk.corpus import shakespeare #XMLCorpusreader 
shakespeare.fileids() 
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', ...] 

play = shakespeare.xml('dream.xml') #ElementTree object 
print(play) 
<Element 'PLAY' at ...> 

for i in range(9): 
    print('%s: %s' % (play[i].tag, play[i].text))

This returns the following:

TITLE: A Midsummer Night's Dream 
PERSONAE: 

SCNDESCR: SCENE Athens, and a wood near it. 
PLAYSUBT: A MIDSUMMER NIGHT'S DREAM 
ACT: None 
ACT: None 
ACT: None 
ACT: None 
ACT: None

Why are all the ACT elements None?

None of the methods defined here (http://www.nltk.org/howto/corpus.html#data-access-methods), such as .sents(), tagged_sents(), chunked_sents() and parsed_sents(), seem to work when applied to the shakespeare XMLCorpusReader.

I would like to understand:
1/ how to get the sentences
2/ how to find them in the ElementTree object

2017-05-01 Romain G

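The problem boils down to how to extract the text from all the children of an element in an element tree. This is pretty much a duplicate of Python element tree - extract text from element, stripping tags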
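
To see why the ACT entries print None, here is a minimal sketch on a hand-written XML string. The string is a hypothetical miniature of the corpus layout, not the real dream.xml. An element's .text only holds the text between its opening tag and its first child, so an element that starts directly with a child tag has no .text at all, while itertext() walks the whole subtree.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the corpus layout, not the real dream.xml file.
SAMPLE = (
    "<PLAY><TITLE>A Play</TITLE>"
    "<ACT><TITLE>ACT I</TITLE><SCENE><TITLE>SCENE I</TITLE></SCENE></ACT>"
    "</PLAY>"
)

play = ET.fromstring(SAMPLE)
act = play.find('ACT')

# <ACT> is followed immediately by <TITLE>, so there is no text of its own.
print(act.text)  # None

# itertext() yields every text fragment found anywhere below the element.
act_fragments = list(act.itertext())
print(act_fragments)  # ['ACT I', 'SCENE I']
```

The same reasoning explains why TITLE and SCNDESCR print their text directly: they contain text and no child elements.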

Try this:

for p in play: 
    print('%s: %s' % (p.tag, list(p.itertext())))

Insert here the logic for whatever you want to do.
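
As one possible version of that logic, the sketch below collects one token list per LINE element. It rests on assumptions: the tag layout (ACT > SCENE > SPEECH > SPEAKER/LINE) is assumed to match the corpus files, the sample string is a hand-written stand-in for shakespeare.xml('dream.xml'), tokenized_lines is a hypothetical helper name, and the regex tokenizer is a simplification rather than anything NLTK provides.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for shakespeare.xml('dream.xml'); the tag layout
# is an assumption about the corpus files.
SAMPLE = """
<PLAY>
  <TITLE>A Play</TITLE>
  <ACT><TITLE>ACT I</TITLE>
    <SCENE><TITLE>SCENE I</TITLE>
      <SPEECH>
        <SPEAKER>THESEUS</SPEAKER>
        <LINE>Now, fair Hippolyta, our nuptial hour</LINE>
        <LINE>Draws on apace; four happy days bring in</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""


def tokenized_lines(play):
    """Return one lowercase token list per <LINE> element (hypothetical helper)."""
    result = []
    for line in play.iter('LINE'):
        # itertext() also picks up text inside any tags nested in the line.
        text = ''.join(line.itertext())
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:
            result.append(tokens)
    return result


sentences = tokenized_lines(ET.fromstring(SAMPLE))
print(sentences)
```

With the real corpus, the Element returned by shakespeare.xml('dream.xml') could be passed to tokenized_lines in place of the parsed sample. A list of token lists like this is the usual input shape for a word2vec trainer such as gensim's Word2Vec. Note that these are verse lines rather than grammatical sentences, so joining the lines of each SPEECH and running a sentence tokenizer may suit some uses better.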

2017-05-01 15:12:30
