2011-01-22 47 views
1

How do I extract words from the sample corpora that come with NLTK?

NLTK comes with some sample corpora: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

I want only the text, without the markup, and I don't know how to extract it. What I want to extract is:

1) nps_chat: the file names are unchanged after unzipping, e.g. 10-19-20s_706posts.xml. These files are in XML format, like this:

<Posts> 
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals> 

       <t pos="RB" word="now"/> 
       <t pos="PRP" word="im"/> 
       <t pos="VBD" word="left"/> 
       <t pos="IN" word="with"/> 
       <t pos="DT" word="this"/> 
       <t pos="JJ" word="gay"/> 
       <t pos="NN" word="name"/> 
      </terminals> 

     </Post> 
      ... 
      ... 

I only want the actual post:

now im left with this gay name 

How can I do this in NLTK (or with anything else) and save the bare posts to local disk once the markup has been stripped?

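2) Switchboard transcripts. This type of file (the file name after unzipping is discourse) has the following format. What I want is to strip the leading tags: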

o A.1 utt1: Okay,/
qy A.1 utt2: have you ever served as a juror?/
ng B.2 utt1: Never./
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors./
b A.3 utt1: Uh-huh./
sd A.3 utt2: I never have either./
% B.4 utt1: You haven't, {F huh. }/
... 
... 

I only want to have:

Okay,/
have you ever served as a juror?/
Never./
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors./
Uh-huh./
I never have either./
You haven't, {F huh. }/
... 
... 

Thank you very much in advance.

Answers

2

First, you need to make a corpus reader for your corpus. There are several corpus readers in nltk.corpus that you can use, such as:

AlpinoCorpusReader 
BNCCorpusReader 
BracketParseCorpusReader 
CMUDictCorpusReader 
CategorizedCorpusReader 
CategorizedPlaintextCorpusReader 
CategorizedTaggedCorpusReader 
ChunkedCorpusReader 
ConllChunkCorpusReader 
ConllCorpusReader 
CorpusReader 
DependencyCorpusReader 
EuroparlCorpusReader 
IEERCorpusReader 
IPIPANCorpusReader 
IndianCorpusReader 
MacMorphoCorpusReader 
NPSChatCorpusReader 
NombankCorpusReader 
PPAttachmentCorpusReader 
Pl196xCorpusReader 
PlaintextCorpusReader 
PortugueseCategorizedPlaintextCorpusReader 
PropbankCorpusReader 
RTECorpusReader 
SensevalCorpusReader 
SinicaTreebankCorpusReader 
StringCategoryCorpusReader 
SwadeshCorpusReader 
SwitchboardCorpusReader 
SyntaxCorpusReader 
TaggedCorpusReader 
TimitCorpusReader 
ToolboxCorpusReader 
VerbnetCorpusReader 
WordListCorpusReader 
WordNetCorpusReader 
WordNetICCorpusReader 
XMLCorpusReader 
YCOECorpusReader 

Once you have made a corpus reader for your corpus, like this:

c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes) 

you can get the words out with the following code:

# Flatten paragraphs -> sentences -> words into a single list.
words = [word for para in c.paras() for sentence in para for word in sentence]

This should give you a list of all the words in all the paragraphs of your corpus.
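For the Switchboard lines in the question, a plain regular expression may be enough if you do not need a corpus reader. The sketch below assumes every line has the shape shown in the question (a dialogue-act tag, a speaker and turn such as A.1, then uttN: and the text); the names PREFIX and strip_prefix are made up for this illustration and are not part of NLTK.

```python
import re

# Assumed line shape, taken from the sample in the question:
#   <tag> <speaker>.<turn> utt<N>: <text>
PREFIX = re.compile(r"^\s*\S+\s+[A-Z]\.\d+\s+utt\d+:\s*")


def strip_prefix(line):
    """Drop the leading tag, speaker and utterance marker from one line."""
    return PREFIX.sub("", line.rstrip("\n"))


sample = [
    "o A.1 utt1: Okay,/",
    "qy A.1 utt2: have you ever served as a juror?/",
    "ng B.2 utt1: Never./",
    "% B.4 utt1: You haven't, {F huh. }/",
]
for line in sample:
    print(strip_prefix(line))
```

Lines that do not match the assumed shape are returned unchanged, so file headers or blank lines pass through untouched.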

Hope this helps.

+0

Thank you for all your words, inspectorG4dget. I am testing your code. – Dylan 2011-01-22 10:52:21

1

You can use the .words() method of an NLTK corpus reader:

from nltk.corpus import nps_chat
content = nps_chat.words()

This will give you a list:

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]
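If you want whole posts rather than individual words, and prefer not to depend on NLTK at all, the XML shown in the question can also be read with the standard library. This is only a sketch: it assumes each post's text sits directly inside its Post element, before the terminals child, as in the sample. The names extract_posts and save_posts, and any output path you pass in, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample in the same shape as the XML in the question (shortened).
SAMPLE_XML = """<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
</terminals>
</Post>
</Posts>"""


def extract_posts(xml_text):
    """Return the raw text of every <Post> element, without the tagging."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.iter("Post"):
        # post.text is the text that precedes the first child (<terminals>).
        text = (post.text or "").strip()
        if text:
            posts.append(text)
    return posts


def save_posts(posts, path):
    """Write one post per line to a local file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(posts) + "\n")


print(extract_posts(SAMPLE_XML))
```

For a real file you would read its contents and pass them to extract_posts, then hand the result to save_posts with whatever output path suits you.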
