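NLTK comes with some sample corpora: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
How can I extract the words from the sample corpora that ship with NLTK?
I want only the plain text, without the markup. I don't know how to extract it. What I want to extract is:
1) nps_chat: the file names are the same as the unzipped file names, e.g. 10-19-20s_706posts.xml. These files are in XML format, like: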
<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
<t pos="IN" word="with"/>
<t pos="DT" word="this"/>
<t pos="JJ" word="gay"/>
<t pos="NN" word="name"/>
</terminals>
</Post>
...
...
I only want the actual post:
now im left with this gay name
How can I do this in NLTK (or by some other means) and save the bare posts, stripped of the markup, to local disk?
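One possible starting point for the nps_chat case, as a minimal sketch rather than a definitive solution: in the sample above the post text sits directly inside the `<Post>` element, before `<terminals>`, so a standard XML parser exposes it as the element's `.text`. The names `SAMPLE_XML`, `extract_posts`, and `save_posts` are hypothetical, and the sample string is the snippet above with a closing `</Posts>` added so that it parses.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, closed with </Posts> so it is well-formed.
SAMPLE_XML = """<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
<t pos="IN" word="with"/>
<t pos="DT" word="this"/>
<t pos="JJ" word="gay"/>
<t pos="NN" word="name"/>
</terminals>
</Post>
</Posts>"""


def extract_posts(xml_text):
    """Return the raw text of every <Post> element, without the markup."""
    root = ET.fromstring(xml_text)
    # .text is whatever precedes the first child element (<terminals>).
    return [(post.text or "").strip() for post in root.iter("Post")]


def save_posts(posts, path):
    """Write one post per line to a local file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(posts) + "\n")


posts = extract_posts(SAMPLE_XML)
print(posts)
```

For a real file, the text could be read from disk and passed to `extract_posts`, then written out with `save_posts`. If the corpus is installed through NLTK, `nltk.corpus.nps_chat.xml_posts()` returns the same kind of elements, so `post.text` should work there as well.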
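2) Switchboard transcripts. Files of this type (the file name after unzipping is discourse) have the following format. I want to strip the leading markers: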
o A.1 utt1: Okay,/
qy A.1 utt2: have you ever served as a juror?/
ng B.2 utt1: Never./
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors./
b A.3 utt1: Uh-huh./
sd A.3 utt2: I never have either./
% B.4 utt1: You haven't, {F huh. }/
...
...
I only want to have:
Okay,/
have you ever served as a juror?/
Never./
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors./
Uh-huh./
I never have either./
You haven't, {F huh. }/
...
...
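For the Switchboard case, one way to sketch the stripping is a regular expression. This assumes every utterance line has the shape shown above: a tag with no spaces, a speaker turn such as `A.1`, an `uttN:` label, and then the text. The names `UTTERANCE`, `SAMPLE`, and `strip_markers` are hypothetical; lines that do not match the pattern are skipped, which may or may not be what is wanted for the real files.

```python
import re

# tag, speaker.turn, uttN:, then the utterance text (captured).
UTTERANCE = re.compile(r"^\S+\s+[AB]\.\d+\s+utt\d+:\s*(.*)$")

# The lines from the question, used here as test input.
SAMPLE = """o A.1 utt1: Okay,/
qy A.1 utt2: have you ever served as a juror?/
ng B.2 utt1: Never./
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors./
b A.3 utt1: Uh-huh./
sd A.3 utt2: I never have either./
% B.4 utt1: You haven't, {F huh. }/"""


def strip_markers(lines):
    """Return only the utterance text of each line that matches the pattern."""
    texts = []
    for line in lines:
        match = UTTERANCE.match(line.strip())
        if match:
            texts.append(match.group(1))
    return texts


utterances = strip_markers(SAMPLE.splitlines())
for text in utterances:
    print(text)
```

The same function could be applied to a file by passing the open file object instead of `SAMPLE.splitlines()`, since it only needs an iterable of lines.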
Thank you very much in advance.
Thank you, inspectorG4dget, for all your help. I am testing your code. – Dylan 2011-01-22 10:52:21