In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
The longer answer:
I think you're overthinking the cost of using string functions to concatenate strings; it really isn't expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()

start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), file=fout)
end = time.time() - start
print(i, end)
It took 2.955 seconds on my laptop to process all 57,339 sentences from the Brown corpus.
[out]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But using a string to join the word and POS can cause trouble later, when you need to read your tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
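One way to make the read-back more robust is to split each token only at its last '/', so a slash that belongs to the word itself survives. A minimal sketch using `str.rsplit` with `maxsplit=1`:

```python
tagged_sent_str = 'cat/NN //CD dog/NN'

# rsplit with maxsplit=1 splits only at the *last* '/', so any '/'
# that is part of the word itself is left intact.
recovered = [tuple(wordpos.rsplit('/', 1)) for wordpos in tagged_sent_str.split()]
print(recovered)  # [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
```

This still breaks if a token contains whitespace, which is why pickling the tuples is the safer option.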
If you want to save the tagged output and read it back later, it's better to use pickle to save the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
... pickle.dump(tagged_sent, fout)
...
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
... tagged_sent = pickle.load(fin)
...
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
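If a human-readable file is preferred over pickle's binary format, the same round-trip can be sketched with `json` from the standard library (note that JSON has no tuple type, so the inner lists must be converted back to tuples on load):

```python
import json

tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

with open('tagged_sent.json', 'w') as fout:
    json.dump(tagged_sent, fout)

with open('tagged_sent.json') as fin:
    # json.load returns lists, so rebuild the (word, pos) tuples
    loaded = [tuple(pair) for pair in json.load(fin)]

print(loaded)  # [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
```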
Why would string functions be slower than the NLTK method? – Denziloe
Because I don't believe the NLTK (or TextBlob) developers would create a list and then modify it (inefficient!) when they can create the POS tags directly in the text. – user3147590
Using '/' to join the word and pos leads to complications. If your text contains "他/她" ("he/she"), you would get "他/PN ///她/PN", and parsing the string output later will be a mess. – alvas