In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
The longer answer:
I think you're overthinking the cost of using string functions to concatenate strings; it really isn't expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()

start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), file=fout)
end = time.time() - start
print(i, end)
It took 2.955 seconds on my laptop to process all 57,339 sentences from the Brown corpus.
[out]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But using a string to join the word and POS can cause trouble later, when you need to read your tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
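One way to make the read-back more robust is to split each token only at its last '/', so a slash that belongs to the word itself survives. A minimal sketch using `str.rsplit` with `maxsplit=1`:

```python
tagged_sent_str = 'cat/NN //CD dog/NN'

# rsplit with maxsplit=1 splits only at the *last* '/', so any '/'
# that is part of the word itself is left intact.
recovered = [tuple(wordpos.rsplit('/', 1)) for wordpos in tagged_sent_str.split()]
print(recovered)  # [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
```

This still breaks if a token contains whitespace, which is why pickling the tuples is the safer option.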
If you want to save the tagged output and read it back later, it's better to use pickle to save the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
... pickle.dump(tagged_sent, fout)
...
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
... tagged_sent = pickle.load(fin)
...
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
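If a human-readable file is preferred over pickle's binary format, the same round-trip can be sketched with `json` from the standard library (note that JSON has no tuple type, so the inner lists must be converted back to tuples on load):

```python
import json

tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

with open('tagged_sent.json', 'w') as fout:
    json.dump(tagged_sent, fout)

with open('tagged_sent.json') as fin:
    # json.load returns lists, so rebuild the (word, pos) tuples
    loaded = [tuple(pair) for pair in json.load(fin)]

print(loaded)  # [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
```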
Why would string functions be slower than the NLTK method? – Denziloe
Because I don't believe the NLTK (or TextBlob) developers would create a list and then modify it (inefficient!) when they can create the POS tags directly in the text. – user3147590
Using '/' to join the word and pos leads to complications. If your text contains "他/她" ("he/she"), you would get "他/PN ///她/PN", and parsing the string output later will be a mess. – alvas