Storing a POS-tagged corpus

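I am using NLTK to POS-tag the German Wikipedia. The resulting structure is quite simple: one big list that contains each sentence as a list of (word, POS tag) tuples, for example: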

[[(Word1,POS),(Word2,POS),...],[(Word1,POS),(Word2,POS),...],...]

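Since Wikipedia is large, I obviously cannot keep the whole list in memory, so I need a way to save parts of it to disk. What would be a good way to do this, so that I can later easily iterate over all sentences and words from disk?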

2014-09-30 Thagor

Use pickle; see https://wiki.python.org/moin/UsingPickle:

# Python 2 syntax (cPickle, print statement).
import io
import cPickle as pickle

from nltk import pos_tag
from nltk.corpus import brown

print brown.sents()
print

# Let's tag the first 10 sentences.
tagged_corpus = [pos_tag(i) for i in brown.sents()[:10]]

# Serialize the tagged sentences to disk.
with io.open('brown.pos', 'wb') as fout:
    pickle.dump(tagged_corpus, fout)

# Load them back from disk.
with io.open('brown.pos', 'rb') as fin:
    loaded_corpus = pickle.load(fin)

# Print only the first tagged sentence.
for sent in loaded_corpus:
    print sent
    break

[out]:

[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...] 

[(u'The', 'DT'), (u'Fulton', 'NNP'), (u'County', 'NNP'), (u'Grand', 'NNP'), (u'Jury', 'NNP'), (u'said', 'VBD'), (u'Friday', 'NNP'), (u'an', 'DT'), (u'investigation', 'NN'), (u'of', 'IN'), (u"Atlanta's", 'JJ'), (u'recent', 'JJ'), (u'primary', 'JJ'), (u'election', 'NN'), (u'produced', 'VBN'), (u'``', '``'), (u'no', 'DT'), (u'evidence', 'NN'), (u"''", "''"), (u'that', 'WDT'), (u'any', 'DT'), (u'irregularities', 'NNS'), (u'took', 'VBD'), (u'place', 'NN'), (u'.', '.')]

2014-09-30 15:05:04 alvas

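Is there a way to append new data to the pickle object without loading it completely into memory? Maybe I am wrong, but it seems I would still need the whole corpus (roughly 9-10 GB) in memory before I could dump it properly. – Thagor 2014-09-30 17:58:27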

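Actually, I would suggest multiple pickles if you load them lazily, but the best solution is still to parse the POS-tagged corpus from a text file and process it as you parse. Wouldn't that be more portable? Imagine a Java user wants to use your corpus one day, a Ruby user another day, and a user of Go or some other new programming language the day after. – alvas 2014-10-01 01:12:02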
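The "multiple pickles, loaded lazily" idea from the comment above could look roughly like the following. This is only a sketch under a few assumptions: it targets Python 3, the helper names append_batch and iter_sentences and the file name corpus.pkl are made up for illustration, and the small German batches stand in for real tagger output. Each batch is dumped as its own pickle record at the end of the same file, so only one batch has to be in memory at a time, both when writing and when reading.

```python
import os
import pickle
import tempfile


def append_batch(path, tagged_sents):
    """Append one batch of tagged sentences as its own pickle record."""
    with open(path, "ab") as fout:
        pickle.dump(tagged_sents, fout, protocol=pickle.HIGHEST_PROTOCOL)


def iter_sentences(path):
    """Yield tagged sentences one by one, unpickling a single batch at a time."""
    with open(path, "rb") as fin:
        while True:
            try:
                batch = pickle.load(fin)
            except EOFError:
                # No more pickle records in the file.
                break
            for sent in batch:
                yield sent


# Hypothetical demo data standing in for the output of a POS tagger.
batches = [
    [[("Das", "ART"), ("Haus", "NN")], [("Es", "PPER"), ("regnet", "VVFIN")]],
    [[("Gut", "ADJD")]],
]

path = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
for batch in batches:
    append_batch(path, batch)

sentences = list(iter_sentences(path))
print(len(sentences))
```

The batch size is the knob here: larger batches mean fewer pickle records but more memory per step. The file remains readable only from Python, which is the portability concern raised in the comment.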

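The proper thing to do is to save the tagged corpus in the format that NLTK's TaggedCorpusReader expects: use a slash / to combine each word with its tag, and write the tokens separated by whitespace. That is, you end up with Word1/POS word2/POS word3/POS ....

For some reason, NLTK does not provide a function that does this. There is a function for combining a single word and its tag, but it is not worth the trouble of looking it up, since it is easy enough to do the whole thing directly: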

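# outfile: a file object opened for writing text;
# tagged_sentences: an iterable of [(word, tag), ...] lists.
for tagged_sent in tagged_sentences:
    text = " ".join(w + "/" + t for w, t in tagged_sent)
    outfile.write(text + "\n")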

That's it. Later, you can use TaggedCorpusReader to read your corpus and iterate over it in the usual ways NLTK provides (by tagged or untagged word, by tagged or untagged sentence).
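As a rough illustration of the round trip, the sketch below writes two sentences in the word/TAG format and reads them back with TaggedCorpusReader. It assumes Python 3 with NLTK installed; the temporary directory, the file name part1.pos, and the sample sentences are placeholders invented for the example, not part of the original setup.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample data standing in for the tagged Wikipedia sentences.
tagged_sentences = [
    [("Das", "ART"), ("Haus", "NN"), ("ist", "VAFIN"), ("alt", "ADJD")],
    [("Es", "PPER"), ("regnet", "VVFIN")],
]

# Write one sentence per line, tokens as word/TAG separated by spaces.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "part1.pos"), "w", encoding="utf8") as outfile:
    for tagged_sent in tagged_sentences:
        outfile.write(" ".join(w + "/" + t for w, t in tagged_sent) + "\n")

# Pick up every .pos file in the directory; more files can be added later.
reader = TaggedCorpusReader(corpus_dir, r".*\.pos", sep="/")

# Iterate sentence by sentence, with tags...
for sent in reader.tagged_sents():
    print(sent)

# ...or word by word, without tags.
first_words = list(reader.words())[:2]
print(first_words)
```

Because the reader matches files by pattern, the corpus can be written in many smaller files as tagging progresses, which sidesteps the need to hold everything in memory at once. Words that themselves contain whitespace would not survive this format unchanged.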

2016-02-25 20:19:12 alexis
