Python NLTK tagging AssertionError

When using NLTK to process roughly 5000 posts with a PlaintextCorpusReader, I'm running into a strange AssertionError. For some of our datasets we have no major problems, but in isolated cases I hit the following:
File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError
My code works (basically) like this:
import nltk
from nltk.corpus import brown
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')  # our own backoff tagger
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
posts = PlaintextCorpusReader(posts_path, '.*')
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())
It seems like NLTK is losing its place in the file buffer, but I'm not 100% sure about that. Any ideas what could be causing this? It almost looks like it has to be related to the data I'm processing. Maybe some funky characters?
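One way I've been thinking about narrowing this down is to tag one file at a time instead of running `batch_tag` over the whole lazy corpus view, catching the `AssertionError` so the offending document can be inspected. This is only a sketch of the isolation pattern: `tag_file`, `find_bad_files`, and the toy in-memory corpus below are stand-ins, not NLTK APIs; with NLTK you would loop over `posts.fileids()` and call `tagger.batch_tag(posts.sents(fileid))` inside the `try`.

```python
# Sketch: find which file(s) trigger the AssertionError by tagging
# per-file rather than across the whole corpus at once.

def tag_file(fileid, corpus):
    """Stand-in for tagger.batch_tag(posts.sents(fileid))."""
    sents = corpus[fileid]
    # Simulate a failure on malformed input (e.g. a stray NUL byte).
    if any('\x00' in word for sent in sents for word in sent):
        raise AssertionError
    return [[(word, 'NN') for word in sent] for sent in sents]

def find_bad_files(corpus):
    """Collect the fileids whose contents make tagging blow up."""
    bad = []
    for fileid in corpus:
        try:
            tag_file(fileid, corpus)
        except AssertionError:
            bad.append(fileid)
    return bad

# Toy corpus: a dict of fileid -> list of tokenized sentences.
corpus = {
    'good.txt': [['hello', 'world']],
    'bad.txt':  [['broken\x00token']],
}
print(find_bad_files(corpus))  # -> ['bad.txt']
```

If only one or two files show up, diffing them against a file that tags cleanly should reveal whether odd characters are the culprit.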