TaggedCorpusReader and UnigramTagger in nltk (python)

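I am trying to use nltk to automatically categorize news articles in a very lo-fi way. I created a custom corpus of word/tag pairs related to my categories (e.g. teacher/EDU, computer/TECH, etc.). I have been reading up on this, and this question got me very close, but I am still stuck.

Given my code so far, how do I use my tagger to tag my sentence?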

import nltk 

# Loads my custom word/tag corpus 
from nltk.corpus.reader import TaggedCorpusReader 
reader = TaggedCorpusReader('taggers','.*') 

#Sets up the UnigramTagger 
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER) 
tagger = nltk.tag.UnigramTagger(model=reader.tagged_words(), backoff=default_tagger) 

#Sample content 
sent = 'The students went to school to ask their teacher what the homework for the day was but she told them to check their email.' 
tokens = nltk.tokenize.word_tokenize(sent) 

# Sad Panda 
tagged = tagger.tag(tokens) 
#^produces AttributeError: 'ConcatenatedCorpusView' object has no attribute 'get'

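It is also quite possible that this is not a good way to go about what I want to do, but it seemed good enough for a first pass. Thanks in advance.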
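A note on the AttributeError above: the `model` argument of `UnigramTagger` appears to be used as a mapping from word to tag (the tagger calls `.get` on it), whereas `reader.tagged_words()` returns a corpus view of `(word, tag)` pairs. A minimal sketch of one possible fix follows, assuming the pairs can simply be converted with `dict(...)`. The `pairs` list is a hypothetical stand-in for the custom corpus, and `DefaultTagger("UNK")` is an illustrative backoff chosen so the example runs without downloading any tagger models.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical stand-in for reader.tagged_words(): a sequence of (word, tag) pairs.
pairs = [
    ("teacher", "EDU"),
    ("school", "EDU"),
    ("homework", "EDU"),
    ("email", "TECH"),
]

# The model is looked up like a dictionary, so convert the pairs to a dict first.
# DefaultTagger("UNK") is an assumed backoff that labels every unknown word "UNK".
tagger = UnigramTagger(model=dict(pairs), backoff=DefaultTagger("UNK"))

# Plain whitespace splitting keeps the sketch free of tokenizer data downloads.
tokens = "The students asked their teacher about the homework".split()
tagged = tagger.tag(tokens)
print(tagged)
```

In the code from the question, the equivalent change would presumably be `model=dict(reader.tagged_words())`. Note that this only makes the tagger run; it still labels individual words rather than classifying whole articles.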

2011-12-28 Eric Arenson

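Taggers are for part-of-speech tagging, not text classification. Take a look at the Reuters corpus: it uses a category file to sort news articles into multiple categories. Then look at the nltk.classify module and read up on how to train a text classifier.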
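As a rough illustration of the nltk.classify route, here is a minimal sketch that trains a `NaiveBayesClassifier` on bag-of-words features. The training sentences, the `EDU`/`TECH` labels, and the `bag_of_words` helper are all hypothetical and only for illustration; a real setup would train on many labelled articles, for example from a categorized corpus such as Reuters.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Map each lowercase whitespace-separated token to True (presence features)."""
    return {word: True for word in text.lower().split()}


# Hypothetical toy training data: (document text, category label).
train_docs = [
    ("the teacher gave the students homework at school", "EDU"),
    ("students study for the exam in the classroom", "EDU"),
    ("the new laptop has a faster processor and more memory", "TECH"),
    ("developers released a software update for the server", "TECH"),
]

# NLTK classifiers train on a list of (feature dict, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in train_docs]
classifier = NaiveBayesClassifier.train(train_set)

# Classify an unseen sentence using the same feature extraction.
label = classifier.classify(bag_of_words("the teacher asked students about homework"))
print(label)
```

The key difference from the tagger approach is that the classifier assigns one label per document from all of its words together, instead of one tag per word.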

2011-12-28 19:05:45 Jacob

Thanks Jacob, you pointed me in the right direction. Knowing the terminology is the key to finding the right path. Thanks! – 2011-12-29 16:14:21
