Using my own corpus for category classification in Python NLTK

I'm an NLTK/Python beginner and have managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train a classifier on the data and use it for text classification?

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader 
>>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt') 
>>> len(reader.categories()) 
234


2012-01-11 jonasl

See http://stackoverflow.com/questions/29275614 (using my own corpus instead of the movie reviews corpus for classification in NLTK) – alvas 2015-03-26 14:51:27

Assuming you want a naive Bayes classifier with bag-of-words features:

from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

def make_training_data(rdr):
    # Yield one (word-count features, category) pair per file.
    for c in rdr.categories():
        for f in rdr.fileids(c):
            yield FreqDist(rdr.words(fileids=[f])), c

# train() expects a list of (featureset, label) pairs.
clf = NaiveBayesClassifier.train(list(make_training_data(reader)))

The resulting clf's classify method can then be called on any FreqDist of words.
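As a minimal sketch of that last step, the snippet below trains on a few in-memory toy documents instead of a corpus reader and then classifies new text. The documents, the labels, and the `bag_of_words` helper are invented for illustration, and whitespace splitting stands in for real tokenization.

```python
from nltk import FreqDist
from nltk.classify.naivebayes import NaiveBayesClassifier

def bag_of_words(text):
    """Hypothetical helper: lowercase, split on whitespace, count the words."""
    return FreqDist(text.lower().split())

# Toy training data (invented for this sketch), in the same shape that
# make_training_data yields: (FreqDist, category) pairs.
train = [
    (bag_of_words("the team won the match"), "sports"),
    (bag_of_words("the team scored a goal"), "sports"),
    (bag_of_words("the code has a bug"), "tech"),
    (bag_of_words("the compiler rejects the code"), "tech"),
]
toy_clf = NaiveBayesClassifier.train(train)

# Classify unseen text by building its FreqDist the same way.
sports_guess = toy_clf.classify(bag_of_words("team goal"))
tech_guess = toy_clf.classify(bag_of_words("code bug"))
print(sports_guess, tech_guess)
```

With the real corpus the call would have the same shape: build a FreqDist from the words of the new document, tokenized the same way as the training files, and pass it to clf.classify.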

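(But note: judging from your cat_pattern, your corpus seems to have one sample and one category per file. Please check whether that is really what you want.)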
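To see why that pattern gives one category per file, here is a small sketch using only the `re` module. It assumes, as the NLTK documentation describes, that cat_pattern is applied to each file identifier and the first matching group becomes the category. The file names and the `group_by_category` helper are hypothetical, and the directory-based pattern is just one possible alternative layout.

```python
import re
from collections import defaultdict

def group_by_category(fileids, cat_pattern):
    """Map category -> fileids, using group 1 of cat_pattern as the category."""
    groups = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:
            groups[match.group(1)].append(fileid)
    return dict(groups)

# Hypothetical file names, invented for this sketch.
flat = ["spam1.txt", "spam2.txt", "ham1.txt"]
nested = ["spam/1.txt", "spam/2.txt", "ham/1.txt"]

# The pattern from the question: the whole base name is the category,
# so every file ends up in a category of its own.
per_file = group_by_category(flat, r'(.*)\.txt')

# One possible alternative: take the directory name as the category,
# so several files can share a category.
per_dir = group_by_category(nested, r'(\w+)/.*\.txt')

print(per_file)
print(per_dir)
```

If each category should contain several training samples, a layout along the lines of the second pattern is probably closer to what you want.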


2012-01-11 11:40:49
