I am trying to estimate the accuracy of Naive Bayes classification on the nltk movie reviews corpus, but I get a cross-validation error.
from nltk.corpus import movie_reviews
import random
import nltk
from sklearn import cross_validation
from nltk.corpus import stopwords
import string
from nltk.classify import apply_features
def document_features(document):
    document_words = set(document)
    features = {}
    for word in unigrams:
        features['contains({})'.format(word)] = (word in document_words)
    return features

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
stop = stopwords.words('english')
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words() if w.lower() not in stop and w.lower() not in string.punctuation)
unigrams = list(all_words)[:200]
featuresets = [(document_features(d), c) for (d,c) in documents]
I am trying to perform 10-fold cross-validation, for which I adapted an example from sklearn.
training_set = nltk.classify.apply_features(featuresets, documents)
cv = cross_validation.KFold(len(training_set), n_folds=10, shuffle=True, random_state=None)
for traincv, testcv in cv:
    classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]])
    result = nltk.classify.util.accuracy(classifier, training_set[testcv[0]:testcv[len(testcv)-1]])
    print 'Accuracy:', result
But on the line

classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]])

I get the error "'list' object is not callable".

Any idea what I am doing wrong?
Thank you very much for your answer! If I remove 'training_set = nltk.classify.apply_features(featuresets, documents)' and instead write 'training_set = featuresets', it works. Do I then use all document features rather than only the 200 unigrams? With 'training_set = nltk.classify.apply_features(featuresets, documents, True)' I get the same error. – student
Sorry, the second variant has to be: 'training_set = nltk.classify.apply_features(document_features, documents, True)' (pass the feature extraction function rather than the precomputed result). However, you will always use only the top 200 unigrams either way, because you have defined the features as booleans over exactly those words. If you instead used every word that occurs in any document, the feature space would be very large. – Callidior
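To make the fix concrete, here is a minimal self-contained sketch. It uses a small hypothetical toy dataset in place of the movie_reviews corpus (so it runs without downloading anything) and a single hand-written train/test split standing in for one KFold fold. It shows the two corrections: pass the feature extraction *function* to apply_features, and select fold items by index rather than slicing from traincv[0] to traincv[-1], since fold indices need not be contiguous when shuffle=True.

```python
import nltk
from nltk.classify import apply_features

# Hypothetical toy data standing in for the labelled movie_reviews documents:
# a list of (tokenized document, category) pairs.
documents = [(['great', 'film'], 'pos'), (['boring', 'plot'], 'neg'),
             (['wonderful', 'story'], 'pos'), (['awful', 'acting'], 'neg')] * 5

unigrams = ['great', 'boring', 'wonderful', 'awful']

def document_features(document):
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in unigrams}

# Correction 1: apply_features takes the feature extraction function.
# Passing the precomputed featuresets list makes apply_features try to
# call the list, which raises "'list' object is not callable".
training_set = apply_features(document_features, documents)

# Correction 2: fold indices are positions into the dataset, so index
# item by item instead of slicing between the first and last index.
# (Here one fixed fold stands in for an iteration of the KFold loop.)
traincv = list(range(0, 16))
testcv = list(range(16, 20))
classifier = nltk.NaiveBayesClassifier.train(
    [training_set[i] for i in traincv])
accuracy = nltk.classify.util.accuracy(
    classifier, [training_set[i] for i in testcv])
print('Accuracy:', accuracy)
```

On this separable toy data the held-out fold is classified perfectly; with the real corpus the accuracy will of course vary per fold.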