Cross-validation error

I am trying to estimate the accuracy of a Naive Bayes classifier on the NLTK movie review corpus.

from nltk.corpus import movie_reviews 
import random 
import nltk 
from sklearn import cross_validation 
from nltk.corpus import stopwords 
import string 
from nltk.classify import apply_features 

def document_features(document): 
    document_words = set(document) 
    features = {} 
    for word in unigrams: 
        features['contains({})'.format(word)] = (word in document_words)
    return features 

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents) 
stop = stopwords.words('english') 
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words() if w.lower() not in stop and w.lower() not in string.punctuation) 
unigrams = list(all_words)[:200] 
featuresets = [(document_features(d), c) for (d,c) in documents]

I am trying to run 10-fold cross-validation, based on an example I took from sklearn.

training_set = nltk.classify.apply_features(featuresets, documents) 
cv = cross_validation.KFold(len(training_set), n_folds=10, shuffle=True, random_state=None) 

for traincv, testcv in cv: 
    classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]]) 
    result = nltk.classify.util.accuracy(classifier, training_set[testcv[0]:testcv[len(testcv)-1]]) 
    print 'Accuracy:', result

But on the line

classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]])

I get the error: 'list' object is not callable

Any idea what I am doing wrong?


2016-03-14 student

The actual error lies in this line:

training_set = nltk.classify.apply_features(featuresets, documents)

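featuresets is a list, and that is what Python is complaining about: apply_features expects a callable as its first argument. Because the object it returns is lazy, the bad call only surfaces later, when items are read from training_set, which would explain why the traceback points at the train(...) line.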

From the nltk.classify.apply_features documentation:

apply_features(feature_func, toks, labeled=None)

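Use the LazyMap class to construct a lazy list-like object that is analogous to map(feature_func, toks). In particular, if labeled=False, then the returned list-like object's values are equal to:
[feature_func(tok) for tok in toks]
If labeled=True, then the returned list-like object's values are equal to:
[(feature_func(tok), label) for (tok, label) in toks]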

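Behaving much like map, the function expects a function (the feature extractor) as its first argument, which is applied to every element (document) of the list passed as the second argument. It returns a LazyMap, which applies the function on demand to save memory.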
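To illustrate why a list in the first position only fails once an item is read, here is a minimal sketch of the lazy behaviour. LazyFeatureList and contains_good are hypothetical names made up for this illustration; this is a toy stand-in, not NLTK's LazyMap implementation, and it only assumes that the real object defers calling the feature function until an element is accessed.

```python
class LazyFeatureList:
    """Toy stand-in for the lazy list that apply_features returns (not NLTK code)."""

    def __init__(self, feature_func, toks, labeled=False):
        # Nothing is computed here; the arguments are only stored.
        self._func = feature_func
        self._toks = toks
        self._labeled = labeled

    def __len__(self):
        return len(self._toks)

    def __getitem__(self, index):
        # The feature function is called only when an item is requested.
        if self._labeled:
            tok, label = self._toks[index]
            return (self._func(tok), label)
        return self._func(self._toks[index])


def contains_good(words):
    """Hypothetical one-feature extractor, standing in for document_features."""
    return {"contains(good)": "good" in set(words)}


documents = [(["a", "good", "film"], "pos"), (["a", "dull", "film"], "neg")]

# Correct usage: the first argument is the feature extraction function.
lazy = LazyFeatureList(contains_good, documents, labeled=True)
print(lazy[0])  # ({'contains(good)': True}, 'pos')

# The mistake from the question: precomputed feature sets (a list)
# are passed where the function belongs.
broken = LazyFeatureList([({"contains(good)": True}, "pos")], documents, labeled=True)
print(len(broken))  # construction and len() still succeed

error_message = None
try:
    broken[0]  # the list is "called" only now
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'list' object is not callable
```

In this sketch, building the object never touches the first argument, so the mistake stays hidden until the first element is read.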

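However, you passed a list of feature sets to apply_features instead of the feature extraction function. So there are two possible ways to make things work the way you want: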

Drop training_set and use featuresets instead.
Drop featuresets and use training_set = nltk.classify.apply_features(document_features, documents, True) (note the third argument).

I recommend the second option, because it does not build the feature lists for all documents in memory.
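A related detail in the question's loop that is separate from the error: training_set[traincv[0]:traincv[len(traincv)-1]] takes one contiguous slice from the first training index up to (but not including) the last one, which is not the same as selecting exactly the indices of a fold. Below is a minimal sketch of index-based selection. k_fold_indices and select are hypothetical stdlib helpers written for this illustration, standing in for sklearn's KFold, and the data is a placeholder for the labeled feature sets.

```python
import random


def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV.

    Hypothetical stdlib helper, standing in for a library k-fold splitter.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of near-equal size.
    folds = [sorted(indices[i::n_folds]) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = sorted(j for k, fold in enumerate(folds) if k != i for j in fold)
        yield train_idx, test_idx


def select(items, idx):
    """Pick exactly the listed positions instead of slicing first:last."""
    return [items[i] for i in idx]


# Placeholder data; in the question this would be the labeled feature sets.
training_set = [({"contains(good)": i % 2 == 0}, "pos" if i % 2 == 0 else "neg")
                for i in range(20)]

for train_idx, test_idx in k_fold_indices(len(training_set), n_folds=10):
    train_fold = select(training_set, train_idx)
    test_fold = select(training_set, test_idx)
    # With NLTK installed, each fold would be used roughly like this:
    #   classifier = nltk.NaiveBayesClassifier.train(train_fold)
    #   accuracy = nltk.classify.util.accuracy(classifier, test_fold)
    print(len(train_fold), len(test_fold))
```

select only relies on integer indexing, so it should work with a plain list such as featuresets; whether it behaves the same with the lazy object from apply_features is an assumption worth checking against your NLTK version.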


2016-03-14 15:54:53 Callidior

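Thank you very much for your answer! If I remove 'training_set = nltk.classify.apply_features(featuresets, documents)' and write 'training_set = featuresets', it works. Would it be possible to use all document features, rather than just the 200 unigrams? With 'training_set = nltk.classify.apply_features(featuresets, documents, True)' I get the same error. – student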

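Sorry: the second way has to be 'training_set = nltk.classify.apply_features(document_features, documents, True)' (using the feature extraction function rather than the actual results). However, you will always be using the top 200 unigrams, because you have defined them as boolean features. If you used every word that occurs in any document, the feature space would be very large. – Callidior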
