NLTK Naive Bayes classifier training issue

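I am trying to train a classifier on tweets. The problem is that it reports 100% accuracy, and the list of most informative features shows nothing. Does anyone know what I am doing wrong? I believe all of my inputs to the classifier are correct, so I cannot tell where it goes wrong.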

This is the dataset I am using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

This is my code:

import nltk 
import random 

file = open('Train/train.txt', 'r') 


documents = [] 
all_words = []   #TODO remove punctuation? 
INPUT_TWEETS = 3000 

print("Preprocessing...") 
for line in (file): 

    # Tokenize Tweet content 
    tweet_words = nltk.word_tokenize(line[2:]) 

    sentiment = "" 
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment)) 

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1 
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words) 

word_features = list(all_words.keys())[:3000] #top 3000 words 

def find_features(document): 
    words = set(document) 
    features = {} 
    for w in word_features:
        features[w] = (w in words)

    return features 

#Categorize as positive or Negative 
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents] 


training_set = feature_set[:1000] 
testing_set = feature_set[1000:] 

print("Training...") 
classifier = nltk.NaiveBayesClassifier.train(training_set) 

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100) 
classifier.show_most_informative_features(15)

Source

2017-04-04 Daniel Medina

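It looks like the problem is in `line[0] == 0`, which compares against the `int` 0. I doubt your input actually uses null bytes to represent negative sentiment. – alexis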
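The comparison this comment points at can be checked in isolation. The sketch below assumes each input line starts with the character `0` or `1` followed by a separator, as the `line[2:]` slice in the question suggests. `label_for` is a hypothetical helper and the sample lines are made up; neither comes from the original code.

```python
def label_for(line):
    """Map the leading label character of a line to a sentiment string."""
    # line[0] is a one-character string, so compare against "0", not the int 0.
    return "negative" if line[0] == "0" else "positive"


# Made-up sample lines in the assumed "<label>,<text>" layout.
sample_lines = ["0,made-up negative tweet", "1,made-up positive tweet"]
labels = [label_for(line) for line in sample_lines]

# A str never compares equal to an int in Python, so the original test
# is False for every line and every tweet ends up labelled "positive".
always_false = [line[0] == 0 for line in sample_lines]

print(labels)
print(always_false)
```

If the file really is laid out this way, comparing against the string `"0"` is enough to get both labels into `documents`.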

There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

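This causes `sentiment` to always have the same value (namely the value of the last tweet from the preprocessing step), so training is pointless and all features are irrelevant.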
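The effect of the misspelled loop variable can be reproduced without NLTK. This is a minimal sketch with made-up documents; the module-level `sentiment` stands in for the variable left over from the preprocessing loop in the question.

```python
# Made-up documents: (tokens, label) pairs with two different labels.
documents = [(["good", "day"], "positive"), (["bad", "day"], "negative")]

# Stands in for the variable left over from the preprocessing loop.
sentiment = "positive"

# Buggy: the loop binds `sentment`, so `sentiment` refers to the stale outer value.
buggy = [(words, sentiment) for (words, sentment) in documents]

# Fixed: the loop variable and the name used in the tuple match.
fixed = [(words, label) for (words, label) in documents]

print([label for _, label in buggy])
print([label for _, label in fixed])
```

With the buggy version every example carries the same label, so a classifier can score 100% by always predicting that label, and no feature separates one class from another.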

Fix it and you will get:

('Naive Bayes Accuracy:', 66.75) 
Most Informative Features 
        -- = True   positi : negati =  6.9 : 1.0 
       these = True   positi : negati =  5.6 : 1.0 
       face = True   positi : negati =  5.6 : 1.0 
       saw = True   positi : negati =  5.6 : 1.0 
        ] = True   positi : negati =  4.4 : 1.0 
       later = True   positi : negati =  4.4 : 1.0 
       love = True   positi : negati =  4.1 : 1.0 
        ta = True   positi : negati =  4.0 : 1.0 
       quite = True   positi : negati =  4.0 : 1.0 
       trying = True   positi : negati =  4.0 : 1.0 
       small = True   positi : negati =  4.0 : 1.0 
       thx = True   positi : negati =  4.0 : 1.0 
       music = True   positi : negati =  4.0 : 1.0 
        p = True   positi : negati =  4.0 : 1.0 
      husband = True   positi : negati =  4.0 : 1.0

Source

2017-04-04 20:30:29 acidtobi

I fixed the typo, but my output did not change: it is still 100% and no features are shown –

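Then maybe your train.txt is corrupted or incomplete? I read the raw data into a DataFrame using `df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8')` and iterated over the rows with `df.iterrows()`. That gave the output pasted above. – acidtobi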
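The reading step described in this comment might look roughly like the sketch below. It is not acidtobi's script: it uses the standard-library `csv` module instead of pandas so it runs without extra dependencies, it assumes the CSV has `Sentiment` and `SentimentText` columns (check the header of the real file), and `load_documents` is a hypothetical helper. The in-memory sample is made up and stands in for the file. Also note that recent pandas versions replace `error_bad_lines=False` with `on_bad_lines='skip'`.

```python
import csv
import io

# Made-up stand-in for the real file; the column names are an assumption.
sample_csv = io.StringIO(
    "ItemID,Sentiment,SentimentText\n"
    "1,0,made-up negative tweet\n"
    "2,1,made-up positive tweet\n"
    "3,1\n"  # malformed row: the text column is missing
)


def load_documents(handle, limit=3000):
    """Return (tokens, label) pairs, skipping rows with missing fields."""
    documents = []
    for row in csv.DictReader(handle):
        text, raw_label = row.get("SentimentText"), row.get("Sentiment")
        if text is None or raw_label not in ("0", "1"):
            continue  # skip malformed rows instead of failing
        label = "negative" if raw_label == "0" else "positive"
        # str.split is a simple stand-in for nltk.word_tokenize.
        documents.append((text.lower().split(), label))
        if len(documents) >= limit:
            break
    return documents


documents = load_documents(sample_csv)
print(documents)
```

For the real dataset, the same function could be called on `open('Sentiment Analysis Dataset.csv', encoding='utf-8', newline='')` in place of the sample.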

Can you show me the entire code for reading the .csv? –
