2017-04-04

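NLTK Naive Bayes classifier training problem

I am trying to train a classifier on tweets. The problem is that it reports 100% accuracy, and the list of most informative features shows nothing. Does anyone know what I am doing wrong? I believe all of my inputs to the classifier are correct, so I cannot tell where it goes wrong.

This is the dataset I am using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

Here is my code: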

import nltk 
import random 

file = open('Train/train.txt', 'r') 


documents = [] 
all_words = []   #TODO remove punctuation? 
INPUT_TWEETS = 3000 

print("Preprocessing...") 
for line in (file): 

    # Tokenize Tweet content 
    tweet_words = nltk.word_tokenize(line[2:]) 

    sentiment = "" 
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words) 

word_features = list(all_words.keys())[:3000] #top 3000 words 

def find_features(document): 
    words = set(document) 
    features = {} 
    for w in word_features:
        features[w] = (w in words)

    return features 

#Categorize as positive or Negative 
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents] 


training_set = feature_set[:1000] 
testing_set = feature_set[1000:] 

print("Training...") 
classifier = nltk.NaiveBayesClassifier.train(training_set) 

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100) 
classifier.show_most_informative_features(15) 

It looks like the problem is comparing `line[0]` with the `int` `0`. I doubt your input actually uses null bytes to mark negative sentiment. – alexis
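The comparison this comment points at can be reproduced in isolation. The following is a minimal sketch, assuming each line of train.txt starts with the label character `0` or `1`; the sample lines and the helper names `label_buggy` and `label_fixed` are invented for illustration.

```python
def label_buggy(line):
    # Mirrors the question: line[0] is a str, so it never equals the int 0.
    return "negative" if line[0] == 0 else "positive"


def label_fixed(line):
    # Compare against the character "0" instead of the integer 0.
    return "negative" if line[0] == "0" else "positive"


# Invented sample lines (assumption: label character first, then the tweet text)
sample_lines = ["0,is so sad today", "1,omg this is great"]

buggy = [label_buggy(line) for line in sample_lines]
fixed = [label_fixed(line) for line in sample_lines]
print(buggy)
print(fixed)
```

If every document ends up labeled "positive", a classifier that always predicts "positive" scores 100%, which would be consistent with the symptoms described in the question.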

Answers

1

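There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

This causes sentiment to always have the same value (namely the value from the last tweet in the preprocessing step), so the training is pointless and all features are irrelevant.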
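The effect of the misspelled loop variable can be seen on toy data. This is only a sketch: the documents are invented and `find_features` here is a simplified stand-in for the one in the question. The point is that `sentiment` inside the comprehension refers to the leftover module-level variable, not to the label of each document.

```python
# Invented toy documents in the same (words, label) shape as the question
documents = [
    (["so", "sad"], "negative"),
    (["love", "it"], "positive"),
    (["bad", "day"], "negative"),
]


def find_features(words):
    # Simplified stand-in for the question's feature extractor
    return {w: True for w in words}


# Leftover value from an earlier preprocessing loop
sentiment = "positive"

# Typo: the loop unpacks into `sentment`, so `sentiment` is never updated
buggy = [(find_features(words), sentiment) for (words, sentment) in documents]

# Fixed: the name that is unpacked is the name that is used
fixed = [(find_features(words), sentiment) for (words, sentiment) in documents]

buggy_labels = [label for _, label in buggy]
fixed_labels = [label for _, label in fixed]
print(buggy_labels)
print(fixed_labels)
```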

Fix it and you will get:

('Naive Bayes Accuracy:', 66.75) 
Most Informative Features 
        -- = True   positi : negati =  6.9 : 1.0 
       these = True   positi : negati =  5.6 : 1.0 
       face = True   positi : negati =  5.6 : 1.0 
       saw = True   positi : negati =  5.6 : 1.0 
        ] = True   positi : negati =  4.4 : 1.0 
       later = True   positi : negati =  4.4 : 1.0 
       love = True   positi : negati =  4.1 : 1.0 
        ta = True   positi : negati =  4.0 : 1.0 
       quite = True   positi : negati =  4.0 : 1.0 
       trying = True   positi : negati =  4.0 : 1.0 
       small = True   positi : negati =  4.0 : 1.0 
       thx = True   positi : negati =  4.0 : 1.0 
       music = True   positi : negati =  4.0 : 1.0 
        p = True   positi : negati =  4.0 : 1.0 
      husband = True   positi : negati =  4.0 : 1.0 

I fixed the typo, but my output did not change: it is still 100% and shows no features.


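Then perhaps your train.txt is corrupted or incomplete? I read the raw data into a DataFrame with `df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8')` and iterated over the rows with `df.iterrows()`. That gave the output pasted above. – acidtobi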
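Note that `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0; newer versions take `on_bad_lines='skip'` instead. As a dependency-free illustration of the same row-by-row reading, here is a sketch that swaps pandas for the standard-library `csv` module. It assumes the CSV has a header row with `Sentiment` (0 or 1) and `SentimentText` columns; the function name `read_labeled_tweets` and the sample rows are invented for illustration.

```python
import csv
import io


def read_labeled_tweets(handle, limit=None):
    """Yield (text, label) pairs from a CSV file object.

    Assumes a header row with 'Sentiment' (0 or 1) and 'SentimentText' columns.
    """
    reader = csv.DictReader(handle)
    for count, row in enumerate(reader):
        if limit is not None and count >= limit:
            break
        # The label is read as text, so compare against the string "0"
        label = "negative" if row["Sentiment"].strip() == "0" else "positive"
        yield row["SentimentText"].strip(), label


# Tiny in-memory stand-in for the real file (rows invented for illustration)
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,is so sad\n"
    '2,1,Sentiment140,"love it, really"\n'
)
pairs = list(read_labeled_tweets(sample))
print(pairs)
```

With the real file, something like `open('Sentiment Analysis Dataset.csv', encoding='utf-8', newline='')` would take the place of the `io.StringIO` object, and each text could then be passed to `nltk.word_tokenize` as in the question.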


Could you show me the complete code for reading the .csv?