2016-04-30 332 views

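Problem with Naive Bayes

I am trying to run Naive Bayes in R to make predictions from text data (by building a document-term matrix).

I read several warnings about terms that may be missing from either the training or the test set, so I decided to build everything from a single data frame and split it afterwards. This is the code I am using: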

data <- read.csv(file="path",header=TRUE) 

########## NAIVE BAYES 
library(e1071) 
library(SparseM) 
library(tm) 

# CREATE DATA FRAME AND TRAINING AND 
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27) 
traindata <- as.data.frame(data[13000:13999,c(8,27)]) 
testdata <- as.data.frame(data[14000:14999,c(8,27)]) 
complete <- as.data.frame(data[13000:14999,c(8,27)]) 

# SEPARATE TEXT VECTOR TO CREATE Source(), 
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM 
# MATRIX TAKES Source() 
completevector <- as.vector(complete$Text) 

# CREATE SOURCE FOR VECTORS 
completesource <- VectorSource(completevector) 

# CREATE CORPUS FOR DATA 
completecorpus <- Corpus(completesource) 

# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE 
completecorpus <- tm_map(completecorpus, tolower)
completecorpus <- tm_map(completecorpus, PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords, stopwords("english"))
completecorpus <- tm_map(completecorpus, removePunctuation)
completecorpus <- tm_map(completecorpus, removeNumbers)
completecorpus <- tm_map(completecorpus, stripWhitespace)

# CREATE DOCUMENT TERM MATRIX 
completematrix<-DocumentTermMatrix(completecorpus) 
trainmatrix <- completematrix[1:1000,] 
testmatrix <- completematrix[1001:2000,] 

# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR 
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1) 

# PREDICTION 
results <- predict(model,as.matrix(testmatrix)) 
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual')) 

conf.matrix 

The problem is that I get strange results like this:

         actual
predicted   1   2   3
        1  60 833 107
        2   0   0   0
        3   0   0   0

Any ideas why this is happening?

The original data looks like this:

head(complete) 

     Text 
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick. 
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly 
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping. 
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice. 
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it! 
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well 

     InfoType 
13000  2 
13001  2 
13002  2 
13003  3 
13004  2 
13005  2 

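Hard to debug without data. You are splitting train and test by specific rows, and it is quite possible that those rows do not contain all the classes. You would be better off randomly sampling rows for the train/test split. – Gopala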
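For readers who want to try the random split suggested above, here is a minimal sketch in base R. The label vector is simulated (it is a hypothetical stand-in, not the data from the question), and the names `labels`, `train_idx` and `test_idx` are made up for illustration.

```r
# Hypothetical stand-in for the question's data: 2000 rows with a 3-level label.
set.seed(42)                      # fixed seed so the split is reproducible
n <- 2000
labels <- factor(sample(1:3, n, replace = TRUE, prob = c(0.1, 0.7, 0.2)))

# Draw half of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = n / 2)
test_idx  <- setdiff(seq_len(n), train_idx)

# Check how the classes are distributed on each side of the split.
table(labels[train_idx])
table(labels[test_idx])
```

With the objects from the question, the same indices would presumably be used on both the matrix and the labels, for example `completematrix[train_idx, ]` together with `complete$InfoType[train_idx]`.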

No, that didn't help. I tried splitting the rows randomly and got exactly the same results. – JorgeF

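Just to make sure: does your confusion matrix (predicted vs. actual) mean that all actual items belong to class 1, rather than that it predicts all of them to be class 1? – patrick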
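One way to check the orientation: `table(results, testdata$InfoType, dnn=list('predicted','actual'))` puts its first argument in the rows, so the rows should be the predictions and the columns the actual classes. The sketch below rebuilds the matrix from the counts posted in the question (the object name `conf` is made up) and sums it both ways.

```r
# Counts copied from the confusion matrix in the question.
conf <- matrix(c(60, 833, 107,
                  0,   0,   0,
                  0,   0,   0),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = c("1", "2", "3"),
                               actual    = c("1", "2", "3")))

rowSums(conf)  # number of documents predicted as each class
colSums(conf)  # number of documents that actually belong to each class
```

Read this way, all 1000 test documents were predicted as class 1, while the actual classes are split 60 / 833 / 107.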

Answer

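It looks like the problem was that the document-term matrix was far too sparse and the sparse terms had to be removed. So I added: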

completematrix<-removeSparseTerms(completematrix, 0.95) 

And it started working!

         actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
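To give an idea of what this kind of filtering does, here is a rough base R sketch on a made-up toy matrix. It assumes, as far as I understand `removeSparseTerms`, that a term is kept only if it appears in more than a fraction `1 - sparse` of the documents. With `sparse = 0.95` and the 2000 documents above, that would mean more than 100 documents. The matrix values, the threshold of 0.6 and the names `dtm`, `doc_freq`, `keep` and `dtm_dense` are all hypothetical.

```r
# Toy document-term counts: 6 documents (rows) x 4 terms (columns).
# The values are made up for illustration only.
dtm <- matrix(c(1, 0, 0, 2,
                1, 0, 0, 0,
                2, 1, 0, 0,
                1, 0, 0, 1,
                1, 0, 1, 0,
                3, 0, 0, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(NULL, c("great", "durabl", "salad", "milkshak")))

sparse   <- 0.6                         # hypothetical threshold for this toy example
doc_freq <- colSums(dtm > 0)            # number of documents containing each term
keep     <- doc_freq > nrow(dtm) * (1 - sparse)

dtm_dense <- dtm[, keep, drop = FALSE]  # drop the rarely used terms
colnames(dtm_dense)
```

In this toy case only the terms found in at least 3 of the 6 documents survive. In the code from the question, the filter has to be applied to `completematrix` before it is split into `trainmatrix` and `testmatrix`, so that both keep the same columns.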

Thank you all for your ideas (thanks Chelsey Hill!!)