2015-10-05 54 views
1

How to apply a Naive Bayes model to new data

I asked a question this morning, but I deleted it and am posting it again here with better wording.

I created my first machine learning model using training and test data. I produced a confusion matrix and looked at some summary statistics.

I now want to apply the model to new data to make predictions, but I don't know how.

Context: predicting monthly "churn" (cancellations). The target variable is churned, which has two possible labels: "churned" and "not churned".

head(tdata) 
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned

Here is my training and testing code:

library("klaR") 
library("caret") 

# import data 
test_data_imp <- read.csv("tdata.csv") 

# subset only required vars 
# had to remove "revenue" since all churned records are 0 (need last price point) 
variables <- c("months_subscription", "nvk_medium", "org_type", "churned") 
tdata <- test_data_imp[variables] 

#training 
rn_train <- sample(nrow(tdata), 
        floor(nrow(tdata)*0.75)) 
train <- tdata[rn_train,] 
test <- tdata[-rn_train,] 
model <- NaiveBayes(churned ~., data=train) 

# testing 
predictions <- predict(model, test) 
# caret's confusionMatrix() takes the predicted classes first, then the true labels
confusionMatrix(predictions$class, test$churned)

Everything works fine up to this point.

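Now I have new data with the same structure and layout as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I'm looking for a new column that holds the predicted class for each record.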

I tried this:

## prediction ## 
# import data 
data_imp <- read.csv("pdata.csv") 
pdata <- data_imp[variables] 

actual_predictions <- predict(model, pdata) 

#append to data and output (as head by default) 
predicted_data <- cbind(pdata, actual_predictions$class) 

# output 
head(predicted_data) 

which throws this error:

actual_predictions <- predict(model, pdata) 
Error in object$tables[[v]][, nd] : subscript out of bounds 
In addition: Warning messages: 
1: In FUN(1:6433[[4L]], ...) : 
    Numerical 0 probability for all classes with observation 1 
2: In FUN(1:6433[[4L]], ...) : 
    Numerical 0 probability for all classes with observation 2 
3: In FUN(1:6433[[4L]], ...) : 
    Numerical 0 probability for all classes with observation 3 

How can I apply my model to the new data? I would like a new data frame with a new column containing the predicted class.

**Following the comments below, here are the head and str of the new data to predict on**

head(pdata) 
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata) 
'data.frame': 6433 obs. of 4 variables:
 $ months_subscription: int 26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

What does the data in the variable `pdata` look like? Could you add the output of `head(pdata)`? – tguzella


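Hi @tguzella, it is exactly the same as tdata, except that every instance of churned says "not churned" (since I want to predict which ones will churn). –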


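Well, given the error, I'm inclined to think the data is not the same as `tdata`... The error seems to be triggered while processing one of the features, but unless you show the data, it is impossible to know what went wrong. – tguzella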

Answers

1

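This is most likely caused by a mismatch between the factor encoding in the training data (the variable tdata in your case) and in the new data passed to the predict function (the variable pdata). Typically, the new data contains factor levels that are not present in the training data. You have to enforce consistency of the feature encoding yourself, because the predict function does not check it. I therefore suggest that you double-check the levels of the features nvk_medium and org_type in both variables.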
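
A minimal sketch of such a check, using only base R. The data frames `train_df` and `new_df` and the helper `unseen_levels` are hypothetical stand-ins for illustration and are not part of the question's code:

```r
# Hypothetical toy data standing in for tdata (training) and pdata (new data)
train_df <- data.frame(
  org_type = factor(c("Community", "Sports clubs", "Community")),
  months   = c(25, 7, 28)
)
new_df <- data.frame(
  org_type = factor(c("Community", "Charity")),  # "Charity" never appears in training
  months   = c(26, 8)
)

# For every factor column of the training data, list the levels that occur
# in the new data but not in the training data
unseen_levels <- function(train, new) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(levels(new[[col]]), levels(train[[col]]))
  })
  names(out) <- factor_cols
  out
}

unseen <- unseen_levels(train_df, new_df)
print(unseen)
```

An empty result for every column suggests the level sets agree. Any level that is listed would need to be recoded or removed from the new data before calling predict.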

The error message:

Error in object$tables[[v]][, nd] : subscript out of bounds 

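is raised while evaluating a given feature (the v-th feature) at a data point, where nd is the numeric value of the factor corresponding to that feature. You also get warnings indicating that the posterior probabilities of all classes are zero for data points ("observations") 1, 2 and 3, but it is not clear whether this is also related to the encoding of the factors.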
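
The mechanism can be illustrated without klaR. The sketch below is a simplified illustration and not klaR's actual source: `cond_table` is a made-up conditional probability table with one column per training level, and `nd` plays the role of the integer code of a factor value in the new data.

```r
# Made-up conditional probability table: rows are classes, columns are the
# factor levels seen during training (values are illustrative only)
cond_table <- matrix(
  c(0.2, 0.5, 0.3,
    0.6, 0.1, 0.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs"))
)

# A code within range selects the column for that level
in_range <- cond_table[, 2L]

# A code beyond the number of training levels has no column to select,
# so the indexing fails; capture the message instead of stopping
nd <- 4L
msg <- tryCatch(
  cond_table[, nd],
  error = function(e) conditionMessage(e)
)
print(msg)
```

With three training levels, any code of 4 or higher falls outside the table, which is the situation a new or shifted factor level creates.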

To reproduce the error you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to yours:

# Data setup 
# From http://amunategui.github.io/binary-outcome-modeling/ 
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t') 
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing')))) 
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm = TRUE)
titanicDF$Survived <- as.factor(titanicDF$Survived) 
titanicDF <- titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')] 

# Separate into training and test data 
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE) 
Data_train <- titanicDF[inds_train, , drop = FALSE] 
Data_test <- titanicDF[-inds_train, , drop = FALSE] 

which gives:

> str(Data_train) 

'data.frame': 656 obs. of 5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num 35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test) 

'data.frame': 657 obs. of 5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num 47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

Then everything proceeds as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train) 

# This will work 
pred_1 <- predict(model, Data_test) 

> str(pred_1) 
List of 2 
$ class : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ... 
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ... 
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ... 
..- attr(*, "dimnames")=List of 2 
.. ..$ : chr [1:657] "6" "7" "8" "9" ... 
.. ..$ : chr [1:2] "0" "1" 

However, if the encoding is inconsistent, for example:

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data) 
Data_test_2 <- Data_test 
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing") 
) 

> str(Data_test_2) 

'data.frame': 657 obs. of 5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num 47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

then:

> pred_2 <- predict(model, Data_test_2) 
Error in object$tables[[v]][, nd] : subscript out of bounds 
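
One possible fix, sketched with base R only, is to rebuild the factor in the new data from the level vector of the training data so that the integer codes line up again. The vectors `train_title` and `new_title` are hypothetical stand-ins for `Data_train$Title` and `Data_test_2$Title`:

```r
# Hypothetical stand-ins for the training column and the mis-encoded test column
train_title <- factor(c("Mr", "Miss", "Mrs", "Nothing", "Mr"))
new_title <- factor(
  c("Mr", "Nothing", "Miss"),
  levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

# The extra "Dr" level shifts every code by one:
# "Nothing" is code 5 here but code 4 in the training data
codes_before <- as.integer(new_title)

# Re-encode with the training levels; values absent from training become NA
new_title_fixed <- factor(as.character(new_title), levels = levels(train_title))
codes_after <- as.integer(new_title_fixed)

print(codes_before)
print(codes_after)
```

Rows that end up as NA after re-encoding would need separate handling, for example dropping them or mapping them to an existing level, before calling predict.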

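Thank you very much for your input. I looked at medium and org_type and found a long tail of levels with low counts. I grouped them into higher-level categories, which reduced the number of distinct values (levels?) to 6. Everything now works as expected! Thanks –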
