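I am working through the Kaggle Titanic machine learning dataset example and have run into the following problem: factor levels remain unchanged even after a level has been removed. The error message is: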
Error in predict.randomForest(modelFit, newtest) :
Type of predictors in new data do not match that of the training data.
Here is my full code:
#Load the libraries:
library(ggplot2)
library(randomForest)
#Load the data:
set.seed(1)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
gendermodel <- read.csv("gendermodel.csv")
genderclassmodel <- read.csv("genderclassmodel.csv")
#Preprocess the data and feature extraction:
features <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")
newtrain <- train[,features]
newtest <- test[,features]
newtrain$Embarked[newtrain$Embarked==""] <- "S"
newtrain$Fare[newtrain$Fare == 0] <- median(newtrain$Fare, na.rm=TRUE)
newtrain$Age[is.na(newtrain$Age)] <- -1
newtest$Embarked[newtest$Embarked==""] <- "S"
newtest$Fare[newtest$Fare == 0] <- median(newtest$Fare, na.rm=TRUE)
newtest$Fare <- ifelse(is.na(newtest$Fare), mean(newtest$Fare, na.rm = TRUE), newtest$Fare)
newtest$Age[is.na(newtest$Age)] <- -1
#Model building
modelFit <- randomForest(newtrain, as.factor(train$Survived), ntree = 100, importance = TRUE)
predictedOutput <- data.frame(PassengerID = test$PassengerId)
predictedOutput$Survived <- predict(modelFit, newtest)
write.csv(predictedOutput, file = "TitanicPrediction.csv", row.names=FALSE)
MDA <- importance(modelFit, type=1)
featureImportance <- data.frame(Feature = row.names(MDA), Importance = MDA[,1])
#Plots
g <- ggplot(featureImportance, aes(x=Feature, y=Importance)) + geom_bar(stat="identity") + xlab("Feature") + ylab("Importance") + ggtitle("Feature importance")
ggsave("FeatureImportance.png", g)
I understand what the error message means. When I run str(newtrain) and str(newtest), I still get the output below, even after the assignment newtrain$Embarked[newtrain$Embarked==""] <- "S".
str(newtrain)
'data.frame': 891 obs. of 7 variables:
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Age : num 22 38 26 35 35 -1 54 2 27 14 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
> length(which(train$Embarked == ""))
[1] 2
> length(which(newtrain$Embarked == ""))
[1] 0
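When I count the rows with a missing Embarked value in the train and newtrain datasets, I get the correct output shown above, yet the factor still has 4 levels. I don't know where I went wrong. Any help is greatly appreciated. Thanks!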
If the problem is the factor levels, have you tried `droplevels`? – aosmith
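To illustrate the comment's suggestion, here is a minimal sketch on a toy factor standing in for the Embarked column (the values are invented, not the real Kaggle data). It shows that reassigning the "" entries leaves the unused level in place, that `droplevels()` removes it, and one possible way to give the test factor the same level set as the training factor. The names `embarked` and `test_embarked` are hypothetical, and whether this alone clears the `predict.randomForest` error for the real data is an assumption.

```r
# Toy stand-in for train$Embarked (made-up values, not the Kaggle file).
embarked <- factor(c("S", "C", "", "Q", "S"))
levels(embarked)                  # includes the empty-string level ""

# Reassigning the values does not remove the now-unused "" level.
embarked[embarked == ""] <- "S"
sum(embarked == "")               # no "" values remain
nlevels(embarked)                 # but the level count is unchanged

# droplevels() discards levels that no longer occur in the data.
embarked <- droplevels(embarked)
levels(embarked)

# Hypothetical test-set column: build it with the training levels so
# both factors share the same level set.
test_embarked <- factor(c("Q", "S", "C"), levels = levels(embarked))
identical(levels(embarked), levels(test_embarked))
```

In the code from the question, the equivalent step would presumably be `newtrain$Embarked <- droplevels(newtrain$Embarked)` after the reassignment, followed by a comparison of `levels(newtrain$Embarked)` and `levels(newtest$Embarked)` before calling `predict`.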