2017-02-23

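XGBoost input data problem

I have a time series with monthly granularity and 7 months of data, and I am trying to predict profitability in month 7 by training on the first six months. I split the data 80/20. XGBoost gives an extremely low RMSE that I cannot get from any other algorithm, which made me a bit suspicious. So I checked which features are most important, and the feature list shows numbers instead of feature names. That makes me suspect I am not feeding the data to the algorithm correctly. I apologize for the noob question, but I guess I am one. Any help would be appreciated.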

require(caTools) 
require(Matrix) 
require(data.table) 
require(xgboost) 
set.seed(111) 
sample = sample.split(new_flat$SUBSCRIPTION_ID, SplitRatio = .80) 
train = subset(new_flat, sample == TRUE) 
train <- subset(train, select = -SUBSCRIPTION_ID) #Removing Subscription_id 
test = subset(new_flat, sample == FALSE) 
test <- subset(test, select = -SUBSCRIPTION_ID) #Removing Subscription_id 
target=test$Total_MARGIN_7 #Value I want to predict in the test set 
dtrain <- xgb.DMatrix(data = as.matrix(train), label = train[,7]) # I think this is the problem here 
dtest <- xgb.DMatrix(data = as.matrix(test), label = test[,7]) # I think this is the problem here 

bst <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20, 
       objective = "reg:linear") 
pred <- predict(bst, dtest) 
mean(pred) 
RMSE <- sqrt(mean((as.numeric(target) - pred)^2)) # Yes as.numeric is redundant here 
RMSE 

I am not sure XGBoost is a good algorithm for time series. Can you show some sample data?


Are you getting feature numbers as the output, or something else?


Unfortunately I cannot share the data. You may be right that xgboost is not the best choice for time series, but I just wanted to give it a try.

Answer


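Suspiciously "good" performance is very often the result of cheating (leakage) in the input data. In the question's code the feature matrix is built with as.matrix(train), which still contains column 7, the same column that is passed as the label, so the model can simply read off the answer. Here the dependent variable has been removed from the features: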

dtrain <- xgb.DMatrix(data = as.matrix(train)[,-7], label = train[,7]) 
dtest <- xgb.DMatrix(data = as.matrix(test)[,-7], label = test[,7])
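
Dropping the target by position only works if column 7 really is Total_MARGIN_7 after SUBSCRIPTION_ID has been removed. A slightly more defensive variant selects the columns by name. The following is a minimal sketch under stated assumptions, not the asker's actual pipeline: the data frame `toy` is synthetic and only stands in for new_flat, the names `target_col` and `feature_cols` are made up for illustration, and the objective is written as "reg:squarederror", the name newer xgboost releases use for what older ones called "reg:linear".

```r
library(xgboost)

set.seed(111)

# Hypothetical toy data standing in for new_flat:
# six monthly margins plus the month-7 target.
n <- 200
toy <- data.frame(matrix(rnorm(n * 6), ncol = 6))
names(toy) <- paste0("Total_MARGIN_", 1:6)
toy$Total_MARGIN_7 <- rowSums(toy[, 1:6]) / 6 + rnorm(n, sd = 0.1)

# Pick the target and the features by name rather than by position.
target_col   <- "Total_MARGIN_7"
feature_cols <- setdiff(names(toy), target_col)

# Simple 80/20 split on row indices.
train_idx <- sample(n, size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

x_train <- as.matrix(train[, feature_cols])
x_test  <- as.matrix(test[, feature_cols])

# Guard against leakage: the target must not be among the features.
stopifnot(!(target_col %in% colnames(x_train)))

dtrain <- xgb.DMatrix(data = x_train, label = train[[target_col]])
dtest  <- xgb.DMatrix(data = x_test,  label = test[[target_col]])

# Same hyperparameters as in the question.
bst <- xgb.train(
  params  = list(objective = "reg:squarederror", max_depth = 5, eta = 1),
  data    = dtrain,
  nrounds = 20
)

pred <- predict(bst, dtest)
rmse <- sqrt(mean((test[[target_col]] - pred)^2))
rmse

# Importance table; the Feature column should list only predictor columns.
importance <- xgb.importance(model = bst)
importance
```

If the RMSE goes up noticeably once the target is excluded, that is consistent with the earlier, very low value having come from leakage rather than from the model.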