2016-11-21 56 views
4

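Merging results from model.predict() with the original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.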

from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df['class'] = data.target 

X = np.asarray(df.loc[:, [0, 1, 2, 3]])  # plain ndarray; recent scikit-learn versions reject np.matrix
y = np.array(df['class']) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test) 

To merge these predictions back into the original df, I tried this:

df['y_hats'] = y_hats 

However, this raises:

ValueError: Length of values does not match length of index

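I know I could split df into train_df and test_df and the problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and all information about which rows ended up in X_test and y_test seems to be lost? Or am I relegated to splitting the dataframe into train and test first, and only then building the feature matrices? I would just like the rows that went into train to hold np.nan in the new column.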

+1

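I believe `sklearn` supports `DataFrames` and `Series` as arguments to `train_test_split`, so passing a sub-section of your df should work. Besides, what is returned carries the indices, so you can use them with `iloc` to index back into the df. See the docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – EdChum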
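One way to keep the matrix-based workflow from the question and still know which rows landed in the test set is to pass the row labels through the split as an extra array. This is a minimal sketch, not something taken from the thread: `idx_train`, `idx_test`, and the `random_state=0` seed are names and settings introduced here for illustration. It relies on `train_test_split` accepting any number of equal-length arrays and splitting them consistently.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Build the plain arrays first, as in the question.
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# Assumption for illustration: pass the row labels as a third array so they
# are shuffled and split together with X and y.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Train rows stay NaN; only the test rows receive a prediction.
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

Because the labels travel with the rows, `df.loc[idx_test]` recovers exactly the rows the model was evaluated on, and the rows used for training keep `np.nan` in `y_hats`.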

Answers

4

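Your y_hats will only be as long as the test data (20%), because you predicted on X_test. Once your model is validated and you are happy with the test predictions (by checking the accuracy of the X_test predictions against the true y_test values), you should rerun the prediction on the full dataset (X). Add these two lines at the bottom: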

y_hats2 = model.predict(X) 

df['y_hats'] = y_hats2 
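A quick length check makes the mismatch behind the ValueError concrete. The snippet below is only an illustrative sketch: the `n_test` and `n_all` names and the fixed `random_state=0` are assumptions added here, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

n_test = len(model.predict(X_test))  # one prediction per test row
n_all = len(model.predict(X))        # one prediction per row of the full data

print(n_test, n_all, len(df))

# Only the full-length array can be assigned as a new column of df.
df['y_hats'] = model.predict(X)
```

Note that predictions made on the full X include rows the model was trained on, so they should not be read as held-out performance.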

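EDIT: per your comment, here is an updated version that returns the full dataset with the predictions added only on the rows that were in the test dataset: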

from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df_class = pd.DataFrame(data = data.target) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test) 

y_test['preds'] = y_hats 

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True) 
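
The same index-based alignment can also be written without an explicit merge, since pandas aligns a Series to a DataFrame by index labels. The following is a hedged sketch of that variant; `target`, `preds`, `is_test`, and the `random_state=0` seed are names and settings assumed here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
target = pd.Series(data.target, name='class')

# Splitting pandas objects keeps the original row labels on each piece.
X_train, X_test, y_train, y_test = train_test_split(
    df, target, train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Label the predictions with the test rows' index, then join on the index.
preds = pd.Series(model.predict(X_test), index=X_test.index, name='preds')
df_out = df.join(preds)

# Rows that were not in the test split have NaN in 'preds'.
is_test = df_out['preds'].notna()
```

Here `is_test` doubles as a flag for which rows the model had not seen during fitting.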
+0

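This doesn't really solve my problem of merging only the data that were in `test` to begin with. If you merge predictions for every row, how do you know which ones were in the original "test" matrix? As far as I can tell, I can run the lines you added, but I would not know whether the model has already seen some of the rows in X (which defeats the whole purpose of the train-test split). – blacksite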

+0

Updated. Let me know if that doesn't solve it. – flyingmeatball

-1

You can also use:

y_hats = model.predict(X) 

df['y_hats'] = y_hats  # model.predict returns a NumPy array, which has no reset_index()