Merging results from model.predict() with the original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df['class'] = data.target 

X = df.loc[:, [0, 1, 2, 3]].to_numpy()  # plain ndarray; recent scikit-learn rejects np.matrix
y = np.array(df['class']) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test)

To merge these predictions back into the original df, I tried this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

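I know I could split df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and all information about which rows ended up in X_test and y_test seems to be lost? Or am I relegated to splitting the dataframe into train and test first, and only then building the feature matrices? I would just like to fill the rows that were in train with np.nan values.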
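For illustration, here is a minimal sketch of the mechanics behind this error, using a toy frame rather than the iris data. The names test_idx and preds are hypothetical stand-ins for the test rows and their predictions. Assigning a bare array needs one value per row, whereas assigning a Series aligns on the index and leaves unmatched rows as NaN.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (10 rows, default RangeIndex)
df = pd.DataFrame({"feature": np.arange(10.0)})

# Hypothetical: labels of the rows that landed in the test split,
# and one prediction for each of them
test_idx = [2, 5, 7]
preds = np.array([1, 0, 1])

# A bare array must have exactly len(df) values, so this fails
try:
    df["y_hats"] = preds
except ValueError as exc:
    error_message = str(exc)

# A Series is aligned on its index: rows 2, 5 and 7 receive values,
# every other row is filled with NaN
df["y_hats"] = pd.Series(preds, index=test_idx)
```

So the predictions can only be placed correctly if the row labels of the test split are kept somewhere.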

2016-11-21 blacksite

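I believe 'sklearn' accepts 'DataFrames' and 'Series' as arguments to 'train_test_split', so passing the relevant subset of your df should work. The pieces it returns carry their original index, which you can use with 'iloc' to index back into the df; see the docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – EdChum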
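A minimal sketch of what this comment suggests, assuming a reasonably recent scikit-learn and pandas. The column name "y_hats" comes from the question; feature_cols and the random_state values are illustrative choices. Because the split pieces keep df's row labels, the sketch writes back with label-based .loc; with the default RangeIndex, positional iloc would select the same rows.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["class"] = data.target

feature_cols = list(data.feature_names)  # illustrative helper name

# Passing pandas objects: the returned pieces keep df's row labels
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["class"], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Write predictions only into the test rows; the new column is NaN elsewhere
df.loc[X_test.index, "y_hats"] = model.predict(X_test)
```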

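Your y_hats is only as long as the test data (20%), because you predicted on X_test. Once your model is validated and you are happy with the test predictions (by checking the accuracy of the X_test predictions against the true y_test values), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom: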

y_hats2 = model.predict(X) 

df['y_hats'] = y_hats2

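EDIT: per your comment, here is an updated version that returns the dataset with predictions added only for the rows that were in the test set: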

from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df_class = pd.DataFrame(data = data.target) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test) 

y_test['preds'] = y_hats 

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

2016-11-21 21:04:35 flyingmeatball

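This doesn't really solve my problem of merging only the data that were in 'test' to begin with. If you merge predictions for every row, how do you know which ones were in the original "test" matrix? As far as I can tell, I could run the lines you added, but I would have no idea whether the model had already seen some of the rows in X (which defeats the whole purpose of a train-test split). – blacksite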
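One possible way to keep track of which rows were held out while still building X and y as plain arrays, as the question requires: train_test_split accepts any number of equally long arrays, so the row labels can be split alongside X and y. This is only a sketch; idx_train, idx_test and the is_test column are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data)
df["class"] = data.target

# Build the arrays from the whole frame first, as in the question
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df["class"].to_numpy()

# Split the row labels together with X and y so the mapping is not lost
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Flag the held-out rows and fill predictions only for them
df["is_test"] = df.index.isin(idx_test)
df["y_hats"] = np.nan
df.loc[idx_test, "y_hats"] = model.predict(X_test)
```

Rows that were used for training keep NaN in y_hats, and the is_test flag records which rows the model never saw.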

Updated - let me know if that doesn't solve it. – flyingmeatball

You can also use

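y_hats = model.predict(X)

# model.predict returns a NumPy array, which has no reset_index();
# wrap it in a Series aligned to df's index instead
df['y_hats'] = pd.Series(y_hats, index=df.index)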

2017-12-20 13:58:08 ambar003
