Merging results from model.predict() with original pandas DataFrame?
I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.asarray(df.loc[:, [0, 1, 2, 3]])  # plain ndarray; recent scikit-learn rejects np.matrix
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back into the original df, I tried this:
df['y_hats'] = y_hats
However, this raises:
ValueError: Length of values does not match length of index
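I know I could split df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows ended up in X_test and y_test is lost? Or am I relegated to splitting the dataframe into train and test first, and building the feature matrices afterwards? I would like to simply fill the rows that were included in train with np.nan values.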
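One way to keep the row information while still building X and y as plain arrays is to split an array of row positions alongside them, since train_test_split accepts any number of equal-length arrays. The following is a minimal sketch under that assumption; the names row_ids, ids_train and ids_test, and the random_state values, are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# Hypothetical helper array: positional row numbers of df, split together with X and y
row_ids = np.arange(len(df))

X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, row_ids, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# Start with NaN everywhere, then fill only the rows that landed in the test set
df['y_hats'] = np.nan
df.iloc[ids_test, df.columns.get_loc('y_hats')] = y_hats
```

Rows that were used for training keep np.nan in the y_hats column, which matches the behavior asked for above.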
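I believe sklearn supports DataFrames and Series as arguments passed to train_test_split, so it should work if you pass in the relevant sub-section of your df, except that what is returned are the indices, which you can use to index back into your df using iloc; see the docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – EdChum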
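The suggestion in this comment could look roughly like the sketch below. It assumes the features can be passed as a DataFrame rather than a matrix, and it uses named feature columns (data.feature_names) instead of the integer column labels in the question. The test split then carries the original index, so the predictions can be wrapped in a Series and joined back by label. The name df_out and the random_state values are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['class'] = data.target

# Pass pandas objects directly; the returned splits keep df's index labels
X = df[data.feature_names]
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Attach the test rows' index to the predictions, then align on that index
y_hats = pd.Series(model.predict(X_test), index=X_test.index, name='y_hats')
df_out = df.join(y_hats)  # rows outside the test set get NaN
```

Because the join aligns on index labels, this works even when df does not have a default 0..n-1 index, in which case label-based loc rather than positional iloc is the safer way to index back into df.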