2016-11-21 56 views


from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20 
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df['class'] = data.target 

X = np.asarray(df.loc[:, [0, 1, 2, 3]])  # plain ndarray; np.matrix is discouraged and rejected by recent scikit-learn 
y = np.array(df['class']) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test) 


df['y_hats'] = y_hats 


ValueError: Length of values does not match length of index



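I believe 'sklearn' supports 'DataFrames' and 'Series' as arguments to 'train_test_split', so it should just work if you pass sub-sections of your df. In addition, the returned objects keep their indices, and you can use those with 'iloc' to index back into your df. See the docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – EdChum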
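A minimal sketch of the idea in the comment above, not the commenter's own code. It assumes named feature columns taken from `data.feature_names` and a fixed `random_state=0` for repeatability, neither of which is in the original post. It uses `.loc` rather than `iloc` because the split pieces carry index labels; with the default integer index the two coincide.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()

# Assumption: named columns instead of the integer columns used in the question
df = pd.DataFrame(data.data, columns=data.feature_names)
df['class'] = data.target

# Passing DataFrame/Series objects keeps the original row labels on each split
X_train, X_test, y_train, y_test = train_test_split(
    df[data.feature_names], df['class'], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# predict() returns a NumPy array in the same row order as X_test
y_hats = model.predict(X_test)

# Write predictions only to the test rows; training rows are left as NaN
df.loc[X_test.index, 'y_hats'] = y_hats
```

Because the assignment is aligned on `X_test.index`, the lengths no longer need to match the full DataFrame, which avoids the `ValueError` from the question.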




y_hats2 = model.predict(X) 

df['y_hats'] = y_hats2 


from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20 
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd 
import numpy as np 

data = load_iris() 

# bear with me for the next few steps... I'm trying to walk you through 
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset 
# put feature matrix into columnar format in dataframe 
df = pd.DataFrame(data = data.data) 

# add outcome variable 
df_class = pd.DataFrame(data = data.target) 

# finally, split into train-test 
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8) 

model = DecisionTreeClassifier() 

model.fit(X_train, y_train) 

# I've got my predictions now 
y_hats = model.predict(X_test) 

y_test['preds'] = y_hats 

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True) 

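This doesn't really solve my problem of merging in only the data that was in 'test' to begin with. If you merge in predictions for every row, how do you know which rows were in the original "test" matrix? As far as I can tell, I could run the lines you added, but I wouldn't know whether the model had already seen some of the rows in X (which defeats the whole purpose of the train-test split). – blacksite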


Updated - let me know if that doesn't solve it. – flyingmeatball
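To check the concern raised in the comments, here is a rough sketch that left-joins test-only predictions and flags the held-out rows. The `is_test` column, the `preds` Series name, and `random_state=0` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.Series(data.target, name='class')

X_train, X_test, y_train, y_test = train_test_split(
    df, target, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Index the predictions by the test rows' original labels
preds = pd.Series(model.predict(X_test), index=X_test.index, name='preds')

# Left join on the index: rows the model was trained on get NaN in 'preds'
df_out = df.join(preds, how='left')

# Hypothetical helper column marking which rows were held out
df_out['is_test'] = df_out.index.isin(X_test.index)
```

With this layout, a non-null `preds` value and `is_test == True` identify the same rows, so no prediction is attached to a row the model saw during training.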



y_hats = model.predict(X) 

df['y_hats'] = y_hats  # predict() returns a NumPy array, which has no reset_index(); assign it directly