2017-02-10 67 views

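Sklearn LogisticRegressionCV

My input looks like the arrays below. I normally read the data from a .csv file, but here I build the dataframe from lists so that the problem can be reproduced. The goal is to train a logistic regression model with cross-validation by using LogisticRegressionCV.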

indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F'] 
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 

import pandas as pd

data = [indeps, dep]
cols = ['state', 'cat_bins']

# Map each column name to its list of values
data_dict = dict(zip(cols, data))

df = pd.DataFrame.from_dict(data_dict)
df.tail()

    cat_bins state 
45 0.0   F 
46 0.0   M 
47 0.0   M 
48 0.0   F 
49 0.0   F 


'''Use pandas to one-hot encode the independent variables. Note that
this returns a sparse dataframe.'''

def heat_it2(dataframe, lst_of_columns): 
    dataframe_hot = pd.get_dummies(dataframe, 
            prefix = lst_of_columns, 
            columns = lst_of_columns, sparse=True,) 
    return dataframe_hot 

train_set_hot = heat_it2(df, ['state']) 
train_set_hot.head(2) 

    cat_bins state_F  state_M 
0  1.0   0   1 
1  1.0   1   0 

'''Use the dataframe to set up the prospective inputs to the model as numpy arrays''' 

indeps_hot = ['state_F', 'state_M'] 

X = train_set_hot[indeps_hot].values 
y = train_set_hot['cat_bins'].values 

print 'X-type:', X.shape, type(X) 
print 'y-type:', y.shape, type(y) 
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__') 
print 'y has shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X does have attribute fit:\n',hasattr(X, 'fit')
print 'y does have attribute fit:\n',hasattr(y, 'fit')

X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
y has shape, is an array and has length:
True True True 
X does have attribute fit: 
False 
y does have attribute fit: 
False 

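So the inputs to the regression appear to have the attributes that the .fit method needs. They are numpy arrays of the right shape: X is an array with dimensions [n_samples, n_features], and y is a vector with shape [n_samples,]. Here is the documentation: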

fit(X, y, sample_weight=None)[source]

Fit the model according to the given training data.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)
    Target vector relative to X.

....

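Now we try to fit the regression:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)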

... 

    TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, 
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, 
    penalty='l2', random_state=None, solver='liblinear', tol=0.0001, 
    verbose=0, warm_start=False) 

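The error message appears to come from scikit-learn's validation.py module.

The only part of that code that raises this error message is the following function (excerpt):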

def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    # etc.

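Question: since the arguments we fit the model with (X, y) do not have a 'fit' attribute, why is this error message raised?

I am using Canopy 1.7.4.3348 (64 bit) with Python 2.7, scikit-learn 18.01-3 and pandas 0.19.2-2.

Thanks for your help :)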

Answer


The problem seems to be in the scoring argument. You passed accuracy_score. The signature of accuracy_score is accuracy_score(y_true, y_pred[, ...]). But in the module logistic.py:

if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

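Since you passed accuracy_score, which is a function and not a string, the first line above does not apply and scoring stays the raw function. The line scores.append(scoring(log_reg, X_test, y_test)) is then used to evaluate the estimator. But as I said above, these arguments do not match what accuracy_score expects: the estimator log_reg lands in the y_true position, and the length check on y_true is what rejects it. Hence the error.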
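The mechanism described above can be sketched without scikit-learn at all. The snippet below is only an illustration: num_samples, toy_accuracy, ToyEstimator and toy_make_scorer are hypothetical stand-ins written for this example, not the library's actual implementations. It shows that a metric-style function called with (estimator, X, y) trips the 'fit' check, while a scorer-style wrapper given the same arguments does not.

```python
def num_samples(x):
    """Simplified stand-in for the _num_samples check quoted in the question."""
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    return len(x)


def toy_accuracy(y_true, y_pred, normalize=True):
    """Hypothetical metric with a metric-style signature (y_true, y_pred, ...)."""
    if num_samples(y_true) != num_samples(y_pred):
        raise ValueError('inconsistent lengths')
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true)) if normalize else hits


class ToyEstimator(object):
    """Hypothetical estimator that always predicts 1.0."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1.0 for _ in X]


def toy_make_scorer(metric):
    """Wrap a metric so that it accepts (estimator, X, y), like a scorer."""
    def scorer(estimator, X, y):
        return metric(y, estimator.predict(X))
    return scorer


est = ToyEstimator()
X_test = [[0, 1], [1, 0], [0, 1], [1, 0]]
y_test = [1.0, 1.0, 0.0, 1.0]

# Calling the raw metric the way the CV loop calls a scorer puts the
# estimator in the y_true slot, so the 'fit' check rejects it.
try:
    toy_accuracy(est, X_test, y_test)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The wrapped version receives the same three arguments but predicts first.
wrapped_score = toy_make_scorer(toy_accuracy)(est, X_test, y_test)
```

Here error_message holds the same "Expected sequence or array-like, got estimator ..." text as in the traceback, even though X_test and y_test themselves are perfectly valid inputs.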

Solution: use make_scorer(accuracy_score) as the scoring argument of LogisticRegressionCV, or simply pass the string 'accuracy'.

from sklearn.metrics import make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')

# OR

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')

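Note: this may be a bug in that part of the logistic.py module, or the LogisticRegressionCV documentation should clarify the signature expected of the scoring function.

You can submit an issue on GitHub and see how it goes. (Done.)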

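Thank you, both of your suggestions avoid the error. Could you tell me which part of the source code the error message comes from? – user2738815

The source of the error is the same one you pointed out in the question. It is raised because the scoring function is given incorrect arguments. Where those incorrect arguments come from is shown in the first code snippet of my answer. –

I appreciate you taking the time. Thanks. – user2738815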