
I'm new to machine learning. I recently ran into a problem and have already searched StackOverflow on the same topic, but I still can't figure it out. Could someone take a look? Thanks a lot! python sklearn: IndexError: 'too many indices for array'

#-*- coding:utf-8 -*- 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

data_train = pd.read_excel('py_train.xlsx',index_col=0) 
test_data = pd.read_excel('py_test.xlsx',index_col=0) 


from sklearn import preprocessing 

x = data_train.iloc[:,1:].as_matrix() 
y = data_train.iloc[:,0:1].as_matrix() 

sx = preprocessing.scale(x) 

from sklearn import linear_model 
clf = linear_model.LogisticRegression() 
clf.fit(sx,y) 

clf 

The code runs fine and the data is all cleaned. The data I fit looks like this:

id rep a b c d 
1 0 1 2 3 4 
2 0 2 3 4 5 
3 0 3 4 5 6 
4 1 4 5 6 7 
5 1 5 6 7 8 
6 1 6 7 8 9 
7 1 7 8 9 10 
8 1 8 9 10 11 
9 1 9 10 11 12 
10 1 10 11 12 13 

and the plot_learning_curve code further below raises an IndexError. Why does it happen, and how can I fix it?

Thanks!
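For anyone without the Excel files, here is a small sketch (illustrative values only, not the real py_train.xlsx) that rebuilds the sample table above in memory and prints the shapes the slicing produces:

import pandas as pd

# Illustrative reconstruction of the sample table; 'id' plays the role of
# the index created by index_col=0 in read_excel.
data_train = pd.DataFrame(
    {"rep": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
     "a": list(range(1, 11)), "b": list(range(2, 12)),
     "c": list(range(3, 13)), "d": list(range(4, 14))},
    columns=["rep", "a", "b", "c", "d"],
    index=pd.Index(range(1, 11), name="id"),
)

x = data_train.iloc[:, 1:].values   # features, shape (10, 4)
y = data_train.iloc[:, 0:1].values  # labels as a 2-D column, shape (10, 1)
print(x.shape, y.shape)             # .values is equivalent to the deprecated .as_matrix()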

import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)  # ylim = y-axis limits
        plt.xlabel(u"train set size")
        plt.ylabel(u"score")
        plt.gca().invert_yaxis()
        plt.grid()  # grid

        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std,
                         alpha=0.1, color="b")  # shaded +/- one std band
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train set score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"CV score")

        plt.legend(loc="best")

        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff


plot_learning_curve(clf, u"learning_curve", x, y)

The full traceback:

--------------------------------------------------------------------------- 
IndexError        Traceback (most recent call last) 
<ipython-input-18-0dc3d0934602> in <module>() 
    42  return midpoint, diff 
    43 
---> 44 plot_learning_curve(clf, u"learning_curve", x, y) 

<ipython-input-18-0dc3d0934602> in plot_learning_curve(estimator, title, x, y, ylim, cv, n_jobs, train_sizes, verbose, plot) 
     8 
     9  train_sizes, train_scores, test_scores = learning_curve(
---> 10   estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose) 
    11 
    12  train_scores_mean = np.mean(train_scores, axis=1) 

D:\Anaconda3\lib\site-packages\sklearn\learning_curve.py in learning_curve(estimator, X, y, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, error_score) 
    138  X, y = indexable(X, y) 
    139  # Make a list since we will be iterating multiple times over the folds 
--> 140  cv = list(check_cv(cv, X, y, classifier=is_classifier(estimator))) 
    141  scorer = check_scoring(estimator, scoring=scoring) 
    142 

D:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in check_cv(cv, X, y, classifier) 
    1821   if classifier: 
    1822    if type_of_target(y) in ['binary', 'multiclass']: 
-> 1823     cv = StratifiedKFold(y, cv) 
    1824    else: 
    1825     cv = KFold(_num_samples(y), cv) 

D:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in __init__(self, y, n_folds, shuffle, random_state) 
    567   for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)): 
    568    for label, (_, test_split) in zip(unique_labels, per_label_splits): 
--> 569     label_test_folds = test_folds[y == label] 
    570     # the test split can be too big because we used 
    571     # KFold(max(c, self.n_folds), self.n_folds) instead of 

IndexError: too many indices for array 

Does your data look exactly like the table in the question? – Maximilian Peters


@MaximilianPeters Yes, the data in my table are all of int type; there are just more columns, so it is not very different from the example. The error is the same. – Lucy

Answer


Logistic regression, and the cross-validation behind it, seem to accept only an array for the y values. You appear to be passing a matrix.

Check the difference:

You are passing this:

df.iloc[:,0:1].as_matrix()
array([[0],
       [1],
       [2]], dtype=int64)

but it is probably best to use

df.iloc[:,0].as_matrix() 
array([0, 1, 2], dtype=int64) 

Could you try that?
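To see why the shape matters, here is a minimal sketch (made-up values) of the line that fails inside StratifiedKFold, test_folds[y == label]: a 2-D y yields a 2-D boolean mask, and indexing a 1-D array with it raises exactly this IndexError.

import numpy as np

test_folds = np.zeros(3, dtype=int)   # stands in for the internal array

y_2d = np.array([[0], [1], [1]])      # what iloc[:, 0:1] produces: shape (3, 1)
y_1d = np.array([0, 1, 1])            # what iloc[:, 0] produces:   shape (3,)

print(test_folds[y_1d == 1])          # fine: 1-D boolean mask
try:
    test_folds[y_2d == 1]             # 2-D mask on a 1-D array
except IndexError as exc:
    print(exc)                        # "too many indices for array"

# np.ravel(y) or y.flatten() would also give the 1-D shape that fit()
# and learning_curve() expect.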


Thank you very much, you pointed out both the why and the how. I tried your suggestion, but now a different error appears: ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0 – Lucy


I put this code before the def; the problem is similar to this one: https://stackoverflow.com/questions/40524790/valueerror-this-solver-needs-samples-of-at-least-2-classes-in-the-data-but-the. I also tried it with 100 rows of examples and the error still occurs: data_train = shuffle(data_train) x = data_train.iloc[:,1:].as_matrix() y = data_train.iloc[:,0].as_matrix() – Lucy
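A hedged sketch of the kind of fix the linked question points to (column positions assumed from the sample table): shuffle the rows and confirm that y still contains both classes after slicing, so the smallest training subsets drawn by learning_curve are not all one class.

import numpy as np
from sklearn.utils import shuffle

data_train = shuffle(data_train, random_state=0)   # shuffle the rows
x = data_train.iloc[:, 1:].values                  # features
y = data_train.iloc[:, 0].values                   # 1-D labels
print(np.unique(y, return_counts=True))            # expect more than one class here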


Thanks a million, the problem has been solved. – Lucy