2014-09-26 24 views
2

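scikit-learn: Building a learning curve with SVC

I want to plot a learning curve using the SVC classifier. The dataset is somewhat skewed, with class sizes of about 150, 1000, 1000, 1000 and 150. I run into a problem when fitting the estimator: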

  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/learning_curve.py", line 135, in learning_curve
    for train, test in cv for n_train_samples in train_sizes_abs)
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1233, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
    X = atleast2d_or_csr(X, dtype=np.float64, order='C')
  File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 450, in _validate_targets
    % len(cls))
ValueError: The number of classes has to be greater than one; got 1 

My code:

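import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedKFold
from sklearn.learning_curve import learning_curve
from sklearn.svm import SVC

df = pd.read_csv('../resources/problem2_processed_validate.csv')
data, label = preprocess_text(df)

cv = StratifiedKFold(label, 10)
plt = plot_learning_curve(estimator=SVC(), title="Learning curve", X=data, y=label.values, cv=cv)

# learning_curve call that raises the error (estimator is the SVC() passed above)
train_sizes, train_scores, test_scores = learning_curve(
    estimator, data, y=label, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))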

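Even though I use stratified sampling, I still hit this error. I believe it happens because the learning curve code does not stratify when it increases the dataset size, so at one of the steps the training subset contained labels from a single class only.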

How should I solve this?

Answers

3

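You can use StratifiedShuffleSplit instead of StratifiedKFold and write the learning curve loop yourself, creating a new CV object at each iteration. StratifiedShuffleSplit lets you specify a train_size and a test_size, and you can increase train_size as you build the learning curve. As long as train_size is greater than the number of classes, it will be able to stratify.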
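The loop described above could look roughly like the following. This is only a sketch under assumptions: it uses the newer sklearn.model_selection API rather than the sklearn.cross_validation module from the question, the helper name stratified_learning_curve is made up for illustration, and the data is a synthetic skewed 5-class set standing in for the real one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC


def stratified_learning_curve(estimator, X, y, train_sizes, test_size,
                              n_splits=5, random_state=0):
    """Hypothetical helper: mean train/test accuracy per training-set size."""
    train_means, test_means = [], []
    for n_train in train_sizes:
        # A fresh splitter per step, so every training subset is stratified.
        cv = StratifiedShuffleSplit(n_splits=n_splits, train_size=int(n_train),
                                    test_size=test_size,
                                    random_state=random_state)
        train_scores, test_scores = [], []
        for train_idx, test_idx in cv.split(X, y):
            estimator.fit(X[train_idx], y[train_idx])
            train_scores.append(estimator.score(X[train_idx], y[train_idx]))
            test_scores.append(estimator.score(X[test_idx], y[test_idx]))
        train_means.append(np.mean(train_scores))
        test_means.append(np.mean(test_scores))
    return np.array(train_means), np.array(test_means)


# Synthetic, skewed 5-class data (assumed sizes, not the asker's dataset).
X, y = make_classification(n_samples=660, n_features=20, n_informative=10,
                           n_classes=5, weights=[0.05, 0.3, 0.3, 0.3, 0.05],
                           random_state=0)

# Absolute sizes; the largest train size plus test_size equals the sample count.
sizes = np.linspace(60, 530, 5).astype(int)
train_curve, test_curve = stratified_learning_curve(SVC(), X, y, sizes,
                                                    test_size=130)
print(sizes)
print(train_curve)
print(test_curve)
```

The two returned arrays can be plotted against sizes the same way as the output of learning_curve. Keeping test_size fixed while train_size grows means every point on the curve is scored on a test set of the same size.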

0

You're right. learning_curve does not stratify when it creates the smaller datasets; it simply takes the first part of the data. See lines 134-136 in learning_curve.py:

train[:n_train_samples] for n_train_samples in train_sizes_abs 
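To see why that slice can end up with a single class, here is a small illustration. It assumes the labels are sorted by class (the question does not say so explicitly), uses the class sizes quoted in the question, and relies on the newer sklearn.model_selection API.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed layout: labels sorted by class, sizes taken from the question.
y = np.repeat([0, 1, 2, 3, 4], [150, 1000, 1000, 1000, 150])
X = np.zeros((len(y), 1))  # features do not matter for the split itself

# First training fold of an unshuffled 10-fold stratified split.
train, test = next(StratifiedKFold(n_splits=10).split(X, y))
n_train_samples = 100  # a small learning-curve step, chosen for illustration

# The whole training fold is stratified ...
classes_in_fold = np.unique(y[train])
# ... but its leading slice only reaches into the first class.
classes_in_slice = np.unique(y[train[:n_train_samples]])
print(classes_in_fold, classes_in_slice)
```

Under these assumptions the full fold contains all five classes while the leading slice contains only one, which is the condition SVC rejects.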

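You can shuffle your data ahead of time, so that the slice train[:n_train_samples] is likely (but not guaranteed) to include data points from all classes. If you are willing to do a bit more work, the suggestion from @eickenberg will work.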
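A minimal sketch of the shuffle-first approach, assuming the newer sklearn.model_selection API and synthetic data in place of the real dataset (the sample count and class weights are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import SVC
from sklearn.utils import shuffle

# Synthetic stand-in for the real data (hypothetical sizes and weights).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=5, weights=[0.1, 0.25, 0.3, 0.25, 0.1],
                           random_state=0)

# Permute rows and labels together before handing them to learning_curve,
# so the leading slice of each training fold mixes the classes.
X_shuf, y_shuf = shuffle(X, y, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(), X_shuf, y_shuf, cv=StratifiedKFold(n_splits=5),
    train_sizes=np.linspace(0.1, 1.0, 5))
print(train_sizes)
print(test_scores.mean(axis=1))
```

Shuffling only makes a single-class slice unlikely; very small training sizes combined with rare classes can still miss a class. Newer scikit-learn releases also accept shuffle=True directly in learning_curve, which shuffles the training indices before taking prefixes.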

P.S. This sounds like something that should be included in sklearn. If you end up writing the code, please send a pull request on GitHub.

+0

Thanks! I'll see if I can contribute. – goh 2014-10-08 09:05:05