2017-02-09 58 views
3

Sklearn SGD partial fit

What am I doing wrong here? I have a large dataset, and I want to perform a partial fit on it using scikit-learn's SGDClassifier.

I do the following:

from sklearn.linear_model import SGDClassifier 
import pandas as pd 

chunksize = 5 
clf2 = SGDClassifier(loss='log', penalty="l2") 

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True): 
    X = train_df[features_columns] 
    Y = train_df["clicked"] 
    clf2.partial_fit(X, Y) 

and I get this error:

Traceback (most recent call last):
  File "/predict.py", line 48, in <module>
    sys.exit(0 if main() else 1)
  File "/predict.py", line 44, in main
    predict()
  File "/predict.py", line 38, in predict
    clf2.partial_fit(X, Y)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 512, in partial_fit
    coef_init=None, intercept_init=None)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 349, in _partial_fit
    _check_partial_fit_first_call(self, classes)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 297, in _check_partial_fit_first_call
    raise ValueError("classes must be passed on the first call "
ValueError: classes must be passed on the first call to partial_fit.

+1

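"Classes across all calls to partial_fit. Can be obtained via np.unique(y_all), where y_all is the target vector of the entire dataset. This argument is required for the first call to partial_fit and can be omitted in the subsequent calls. Note that y doesn't need to contain all labels in classes." http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit – 2017-02-09 21:37:32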
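To make the quoted documentation concrete, here is a minimal sketch on made-up in-memory data (the arrays X1, y1, X2, y2 are invented for illustration and are not from the question). It passes classes only on the first call, and neither batch contains all three labels:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Made-up data: three possible labels, but each batch holds only some of them.
X1, y1 = np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0, 1])
X2, y2 = np.array([[2.0, 2.0], [3.0, 3.0]]), np.array([2, 2])

# The full target vector, used only to enumerate the labels.
y_all = np.concatenate([y1, y2])

clf = SGDClassifier(random_state=0)

# First call: classes is required, even though y1 does not contain label 2.
clf.partial_fit(X1, y1, classes=np.unique(y_all))

# Later calls: classes may be omitted.
clf.partial_fit(X2, y2)
```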

+1

@JackManey Please post your comment as an answer so that the asker can accept it and/or the question can be closed. –

Answer

2

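Note that the classifier does not know the set of classes up front, so on the first call you have to pass it, e.g. classes=np.unique(target), where target is the class column. Because you are reading the data in chunks, np.unique(Y) only sees the labels present in the current chunk, so you need to make sure the first chunk contains every possible label (and, since the classes passed on later calls must match those from the first call, that the later chunks do too). With that caveat, your code would be: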

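import numpy as np

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    # classes is required on the first call; np.unique(Y) is only safe
    # if every chunk contains every label
    clf2.partial_fit(X, Y, classes=np.unique(Y))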
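If you cannot guarantee that every chunk contains every label (quite likely with chunksize = 5), one option is to collect the labels in a cheap first pass over the target column and pass that same array on every call. The following is only a sketch under stated assumptions: the in-memory CSV and the column names f1 and f2 are hypothetical stand-ins for train.csv and features_columns, and loss is left at its default because the name of the logistic loss differs between scikit-learn versions.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for train.csv: two features and a binary "clicked" target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"f1": rng.normal(size=40), "f2": rng.normal(size=40)})
demo["clicked"] = (demo["f1"] + demo["f2"] > 0).astype(int)
csv_text = demo.to_csv(index=False)

features_columns = ["f1", "f2"]  # assumed feature names
chunksize = 5

# Pass 1: read only the target column to collect every label in the dataset.
targets = pd.read_csv(io.StringIO(csv_text), usecols=["clicked"])["clicked"]
all_classes = np.unique(targets)

clf = SGDClassifier(penalty="l2", random_state=0)

# Pass 2: stream the chunks and hand the same full label set to every call,
# so a chunk that happens to contain a single label is not a problem.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=chunksize):
    clf.partial_fit(chunk[features_columns], chunk["clicked"], classes=all_classes)
```

With a real file you would replace io.StringIO(csv_text) with the path to train.csv in both read_csv calls. If the labels are known in advance, you can skip the first pass and write them out directly.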