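MemoryError with classifier fit and partial_fit
I'm trying to use scikit-learn to predict a value for an input text string. I'm currently vectorizing the data with HashingVectorizer and training a PassiveAggressiveClassifier with partial_fit (see the code below):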
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle
a, r = [], []
vectorizer = TfidfVectorizer()
# Load the target values and the input texts from pickle files.
with open('val', 'rb') as f:
    r = pickle.load(f)
with open('text', 'rb') as f:
    a = pickle.load(f)
# Vectorize all texts, then split into training and test sets.
L = vectorizer.fit_transform(a)
training_set = L[:3250]
testing_set = L[3250:]
M = np.array(r)
training_result = M[:3250]
testing_result = M[3250:]
# partial_fit needs the full set of classes on its first call.
cls = np.unique(r)
model = PassiveAggressiveClassifier()
model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)
print(testing_result)
print(predicted)
Error log:
  File "try.py", line 89, in <module>
    model.partial_fit(training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
    coef_init, intercept_init)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
    dtype=np.float64, order="C")
MemoryError
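I previously used CountVectorizer with a logistic regression classifier, and that worked without problems. But my training data has millions of rows, so I want to do incremental learning with the script above, and it raises a MemoryError on every run.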
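As a rough aid to reading the traceback: the failing call, `_allocate_parameter_mem`, is where the model allocates its dense float64 weight array, with one row per class when there are more than two classes. Since `cls = np.unique(r)` is passed as the class list, a target with many distinct values could make that array very large. The sketch below only does the arithmetic; the class counts are hypothetical, and the feature count is borrowed from the error message in the update, not from the original run.

```python
def dense_coef_bytes(n_classes, n_features, itemsize=8):
    """Approximate bytes for a dense float64 weight array.

    Assumes one row per class for multiclass problems and a single
    row for binary problems.
    """
    rows = n_classes if n_classes > 2 else 1
    return rows * n_features * itemsize


# Hypothetical sizes, for illustration only.
n_features = 9190
for n_classes in (2, 100, 3250):
    mib = dense_coef_bytes(n_classes, n_features) / (1024.0 ** 2)
    print("%5d classes -> %.1f MiB" % (n_classes, mib))
```

If `len(np.unique(r))` turns out to be close to the number of rows, the target is probably continuous, and a regressor may be a better fit than a classifier.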
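UPDATE:
After running the partial learning in a loop, partial_fit raises an error about a mismatched number of features (ValueError: Number of features 8897 does not match previous data 9190.). Also, even when I set the max_features attribute, the resulting predictions are incorrect. Is there any way to make partial_fit accept a variable number of features?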
Execution output:
(400, 8481)
(400, 9277)
Traceback (most recent call last):
  File "f9.py", line 65, in <module>
    training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
    % (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.
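The two shapes printed above have different column counts. That pattern is what you would see if the vectorizer were re-fitted on each batch, so that every batch gets its own vocabulary. The loop itself is not shown, so this is only a guess. A minimal sketch of one way to keep the width fixed is to use the stateless HashingVectorizer, which has no fit step. The toy texts, labels, batch size and `n_features` value here are placeholders, not the data from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy data standing in for the pickled 'text' and 'val' files (hypothetical).
texts = ["good movie", "bad movie", "great film", "awful film",
         "good film", "bad acting", "great acting", "awful movie"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Stateless vectorizer: every call to transform yields the same column count.
vectorizer = HashingVectorizer(n_features=2 ** 18)
model = PassiveAggressiveClassifier()
classes = np.unique(labels)  # all classes must be known up front

batch_size = 4
shapes = []
for start in range(0, len(texts), batch_size):
    X = vectorizer.transform(texts[start:start + batch_size])
    y = labels[start:start + batch_size]
    shapes.append(X.shape)
    model.partial_fit(X, y, classes=classes)

print(shapes)
predicted = model.predict(vectorizer.transform(["good acting"]))
print(predicted)
```

With a real dataset, each batch would be read from disk inside the loop, so that only one batch of text is in memory at a time.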
Any help would be greatly appreciated.
Thanks!
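Regarding the update: can you give us a bit more code? When does it crash, after several partial_fit calls or on the second one? Can you print the shapes of your different variables (the sets and the results)? – RPresle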
Updated the question with the crash log.
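It seems to me the problem may come from the hashing vectorizer, but I would need to see all of the code to pin down the cause. We also need more details about the error during execution and about the array shapes. Since this clearly goes beyond the scope of the MemoryError or partial_fit, consider asking a new question and posting the link here. – RPresle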