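Unigram analysis with scikit-learn

I want to do some analysis on unigrams in scikit-learn. I created files in svmlight format and tried to run MultinomialNB(), KNeighborsClassifier(), and SVC(). I first tried this with unigrams, but I got a dimension error on X_train, presumably because the only unigrams included for a given example are the ones that actually appear in it. I then tried creating svmlight training files that include a placeholder for every unigram seen in the corpus, even those that do not occur in the given example.

The problem is that this blew the training file up from 3 MB to 300 MB, which causes a memory error when sklearn loads the file. Is there a way around either the dimension mismatch or the memory overflow?
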
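from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# trainFile and testFile are paths to the svmlight files, defined elsewhere
X_train, y_train = load_svmlight_file(trainFile)
x_test, y_test = load_svmlight_file(testFile)
try:
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    preds = clf.predict(x_test)
    print('Input data: ' + trainFile.split('.')[0])
    print('naive_bayes')
    print('accuracy: ' + str(accuracy_score(y_test, preds)))
    # Precision and recall are only reported when the positive class was predicted
    if 1 in preds:
        print('precision: ' + str(precision_score(y_test, preds)))
        print('recall: ' + str(recall_score(y_test, preds)))
except Exception as inst:
    print('fail in NB ' + 'Input data: ' + trainFile.split('.')[0])
    print(str(inst))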
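
One possible way around the dimension mismatch, without padding the files by hand, is to tell the loader how many columns to expect. The sketch below is only an illustration on made-up toy data: the file names, the toy matrices, and the temporary directory are assumptions, not part of the original setup. It relies on the `n_features` argument of `load_svmlight_file`, so the test matrix gets the same width as the training matrix while both stay sparse.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.naive_bayes import MultinomialNB

# Toy data (hypothetical): training rows use 5 feature columns,
# while the test rows only ever touch the first 3.
X_train_toy = np.array([[1, 0, 2, 0, 1],
                        [0, 3, 0, 1, 0],
                        [2, 0, 0, 0, 4],
                        [0, 1, 1, 2, 0]])
y_train_toy = np.array([1, 0, 1, 0])
X_test_toy = np.array([[1, 0, 1],
                       [0, 2, 0]])
y_test_toy = np.array([1, 0])

# Write both sets in svmlight format; only nonzero entries are stored.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.svm")
test_path = os.path.join(tmp_dir, "test.svm")
dump_svmlight_file(X_train_toy, y_train_toy, train_path, zero_based=True)
dump_svmlight_file(X_test_toy, y_test_toy, test_path, zero_based=True)

X_train, y_train = load_svmlight_file(train_path, zero_based=True)
# Without n_features the test matrix would be inferred as 3 columns wide.
X_test, y_test = load_svmlight_file(
    test_path, n_features=X_train.shape[1], zero_based=True
)

clf = MultinomialNB().fit(X_train, y_train)
preds = clf.predict(X_test)
print(X_train.shape, X_test.shape)
```

`sklearn.datasets.load_svmlight_files` (plural) takes a list of paths and likewise returns matrices with a shared number of columns, which may be simpler when both files are loaded together. Note that this only works if the same feature index means the same unigram in both files.
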
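
The data comprises 2828 instances and 1212 test examples, with 18000 distinct unigrams.

Edit: I tried using sklearn's CountVectorizer, but I still run into memory problems. Is this the best way to do it?
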
import json

def fileLoadForPipeline(trainSetFile, valSetFile):
    # Both files hold a JSON list of request records
    with open(trainSetFile) as json_file:
        tdata = json.load(json_file)
    with open(valSetFile) as json_file:
        vdata = json.load(json_file)
    x_train = []
    x_val = []
    y_train = []
    y_val = []
    for t in tdata:
        x_train.append(t['request_text'])
        y_train.append(t['requester_received_pizza'])
    for v in vdata:
        # Read from v here; the original appended t, repeating the last training record
        x_val.append(v['request_text'])
        y_val.append(v['requester_received_pizza'])
    return x_train, y_train, x_val, y_val

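from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

def buildPipeline(trainset, valset, norm):
    x_train, y_train, x_val, y_val = fileLoadForPipeline(trainset, valset)
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
    # .toarray() turns the sparse count matrix into a dense array
    xT = bigram_vectorizer.fit_transform(x_train).toarray()
    # Reuse the vocabulary learned from the training set so the columns line up
    xV = bigram_vectorizer.transform(x_val).toarray()
    if norm:
        transformer = TfidfTransformer()
        xT = transformer.fit_transform(xT)
        xV = transformer.transform(xV)
    results = []
    # max_iter and class_weight='balanced' replace the older n_iter and 'auto' spellings
    for clf, name in ((Perceptron(max_iter=50), "Perceptron"),
                      (KNeighborsClassifier(n_neighbors=40), "kNN"),
                      (MultinomialNB(alpha=.01), 'MultinomialNB'),
                      (BernoulliNB(alpha=.1), 'BernoulliNB'),
                      (svm.SVC(class_weight='balanced'), 'svc')):
        print(80 * '=')
        print(name)
        # benchmark() is not shown in this post
        results.append(benchmark(clf))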
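
As a rough illustration of the memory side of the question, here is a minimal sketch that keeps the vectorizer output as a SciPy sparse matrix instead of converting it to a dense array. The toy texts and labels are invented for the example and are not taken from the original data; whether this is enough for the real corpus is an assumption that would need checking. MultinomialNB is used here because it accepts sparse input directly.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the request texts.
train_texts = [
    "please send a pizza my family is hungry",
    "i just want free food",
    "lost my job this week and would love a pizza",
    "pizza now",
]
train_labels = [True, False, True, False]
val_texts = ["hungry family would love a pizza", "free pizza now"]

vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b", min_df=1)
# fit_transform returns a sparse matrix; no .toarray() call is made.
X_train = vectorizer.fit_transform(train_texts)
# transform (not fit_transform) reuses the training vocabulary,
# so both matrices have the same number of columns.
X_val = vectorizer.transform(val_texts)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)

clf = MultinomialNB(alpha=0.01).fit(X_train_tfidf, train_labels)
val_preds = clf.predict(X_val_tfidf)
print(issparse(X_train), X_train.shape, X_val.shape)
```

The sparse matrices store only the nonzero counts, so their size grows with the number of tokens per document rather than with the full vocabulary size.
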
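Could you post the lengths and dimensions of X_train, y_train, x_test, and y_test, along with the error you get? – user823743 2014-12-07 01:23:20
@user823743 Added. – 2014-12-07 01:28:17
I mean, could you print the dimensions in your code and post them here? Or do you get the error when allocating those arrays? And what is the error on the console? – user823743 2014-12-07 01:34:36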