Scikit-learn - fit_transform on the test set

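I am struggling to use random forests in Python with scikit-learn. I am using them for text classification (3 classes: positive/negative/neutral), and the features I extract are mainly words/unigrams, so I need to convert them to numerical features. I found a way to do this with DictVectorizer's fit_transform: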

from sklearn.preprocessing import LabelEncoder 
from sklearn.metrics import classification_report 
from sklearn.feature_extraction import DictVectorizer 
from sklearn.ensemble import RandomForestClassifier 

vec = DictVectorizer(sparse=False) 
rf = RandomForestClassifier(n_estimators = 100) 
trainFeatures1 = vec.fit_transform(trainFeatures) 

# Fit the training data to the training output and create the decision trees 
rf = rf.fit(trainFeatures1.toarray(), LabelEncoder().fit_transform(trainLabels)) 

testFeatures1 = vec.fit_transform(testFeatures) 
# Take the same decision trees and run on the test data 
Output = rf.score(testFeatures1.toarray(), LabelEncoder().fit_transform(testLabels)) 

print "accuracy: " + str(Output)

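My problem is that fit_transform works on the training set, which contains about 8,000 instances, but when I try to convert my test set to numerical features as well, which has about 80,000 instances, I get a memory error: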

testFeatures1 = vec.fit_transform(testFeatures) 
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform 
return self.transform(X) 
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform 
Xa = np.zeros((len(X), len(vocab)), dtype=dtype) 
MemoryError

What could be causing this, and is there a workaround? Many thanks!

Source

2014-02-24 Crista23

Could you try using sparse features? I don't think the toarray() call should be needed. –

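scikit-learn's RandomForestClassifier does not accept sparse matrices as input. One solution is to split the test set into batches of a fixed size and then run prediction on each mini-batch. – Matt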

@rrenaud I also tried creating the vec object as vec = DictVectorizer(). It still didn't help. – Crista23

You should not call fit_transform on your test data, only transform. Otherwise you get a different vectorization from the one used during training.
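A minimal sketch of the difference, using made-up toy data (the documents and variable names here are hypothetical, not taken from the question). The vectorizer fitted on the training set keeps its column layout when it transforms the test set, whereas a vectorizer refitted on the test set learns a different one:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one dict of token counts per document.
train_docs = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}]
test_docs = [{"good": 1, "plot": 1}]  # "plot" never appears in training

vec = DictVectorizer()                   # sparse output by default
X_train = vec.fit_transform(train_docs)  # learns the columns: bad, good, movie
X_test = vec.transform(test_docs)        # reuses them; the unseen "plot" is ignored

# Refitting on the test data learns different columns, so the matrix
# would no longer line up with what the classifier was trained on.
refit = DictVectorizer().fit(test_docs)

print(vec.feature_names_)    # ['bad', 'good', 'movie']
print(refit.feature_names_)  # ['good', 'plot']
print(X_test.toarray())      # [[0. 1. 0.]]
```

With transform, every test row has exactly the columns the random forest saw during training, which is what rf.predict and rf.score expect.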

For the memory problem, I suggest TfidfVectorizer, which has many options for reducing dimensionality (by removing rare unigrams, etc.).
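As a rough illustration, two of those options are min_df and max_features. The sketch below assumes the raw text of each document is available as a string (TfidfVectorizer tokenizes text itself, unlike DictVectorizer); the corpus and the thresholds are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of raw strings.
train_texts = ["good movie", "bad movie", "good plot", "terrible acting"]
test_texts = ["good acting", "bad plot"]

# min_df=2 drops terms that occur in fewer than 2 training documents;
# max_features caps the vocabulary at the most frequent remaining terms.
tfidf = TfidfVectorizer(min_df=2, max_features=1000)
X_train = tfidf.fit_transform(train_texts)  # fit on the training text only
X_test = tfidf.transform(test_texts)        # reuse the learned vocabulary

print(sorted(tfidf.vocabulary_))  # ['good', 'movie']
print(X_train.shape)              # (4, 2)
print(X_test.shape)               # (2, 2)
```

Only "good" and "movie" appear in at least two training documents, so the other terms are dropped and both matrices have two columns. Suitable thresholds for a real corpus depend on the data.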

UPDATE

If the only problem is fitting the test data in memory, simply split it into small chunks. Instead of something like

x=vect.transform(test) 
eval(x)

you can do

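K = 10 
size = max(1, (len(test) + K - 1) // K)  # chunk size, rounded up so no samples are dropped 
for start in range(0, len(test), size): 
    x = vect.transform(test[start : start + size]) 
    eval(x)  # placeholder for your own evaluation routine, not the builtin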

and record the results/statistics and analyze them afterwards.

In particular:

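from sklearn.metrics import accuracy_score 

predictions = [] 

K = 10 
size = max(1, (len(test) + K - 1) // K)  # chunk size, rounded up so no samples are dropped 
for start in range(0, len(test), size): 
    x = vect.transform(test[start : start + size]) 
    predictions.extend(rf.predict(x))  # predict returns an array; extend appends its labels 

print(accuracy_score(true_labels, predictions))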
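For a sense of why chunking helps, here is some rough arithmetic on the dense array from the traceback, np.zeros((len(X), len(vocab))). The 80,000 test instances come from the question; the vocabulary size of 50,000 is a made-up figure, and 8 bytes per entry assumes float64, DictVectorizer's default dtype:

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize

n_test = 80000   # test instances, as in the question
n_vocab = 50000  # hypothetical vocabulary size, not from the question
K = 10           # number of chunks

full = dense_bytes(n_test, n_vocab)
chunk = dense_bytes(n_test // K, n_vocab)

print("full test set: %.1f GB" % (full / 1e9))  # 32.0 GB
print("one chunk:     %.1f GB" % (chunk / 1e9))  # 3.2 GB
```

Under these assumptions only one chunk has to be held in memory at a time, so the peak requirement drops by roughly a factor of K.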

Source

2014-02-25 06:50:07 lejlot
