Handling categorical features with scikit-learn

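I am solving a classification problem using Random Forests. I have a set of fixed-length strings (10 characters long) that represent DNA sequences. The DNA alphabet consists of 4 letters: A, C, G, T.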

Here is a sample of my raw data:

ATGCTACTGA 
ACGTACTGAT 
AGCTATTGTA 
CGTGACTAGT 
TGACTATGAT

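Each DNA sequence comes with experimental data describing a real biological response: the molecule was either seen to elicit a biological response (1) or not (0).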

Problem:

The training set consists of both categorical (nominal) and numerical features. It has the following structure:

training_set = [ 
    {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
    'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
    'mass':370.2, 'temp':70.0}, 
    {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
    'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
    'mass':400.3, 'temp':67.2}, 
] 

target = [1, 0]
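For readers starting from the raw strings, here is a minimal sketch of how each sequence could be turned into the dict structure above. The helper name `sequence_to_features` and the way the measurements are paired with the sequences are assumptions made for illustration; they are not part of the original post.

```python
def sequence_to_features(seq, mass, temp):
    """Turn one DNA string plus its measurements into a feature dict.

    Hypothetical helper: positions become keys 'p1'..'p10', and the two
    numerical measurements are stored under 'mass' and 'temp'.
    """
    features = {"p%d" % (i + 1): base for i, base in enumerate(seq)}
    features["mass"] = mass
    features["temp"] = temp
    return features


# First two raw sequences from the sample, with the measurements shown above.
sequences = ["ATGCTACTGA", "ACGTACTGAT"]
measurements = [(370.2, 70.0), (400.3, 67.2)]

training_set = [
    sequence_to_features(seq, mass, temp)
    for seq, (mass, temp) in zip(sequences, measurements)
]
target = [1, 0]
```

Keeping one key per position means the categorical columns keep their positional meaning once they are one-hot encoded.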

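I have successfully built the classifier, using the DictVectorizer class to encode the nominal features, but I am having trouble making predictions on my test data.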

Below is a simplified version of the code I have so far:

from sklearn.ensemble import RandomForestClassifier 
from sklearn.feature_extraction import DictVectorizer 

training_set = [ 
    {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
    'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
    'mass':370.2, 'temp':70.0}, 
    {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
    'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
    'mass':400.3, 'temp':67.2}, 
] 

target = [1, 0] 

vec = DictVectorizer() 
train = vec.fit_transform(training_set).toarray() 

clf = RandomForestClassifier(n_estimators=1000) 
clf = clf.fit(train, target) 


# The following part fails. 
test_set = { 
    'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
    'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
    'mass':370.2, 'temp':70.0} 
vec = DictVectorizer() 
test = vec.fit_transform(test_set).toarray() 
print(clf.predict_proba(test))

As a result, I get an error:

ValueError: Number of features of the model must match the input. 
Model n_features is 20 and input n_features is 12

Asked 2014-01-26 by sherlock85

Possible duplicate of [How to force scikit-learn DictVectorizer not to discard features?](http://stackoverflow.com/questions/19770147/how-to-force-scikit-learn-dictvectorizer-not-to-discard-features)

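You need to use the same DictVectorizer object that was fitted on the training set to transform the test_set. Call transform rather than fit_transform, so the test data is mapped onto the columns learned from the training data: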

from sklearn.ensemble import RandomForestClassifier 
from sklearn.feature_extraction import DictVectorizer 

training_set = [ 
    {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
    'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
    'mass':370.2, 'temp':70.0}, 
    {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
    'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
    'mass':400.3, 'temp':67.2}, 
] 

target = [1, 0] 

vec = DictVectorizer() 
train = vec.fit_transform(training_set).toarray() 

clf = RandomForestClassifier(n_estimators=1000) 
clf = clf.fit(train, target) 


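# Reuse the vectorizer fitted on the training data; do not create a new one.
test_set = { 
    'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
    'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
    'mass':370.2, 'temp':70.0} 

# transform() maps the sample onto the columns learned by fit_transform().
# The single dict is wrapped in a list so it is treated as one sample.
test = vec.transform([test_set]).toarray() 
print(clf.predict_proba(test))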
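To see where the 20 and the 12 in the error message come from, the column counts can be compared directly. This is a small sketch using the same data as above; the variable names `fitted` and `refit` are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

training_set = [
    {'p1': 'A', 'p2': 'T', 'p3': 'G', 'p4': 'C', 'p5': 'T',
     'p6': 'A', 'p7': 'C', 'p8': 'T', 'p9': 'G', 'p10': 'A',
     'mass': 370.2, 'temp': 70.0},
    {'p1': 'A', 'p2': 'C', 'p3': 'G', 'p4': 'T', 'p5': 'A',
     'p6': 'C', 'p7': 'T', 'p8': 'G', 'p9': 'A', 'p10': 'T',
     'mass': 400.3, 'temp': 67.2},
]
test_set = [training_set[0]]  # the test sample from the question

# Vectorizer fitted on the training data: one column per distinct
# "position=letter" pair seen in training, plus 'mass' and 'temp'.
fitted = DictVectorizer().fit(training_set)
n_train_cols = len(fitted.feature_names_)

# A second vectorizer fitted on the test sample alone only sees one
# letter per position, so it learns fewer columns.
refit = DictVectorizer().fit(test_set)
n_refit_cols = len(refit.feature_names_)

# Transforming with the training vectorizer keeps the training layout.
n_test_cols = fitted.transform(test_set).shape[1]

print(n_train_cols, n_refit_cols, n_test_cols)
```

The first two numbers correspond to the 20 and 12 reported in the ValueError, while the third shows that reusing the fitted vectorizer gives the test sample the width the model expects.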

Answered 2014-01-26 10:52:19 by HYRY

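Thanks, it works well. However, I noticed that processing a large number of strings makes the matrix very wide and overloads my memory. I wonder if you could suggest another way of building the classifier. In the scikit-learn documentation I read about [feature hashing](http://scikit-learn.org/stable/modules/feature_extraction.html), but I couldn't find a way to use it on my data. – sherlock85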

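@s_sherly To make `FeatureHasher` work, you need to replace the categorical features with dummy variables yourself: `"p1=A": 1`, etc. However, it may be better to apply feature selection and/or dimensionality reduction, such as TruncatedSVD, to the sparse matrix that comes out of the vectorizer.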
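A rough sketch of the dummy-variable approach from the comment above, under a few assumptions: the helper `to_hashable` is hypothetical, `n_features=64` is an arbitrary width chosen for this tiny example, and `n_estimators=100` with `random_state=0` is used only to keep the run short and repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

training_set = [
    {'p1': 'A', 'p2': 'T', 'p3': 'G', 'p4': 'C', 'p5': 'T',
     'p6': 'A', 'p7': 'C', 'p8': 'T', 'p9': 'G', 'p10': 'A',
     'mass': 370.2, 'temp': 70.0},
    {'p1': 'A', 'p2': 'C', 'p3': 'G', 'p4': 'T', 'p5': 'A',
     'p6': 'C', 'p7': 'T', 'p8': 'G', 'p9': 'A', 'p10': 'T',
     'mass': 400.3, 'temp': 67.2},
]
target = [1, 0]


def to_hashable(sample):
    """Rewrite categorical entries as {"name=value": 1}; keep numeric ones.

    Hypothetical helper implementing the '"p1=A": 1' substitution by hand.
    """
    out = {}
    for name, value in sample.items():
        if isinstance(value, str):
            out["%s=%s" % (name, value)] = 1
        else:
            out[name] = value
    return out


# The hasher is stateless: the output width is fixed by n_features, so
# training and test data always end up with the same number of columns.
hasher = FeatureHasher(n_features=64, input_type="dict")

# The result is a scipy sparse matrix; it is passed on without .toarray().
train = hasher.transform([to_hashable(s) for s in training_set])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train, target)

test = hasher.transform([to_hashable(training_set[0])])
proba = clf.predict_proba(test)
print(proba)
```

Because the width is fixed in advance, memory use no longer grows with the number of distinct categories, at the cost of possible hash collisions when `n_features` is small. Skipping `.toarray()` also avoids building the dense matrix; recent scikit-learn releases accept sparse input in `RandomForestClassifier`.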
