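FeatureUnion with different feature dimensions

I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.

To start, I want to use the length of a sentence and its TF-IDF vector as features, so I created this pipeline: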

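from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])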

where LengthAnalyzer is a custom TransformerMixin with:

def transform(self, documents):
    for document in documents:
        yield len(document)

So LengthAnalyzer returns a single number (1 dimension), while TfidfVectorizer returns an n-dimensional list.

When I try to run this, I get

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.

What do I have to do to make this combination of features work?

Source

2017-10-13 Mirco

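Convert that number into a two-dimensional array of shape [1,1] –

Like np.array(len(document)).reshape(-1,1)? Same error – Mirco

The problem seems to come from the yield used in transform(). Probably because of the yield, the number of rows reported to scipy's hstack method is 1 instead of the actual number of samples in documents.

Your data should have 494 rows (samples) coming from TfidfVectorizer, but LengthAnalyzer reports only one row. Hence the error.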
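The row-count requirement can be reproduced outside the pipeline. The following is only a minimal sketch: the three sample documents and the all-ones matrix standing in for the TF-IDF output are made up for illustration, and scipy.sparse.hstack is called directly rather than through FeatureUnion.

```python
import numpy as np
from scipy import sparse

# Hypothetical sample data, not taken from the question.
documents = ["a short one", "a somewhat longer sentence", "tiny"]

# Stand-in for the TF-IDF output: one sparse row per document, 5 columns.
tfidf_like = sparse.csr_matrix(np.ones((len(documents), 5)))

# One length per document, shaped as a column: (n_samples, 1).
lengths = np.array([len(document) for document in documents]).reshape(-1, 1)

# Blocks with the same number of rows stack side by side.
combined = sparse.hstack([sparse.csr_matrix(lengths), tfidf_like]).tocsr()
print(lengths.shape, tfidf_like.shape, combined.shape)

# A block with a single row cannot be placed next to a 3-row block.
mismatch_raised = False
try:
    sparse.hstack([sparse.csr_matrix(np.array([[11]])), tfidf_like])
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)
```

Every block handed to hstack needs one row per sample, so the length feature has to be an (n_samples, 1) column rather than a single value.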

If you change it to

return np.array([len(document) for document in documents]).reshape(-1,1)

then the pipeline fits without problems.
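Put together, the fix might look like the sketch below. The question only shows the transform method, so the rest of LengthAnalyzer (the base classes and the no-op fit) is an assumed reconstruction, and the sentences and labels are invented toy data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthAnalyzer(TransformerMixin, BaseEstimator):
    """Assumed reconstruction: one column holding each document's length."""

    def fit(self, documents, y=None):
        # Nothing to learn; present so the class works inside a Pipeline.
        return self

    def transform(self, documents):
        # Return an (n_samples, 1) array instead of yielding values lazily.
        return np.array([len(document) for document in documents]).reshape(-1, 1)


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])

# Toy data for illustration only.
sentences = [
    "the weather is nice today",
    "stock prices fell sharply",
    "sunny skies expected tomorrow",
    "markets rallied after the report",
]
labels = [0, 1, 0, 1]

pipeline.fit(sentences, labels)

# Inspect the combined feature matrix: one row per sentence,
# one length column plus one column per TF-IDF vocabulary term.
features = pipeline.named_steps['features'].transform(sentences)
predictions = pipeline.predict(sentences)
print(features.shape, predictions)
```

Since the length is on a different scale than the TF-IDF weights, scaling the 'meta' branch (for example with a scaler step after 'length') may be worth considering, though the question does not require it.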

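Note: I tried to find a related issue on the scikit-learn GitHub but was unsuccessful. You could post this as an issue there to get some real feedback on this usage.

Source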

2017-10-13 14:02:07

Works like a charm, thanks! – Mirco
