How do I add a second feature to a CountVectorized feature using sklearn?

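I have 3 columns in my dataset:

Review: a review of the product

Type: the category or type of the product

Cost: how much the product costs

This is a multiclass problem, with Type as the target variable. There are 64 different product types in this dataset.

Review and Cost are my two features.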

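I have split the data into 4 sets, with the type variable removed from the features:

from sklearn.model_selection import train_test_split
X = data.drop('type', axis = 1) 
y = data.type 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)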

For Review, I vectorize it with the following:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words = stop) 
X_train_dtm = vect.fit_transform(X_train.review)

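This is where I'm stuck!

To run the model I need both of my features in the training set, but since X_train_dtm is a sparse matrix, I'm not sure how to concatenate my pandas Series Cost feature onto that sparse matrix. Since the Cost data is already numeric, I don't think I need to transform it, which is why I haven't used something like FeatureUnion, which combines 2 transformed features.

Any help would be appreciated!

Sample data: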

| Review           | Cost | Type        |
|:-----------------|-----:|:-----------:|
| This is a review |  200 | Toy         |
| This is a review |  100 | Toy         |
| This is a review |  800 | Electronics |
| This is a review |   35 | Home        |

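Update

After applying tarashypka's solution, I was able to add the second feature to X_train_dtm. However, I get an error when I try to do the same on the test set:

from scipy.sparse import hstack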
vect = CountVectorizer(stop_words = stop) 
X_train_dtm = vect.fit_transform(X_train.review) 
prices = X_train.prices.values[:,None] 
X_train_dtm = hstack((X_train_dtm, prices)) 

#Works perfectly for the training set above 
#But when I run with test set I get the following error 
X_test_dtm = vect.transform(X_test) 
prices_test = X_test.prices.values[:,None] 
X_test_dtm = hstack((X_test_dtm, prices_test)) 

Traceback (most recent call last): 

    File "<ipython-input-10-b2861d63b847>", line 8, in <module> 
    X_test_dtm = hstack((X_test_dtm, points_test)) 

    File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack 
    return bmat([blocks], format=format, dtype=dtype) 

    File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 581, in bmat 
    'row dimensions' % i) 

ValueError: blocks[0,:] has incompatible row dimensions

Asked 2017-07-16 by Negative Correlation

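Can you upload your data? – sera

@sera I've added some sample data, thanks!

The result of CountVectorizer, X_train_dtm in your case, is of type scipy.sparse.csr_matrix. If you don't want to convert it to a numpy array, then scipy.sparse.hstack is the way to add another column: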

>>> from scipy.sparse import hstack
>>> prices = X_train['Cost'].values[:, None]
>>> X_train_dtm = hstack((X_train_dtm, prices))

Answered 2017-07-16 22:44:37 by tarashypka

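Thanks for your help, but I'm getting the following error: ValueError: blocks[0,:] has incompatible row dimensions

Yes, that's because data and X_train_dtm have different numbers of rows, since the second comes from a split of the first. I've updated the answer, check it now. – tarashypka

Thanks! That worked, but I hit the same blocks error when I tried to stack on the test set. I've updated the original question.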
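One likely cause of the test-set error, sketched here under assumptions: the question's code calls vect.transform(X_test) on the whole DataFrame, and iterating over a DataFrame yields its column labels, so the vectorizer sees one "document" per column rather than one per row. The toy data and the lowercase column names review, cost and type below are hypothetical, and the cost column is wrapped in csr_matrix before stacking purely as a precaution.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real dataset and its column names may differ.
data = pd.DataFrame({
    "review": [
        "fun toy for kids", "small toy car", "great phone screen",
        "soft home rug", "loud toy drum", "fast phone charger",
        "warm home blanket", "cheap toy ball", "bright desk lamp",
        "sturdy toy truck", "clear phone speaker", "cozy home pillow",
    ],
    "cost": [200, 100, 800, 35, 60, 25, 45, 10, 30, 55, 90, 20],
    "type": ["Toy", "Toy", "Electronics", "Home", "Toy", "Electronics",
             "Home", "Toy", "Home", "Toy", "Electronics", "Home"],
})
X = data.drop("type", axis=1)
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train["review"])

# Passing the DataFrame itself: one row per column label, not per test sample.
wrong_dtm = vect.transform(X_test)

# Passing the text column: one row per test sample, matching the cost column.
X_test_dtm = vect.transform(X_test["review"])

# Stack the cost column (as an n x 1 sparse block) next to the word counts.
train_features = hstack((X_train_dtm, csr_matrix(X_train[["cost"]].to_numpy()))).tocsr()
test_features = hstack((X_test_dtm, csr_matrix(X_test[["cost"]].to_numpy()))).tocsr()
```

Here wrong_dtm has as many rows as X_test has columns, which is why hstack would reject it against a price column that has one row per test sample.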

Use FeatureUnion to do the hstacking for you. The example on heterogeneous data is much like your problem.
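As a rough sketch of the same idea, the snippet below uses ColumnTransformer rather than FeatureUnion. That is a swapped-in, newer scikit-learn API (available from version 0.20), not what this answer literally names, and it is chosen here only because it routes DataFrame columns to transformers directly. The toy data, the lowercase column names and the LogisticRegression classifier are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real dataset and its column names may differ.
train = pd.DataFrame({
    "review": ["fun toy for kids", "small toy car", "great phone screen",
               "fast phone charger", "soft home rug", "warm home blanket"],
    "cost": [200, 100, 800, 25, 35, 45],
    "type": ["Toy", "Toy", "Electronics", "Electronics", "Home", "Home"],
})
new_rows = pd.DataFrame({
    "review": ["tiny toy robot", "cozy home pillow"],
    "cost": [15, 20],
})

features = ColumnTransformer([
    # A scalar column name hands CountVectorizer the 1-D text column it expects.
    ("review", CountVectorizer(), "review"),
    # A list of names keeps the numeric column 2-D and passes it through unchanged.
    ("cost", "passthrough", ["cost"]),
])

# The classifier is an arbitrary choice for this sketch.
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(train[["review", "cost"]], train["type"])

# The fitted pipeline applies the same vocabulary and stacking to new rows.
predictions = model.predict(new_rows)
```

Because the vectorizer and the stacking live inside one pipeline, the training and test sets go through identical steps, which avoids the row-dimension mismatch from stacking by hand.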

Answered 2017-07-17 02:33:07 by joeln
