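How to correctly combine numerical features with text (bag of words) in scikit-learn?

I'm writing a classifier for web pages, so I have a mix of numerical features, and I also want to classify the text. I'm using the bag-of-words approach to turn the text into a (large) numerical vector. The code ends up looking like this: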

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer 
import numpy as np 

numerical_features = [ 
    [1, 0], 
    [1, 1], 
    [0, 0], 
    [0, 1] 
] 
corpus = [ 
    'This is the first document.', 
    'This is the second second document.', 
    'And the third one', 
    'Is this the first document?', 
] 
bag_of_words_vectorizer = CountVectorizer(min_df=1) 
X = bag_of_words_vectorizer.fit_transform(corpus) 
words_counts = X.toarray() 
tfidf_transformer = TfidfTransformer() 
tfidf = tfidf_transformer.fit_transform(words_counts) 

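# get_feature_names() was removed in scikit-learn 1.2; current releases use get_feature_names_out()
feature_names = bag_of_words_vectorizer.get_feature_names_out()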
combinedFeatures = np.hstack([numerical_features, tfidf.toarray()])

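This works, but I'm concerned about accuracy. Note that there are 4 objects and only two numerical features. Even this very simple text produces a vector with 9 features (because there are 9 distinct words in the corpus). With real text there will obviously be hundreds or thousands of distinct words, so the final feature vector will hold only a handful of numerical features but more than 1000 word-based ones. Because of this, won't the classifier (SVM) weight the words over the numerical features by something like 100 to 1? If so, how can I compensate so that the bag of words is weighted equally against the numerical features?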
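One way to look at this concern is to compare the per-row length of each feature block instead of counting columns. The sketch below assumes the default TF-IDF setting norm='l2' and uses TfidfVectorizer as a shorthand for CountVectorizer followed by TfidfTransformer; the variable names text_norms and numeric_norms are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Default norm='l2' rescales every non-empty document row to unit length.
tfidf = TfidfVectorizer().fit_transform(corpus)

# Euclidean length of each row in the text block and in the numeric block.
text_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
numeric_norms = np.linalg.norm(numerical_features, axis=1)

print(text_norms)
print(numeric_norms)
```

With that normalization, the length of a document's text block stays at 1 however many distinct words the corpus contains, so the column count by itself does not fix how strongly the text weighs in a distance-based kernel.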

Source

2016-09-12 Phenglei Kai

You can use TruncatedSVD from scikit-learn to reduce the dimensionality of the word features. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html – aberger

Did you find a solution? I'm doing something similar with Spark. – schoon
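A minimal sketch of that suggestion, assuming the same toy corpus as in the question. The choice n_components=2 is arbitrary and only fits this tiny example; on real text a much larger value would normally be tried.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)  # sparse, 9 word features

# Project the word features onto a small number of components.
# n_components=2 is an assumption made for this toy corpus.
svd = TruncatedSVD(n_components=2, random_state=0)
text_reduced = svd.fit_transform(tfidf)          # dense array

# The reduced text block now has as many columns as the numeric block.
combined = np.hstack([numerical_features, text_reduced])
print(combined.shape)  # (4, 4)
```

TruncatedSVD accepts the sparse TF-IDF matrix directly, so the text features never need to be converted to a dense array before the reduction.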

You can weight the counts using Tf–idf:

import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 

np.set_printoptions(linewidth=200) 

corpus = [ 
    'This is the first document.', 
    'This is the second second document.', 
    'And the third one', 
    'Is this the first document?', 
] 

vectorizer = CountVectorizer(min_df=1) 
X = vectorizer.fit_transform(corpus) 

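# get_feature_names() was removed in scikit-learn 1.2; current releases use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()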
print(words) 
words_counts = X.toarray() 
print(words_counts) 

transformer = TfidfTransformer() 
tfidf = transformer.fit_transform(words_counts) 
print(tfidf.toarray())

The output looks like this:

# words 
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this'] 

# words_counts 
[[0 1 1 1 0 0 1 0 1] 
[0 1 0 1 0 2 1 0 1] 
[1 0 0 0 1 0 1 1 0] 
[0 1 1 1 0 0 1 0 1]] 

# tfidf transformation 
[[ 0.   0.43877674 0.54197657 0.43877674 0.   0.   0.35872874 0.   0.43877674] 
[ 0.   0.27230147 0.   0.27230147 0.   0.85322574 0.22262429 0.   0.27230147] 
[ 0.55280532 0.   0.   0.   0.55280532 0.   0.28847675 0.55280532 0.  ] 
[ 0.   0.43877674 0.54197657 0.43877674 0.   0.   0.35872874 0.   0.43877674]]

With this representation you should be able to merge in the additional binary features and train an SVC.

Source
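As a rough sketch of that merge step, the blocks can be stacked column-wise with scipy.sparse.hstack so the text features stay sparse. The labels below are made up purely so the example runs; they are not from the original post.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
labels = [0, 1, 1, 0]  # hypothetical class labels, invented for this example

tfidf = TfidfVectorizer().fit_transform(corpus)

# Stack the numeric block and the text block without densifying the text.
combined = hstack([csr_matrix(numerical_features), tfidf]).tocsr()

clf = SVC(kernel='linear')
clf.fit(combined, labels)

print(combined.shape)  # (4, 11): 2 numeric columns + 9 word columns
print(clf.predict(combined))
```

Keeping the matrix sparse matters mostly for real corpora, where thousands of word columns would make a dense array large.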

2016-09-12 08:19:35

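Unless I'm missing something, this is no different from what I posted. I already have a TfidfTransformer instance, and I already call fit_transform. My problem is that the resulting matrix is 4 items x 9 features, and it will be far larger for real text because every distinct word maps to a feature. I don't want this to overwhelm the real numerical features.

You're right. As a next step, you should train a non-linear classifier (e.g. SVC with kernel 'rbf' or 'poly') on a batch of data. I also found this thread useful: https://www.quora.com/What-are-good-ways-to-handle-discrete-and-continuous-inputs-together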
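A possible sketch of that next step, under a few assumptions that are not in the thread: the numeric block is standardized first, each block gets an explicit weight (numeric_weight and text_weight are hypothetical knobs, not scikit-learn parameters), and the labels are invented so the code runs.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
labels = [0, 1, 1, 0]  # hypothetical class labels, invented for this example

tfidf = TfidfVectorizer().fit_transform(corpus)

# Bring the numeric columns to zero mean and unit variance.
numeric_scaled = StandardScaler().fit_transform(numerical_features)

# Hypothetical per-block weights: a larger value lets that block
# contribute more to the distances the RBF kernel is built on.
numeric_weight, text_weight = 1.0, 1.0
combined = hstack([
    csr_matrix(numeric_scaled * numeric_weight),
    tfidf * text_weight,
]).tocsr()

clf = SVC(kernel='rbf')
clf.fit(combined, labels)
print(clf.predict(combined))
```

In practice the scaler and vectorizer would be fitted on training data only, and the two weights would be chosen by cross-validation rather than set by hand.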
