How to vectorize labeled bigrams with scikit-learn?

I'm teaching myself how to use scikit-learn, and I decided to start the second task with my own corpus. I obtained some bigrams by hand, let's say:

training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
                 [('and', 'one'), ('one', 'more'), 'NEG'],
                 [('and', 'other'), ('one', 'more'), 'NEU']]

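I would like to vectorize them into a format that can be fed to some of the classification algorithms scikit-learn provides (SVC, multinomial naive Bayes, etc.). This is what I tried: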

from sklearn.feature_extraction.text import CountVectorizer 

count_vect = CountVectorizer(analyzer='word') 

X = count_vect.transform(((' '.join(x) for x in sample) 
        for sample in training_data)) 

print X.toarray()

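The problem with this is that I don't know how to handle the labels (i.e. 'POS', 'NEG', 'NEU'). Do I need to "vectorize" the labels as well in order to pass training_data to a classification algorithm, or can I leave them as 'POS' or any other kind of string? The other problem is that I get this: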

raise ValueError("Vocabulary wasn't fitted or is empty!") 
ValueError: Vocabulary wasn't fitted or is empty!

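So, how can I vectorize bigrams like the ones in training_data? I have also read about DictVectorizer and sklearn-pandas; do you think using them would be a better approach to this problem?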


2014-12-13 tumbleweed

It should look like this:

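>>> from sklearn.feature_extraction.text import CountVectorizer
>>> training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
...                  [('and', 'one'), ('one', 'more'), 'NEG'],
...                  [('and', 'other'), ('one', 'more'), 'NEU']]
>>> # Identity preprocessor and tokenizer: each sample is already a list of bigrams.
>>> count_vect = CountVectorizer(preprocessor=lambda x: x,
...                              tokenizer=lambda x: x)
>>> # Drop the trailing label from each sample before vectorizing.
>>> X = count_vect.fit_transform(doc[:-1] for doc in training_data)

>>> print(count_vect.vocabulary_)  # key order may differ between Python versions
{('and', 'one'): 1, ('a', 'text'): 0, ('is', 'a'): 3, ('and', 'other'): 2, ('this', 'is'): 5, ('one', 'more'): 4}
>>> print(X.toarray())
[[1 0 0 1 0 1]
 [0 1 0 0 1 0]
 [0 0 1 0 1 0]]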
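To make the matrix above easier to read, here is a plain-Python sketch that rebuilds the same counts without scikit-learn. It assumes that column indices follow the sorted order of the distinct bigrams, which is consistent with the vocabulary printed above; the names `docs`, `vocabulary` and `rows` are only illustrative.

```python
# Plain-Python sketch of the bag-of-bigrams counts (no scikit-learn needed).
training_data = [
    [('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
    [('and', 'one'), ('one', 'more'), 'NEG'],
    [('and', 'other'), ('one', 'more'), 'NEU'],
]

# Drop the trailing label from each sample, keeping only the bigram tuples.
docs = [sample[:-1] for sample in training_data]

# Assumption: columns are numbered in sorted order of the distinct bigrams.
distinct_bigrams = sorted({bigram for doc in docs for bigram in doc})
vocabulary = {bigram: index for index, bigram in enumerate(distinct_bigrams)}

# One row per sample, one column per bigram, each cell holding a count.
rows = []
for doc in docs:
    row = [0] * len(vocabulary)
    for bigram in doc:
        row[vocabulary[bigram]] += 1
    rows.append(row)

for row in rows:
    print(row)
```

Each row has a 1 in the columns of the bigrams that occur in that sample, so the second and third samples share the column for ('one', 'more').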

Then put your labels in a target variable:

y = [doc[-1] for doc in training_data] # ['POS', 'NEG', 'NEU']
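On whether the labels themselves need to be "vectorized": scikit-learn classifiers accept string class labels directly, so `y` can stay a list of strings. If integer codes are wanted anyway, `LabelEncoder` can do the mapping. A minimal sketch, with `encoder` and `y_encoded` as illustrative names:

```python
from sklearn.preprocessing import LabelEncoder

y = ['POS', 'NEG', 'NEU']

encoder = LabelEncoder()
# Integer codes are assigned following the sorted order of the distinct labels.
y_encoded = encoder.fit_transform(y)

print(list(encoder.classes_))                      # distinct labels, sorted
print(list(y_encoded))                             # one integer code per sample
print(list(encoder.inverse_transform(y_encoded)))  # codes mapped back to strings
```

This step is optional for classifiers such as SVC; it is mostly useful when some other piece of code needs numeric targets.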

Now you can train a model:

from sklearn.svm import SVC

model = SVC()
model.fit(X, y)
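For completeness, here is a hedged end-to-end sketch that also predicts a label for an unseen sample. The `identity` helper, the linear kernel, and the `new_docs` sample are assumptions made for this illustration, not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

training_data = [
    [('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
    [('and', 'one'), ('one', 'more'), 'NEG'],
    [('and', 'other'), ('one', 'more'), 'NEU'],
]


def identity(doc):
    """Return the sample unchanged; the bigrams are already tokens."""
    return doc


# Split each sample into its bigrams (features) and its label (target).
docs = [sample[:-1] for sample in training_data]
y = [sample[-1] for sample in training_data]

count_vect = CountVectorizer(preprocessor=identity, tokenizer=identity)
X = count_vect.fit_transform(docs)  # learns the vocabulary and builds the matrix

model = SVC(kernel='linear')  # kernel choice is an assumption for this tiny example
model.fit(X, y)

# New samples go through transform (not fit_transform) so their columns match
# the training matrix; bigrams missing from the vocabulary are ignored.
new_docs = [[('and', 'other'), ('one', 'more'), ('never', 'seen')]]
X_new = count_vect.transform(new_docs)
predicted = model.predict(X_new)
print(predicted)
```

With only one training sample per class, the prediction is not meaningful; the point is the order of the calls: `fit_transform` on the training data, `fit` on the model, then `transform` and `predict` on new data.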


2014-12-13 03:02:05 elyase

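I had actually been assigning the labels this way. The problem is that I have a much larger list of bigrams, and it isn't clear to me how scikit-learn uses the labels to learn and predict a result. Is there another Pythonic way to set the labels, rather than doing it line by line? Thanks! – tumbleweed 2014-12-13 03:07:13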

Yes, I updated my answer and also fixed the 'CountVectorizer' call so that it doesn't preprocess or tokenize your bigrams. – elyase 2014-12-13 03:12:58

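Your code has a few small errors. I suggest you open a new question about the error you are getting now and the ones you are going to get (hint: compare your code with mine for the labels 'y'). – elyase 2014-12-13 14:36:33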
