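I am trying to solve a classification problem. When I feed text to scikit-learn's CountVectorizer, it raises this error:
TypeError: expected string or buffer
What is wrong with my message dataset? It contains a mixture of numbers and words, and even special characters.
A sample of what the messages look like: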
0 I have not received my gifts which I ordered ok
1 hth her wells idyll McGill kooky bbc.co
2 test test test 1 test
3 test
4 hello where is my reward points
5 hi, can you get koovs coupons or vouchers here...
Here is the code I used for the classification:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_excel('training_data.xlsx')
X_train = df.message
print X_train.shape
map_class_label = {'checkin':0, 'greeting':1,'more reward options':2,'noclass':3, 'other':4,'points':5,
'referral points':6,'snapbill':7, 'thanks':8,'voucher not working':9,'voucher':10}
df['label_num'] = df['Final Category'].map(map_class_label)
y_train = df.label_num
vectorizer = CountVectorizer(lowercase=False,decode_error='ignore')
X_train_dtm = vectorizer.fit_transform(X_train)
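@jezrael Final Category is the class label (text data) for each message, which I converted to numeric values by mapping it into the label_num column. It is not missing from the dataset; I just did not show it, because the problem occurs when I try to fit and transform the messages with CountVectorizer.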
And does my solution work? Because of the UnicodeEncodeError, – jezrael
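
One possible cause, stated as an assumption because the spreadsheet itself is not shown: read_excel may return cells that hold only digits as numbers and empty cells as NaN, so df.message would contain values that are not strings, and CountVectorizer would fail on them. The sketch below uses a small hypothetical Series in place of df.message, counts the Python types in the column, and then casts everything to strings before vectorizing. The name clean and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df.message: a digits-only cell and an empty cell
# are represented here as an int and NaN instead of strings.
messages = pd.Series(
    ["hello where is my reward points", "test test test 1 test", 12345, np.nan],
    name="message",
)

# Count the Python types present in the column to spot non-string rows.
type_counts = messages.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Replace missing values with an empty string, then cast every value to str.
clean = messages.fillna("").astype(str)

# Same vectorizer setting as in the question (decode_error omitted because
# the input here is already text, not bytes).
vectorizer = CountVectorizer(lowercase=False)
X_train_dtm = vectorizer.fit_transform(clean)
print(X_train_dtm.shape)
```

If the type count shows only str for the real column, this is not the cause and the problem lies elsewhere. On Python 2, casting with str may itself raise UnicodeEncodeError for non-ASCII text; casting to unicode instead is one option to try in that case.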