2017-06-19

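Multi-label text classification on a single-label dataset

I have a dataset in which each document has a single label, as in the example below.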

label           text
pay             "i will pay now"
finance         "are you the finance guy?"
law             "lawyers and law"
court           "was at the court today"
finance report  "bank reported annual share.."

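A text document can carry more than one label, so how can I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for your help.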

This is what I have so far:

import numpy as np 
import pandas as pd 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.linear_model import SGDClassifier 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer 
from sklearn import preprocessing 

loc = r'C:\Users\..\Downloads\excel.xlsx' 

df = pd.read_excel(loc) 
X = np.array(df.docs) 
z = np.array(df.title) 
y = np.array(df.raw) 

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer() 
Y = mlb.fit_transform(y_train) 
Y_test = mlb.fit_transform(y_test) 

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

doc_new = np.array(['X has announced that it will sell $587 million']) 

print("Accuracy Score: ", accuracy_score(Y_test, predicted)) 
print(mlb.inverse_transform(classifier.predict(doc_new))) 

But I keep getting a dimension error:

    .format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46
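
A likely cause, judging from the code above: `mlb.fit_transform(y_test)` refits the binarizer on the test labels, so `mlb.classes_` no longer matches the columns the classifier was trained on, and `inverse_transform` rejects the prediction. Below is a minimal sketch that fits the binarizer once on the training labels and only transforms the test labels. The label lists are hypothetical and not taken from the question's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists, for illustration only.
# Each sample is a list of labels, not a bare string.
y_train = [["pay"], ["finance"], ["law"], ["finance", "report"]]
y_test = [["finance"], ["law"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the class set from the training labels
Y_test = mlb.transform(y_test)        # reuses that class set, no refit

# Both indicator matrices now have one column per training class.
print(mlb.classes_)
print(Y_train.shape, Y_test.shape)
```

Note that `MultiLabelBinarizer` expects every sample to be an iterable of labels. If a sample is a plain string such as `"finance"`, it is iterated character by character, which may be why the error message reports dozens of classes.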

Answer

I found the solution. I used pandas GroupBy

df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index() 

to combine the texts that have multiple classes, and it worked.
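
As a rough sketch of the grouping step above: the column names `id`, `doc`, and `label` come from the snippet in this answer, while the rows are made up for illustration. It assumes the same document appears once per label, so grouping collects those labels into one list per document.

```python
import pandas as pd

# Hypothetical single-label rows: document 2 appears twice, once per label.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "doc": [
        "i will pay now",
        "bank reported annual share",
        "bank reported annual share",
        "lawyers and law",
    ],
    "label": ["pay", "finance", "report", "law"],
})

# Collect all labels of each (id, doc) pair into a single list.
grouped = df.groupby(["id", "doc"])["label"].apply(list).reset_index()
print(grouped)
```

The resulting `label` column holds one list of labels per document, which is the input shape `MultiLabelBinarizer` expects.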

The dimension error has also been resolved: dimension error