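Multi-label text classification on a single-label dataset
I have a dataset in which each document has a single label, as shown in the example below.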
label text
pay "i will pay now"
finance "are you the finance guy?"
law "lawyers and law"
court "was at the court today"
finance report "bank reported annual share.."
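Each text document could really carry several labels, so how can I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for your help.
Here is what I have so far: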
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import preprocessing
loc = r'C:\Users\..\Downloads\excel.xlsx'
df = pd.read_excel(loc)
X = np.array(df.docs)
z = np.array(df.title)
y = np.array(df.raw)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
Y_test = mlb.fit_transform(y_test)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
doc_new = np.array(['X has announced that it will sell $587 million'])
print("Accuracy Score: ", accuracy_score(Y_test, predicted))
print(mlb.inverse_transform(classifier.predict(doc_new)))
But I keep getting a dimension error:
.format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46
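One plausible reading of the error, offered as a sketch rather than a confirmed diagnosis: `mlb.fit_transform(y_test)` refits the binarizer on the test labels, so `mlb` ends up knowing 44 classes while the classifier was trained on, and predicts, 46 columns. Calling `mlb.transform(y_test)` instead keeps a single label set. It may also matter that `MultiLabelBinarizer` expects an iterable of labels per document; a bare string such as `"pay"` is iterated character by character, so each label is wrapped in a list below. The toy texts, labels, and variable names (`train_texts`, `y_train_sets`, and so on) are hypothetical stand-ins for `df.docs` and `df.raw`, not the original data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df.docs and df.raw.
train_texts = [
    "i will pay now",
    "are you the finance guy?",
    "lawyers and law",
    "was at the court today",
]
train_labels = ["pay", "finance", "law", "court"]
test_texts = ["the finance guy will pay", "law and court"]
test_labels = ["finance", "court"]

# Wrap each label in a list so the binarizer sees one label per document
# instead of iterating over the characters of the string.
y_train_sets = [[label] for label in train_labels]
y_test_sets = [[label] for label in test_labels]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train_sets)  # learn the label set once
Y_test = mlb.transform(y_test_sets)        # reuse it; do not refit

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(train_texts, Y_train)

# The prediction has one column per class known to mlb, so
# inverse_transform receives an indicator matrix of the expected width.
predicted = classifier.predict(test_texts)
predicted_labels = mlb.inverse_transform(predicted)
```

Because `OneVsRestClassifier` fits one binary classifier per label, a document can come back with zero, one, or several labels even though every training document carried exactly one. How useful those extra labels are depends on the data, so this is only a starting point.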