How do I get the class names back after using MultiLabelBinarizer?
I have a CSV file that looks like this:
target,data
AAA,some text document
AAA;BBB,more text
AAC,more text
Here is the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')
X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]
# Vectorize data
hv = HashingVectorizer(stop_words='english', non_negative=True)
X1 = hv.transform(X1)
X2 = hv.transform(X2)
mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...] len(mlb.classes_)==1363
# Y1 looks like [[0,0,0,....0,0,0], ... ] now
# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)
# predict_proba
proba = clsf.predict_proba(X2)
# want to get class names back
classnames = mlb.inverse_transform(clsf.classes_) # booom, shit happens
for i in range(len(proba)):
    # get classnames,probability dict
    preDict = dict(zip(classnames, proba[i]))
    # sort dict by probability value, print actual and top 5 predict results
    print(Y2[i], dict(sorted(preDict.items(),key=lambda d:d[1],reverse=True)[0:5]))
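The problem is that after clsf.fit(X1, Y1), clsf.classes_ is an int array [0, 1, 2, 3, ... 1362].
Why doesn't it look like Y1? How can I get the class names from clsf.classes_? Is mlb.classes_ == clsf.classes_ or not, and do they have the same order?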
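For what it's worth, here is a minimal sketch of how the two attributes appear to relate. It assumes a recent scikit-learn (so `alternate_sign=False` stands in for the older `non_negative=True`) and uses made-up toy documents and tags instead of Train.csv and Test.csv. When the classifier is fit on the indicator matrix produced by `mlb`, `clsf.classes_` is just the column indices of that matrix, so column j of `predict_proba` should line up with `mlb.classes_[j]`.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for Train.csv / Test.csv (made up for illustration).
train_text = ["some text document", "more text", "other words here", "text and words"]
train_tags = [["AAA"], ["AAA", "BBB"], ["AAC"], ["BBB"]]
test_text = ["some more text", "other words"]

# Stateless vectorizer, so transform() can be called without fitting.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_train = hv.transform(train_text)
X_test = hv.transform(test_text)

# Turn the tag lists into a 0/1 indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_tags)

clsf = OneVsRestClassifier(BernoulliNB(alpha=0.001))
clsf.fit(X_train, Y_train)
proba = clsf.predict_proba(X_test)

# clsf.classes_ holds the column indices of Y_train, not the label names.
print(clsf.classes_.tolist())
# mlb.classes_ holds the label names, in the same column order.
print(mlb.classes_.tolist())

# Pair each probability column with its label name and rank per document.
for row in proba:
    ranked = sorted(zip(mlb.classes_, row), key=lambda pair: pair[1], reverse=True)
    print(ranked[:2])
```

So instead of calling `mlb.inverse_transform(clsf.classes_)`, zipping `mlb.classes_` with each row of `proba` gives the (class name, probability) pairs directly.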
Thanks! 'label_binarizer_' is exactly what I need: 'bitarray = clsf.label_binarizer_.inverse_transform(proba, threshold=0.5)' and then 'classnames = mlb.inverse_transform(bitarray)'. But clsf.predict_proba(X2) seems to return a probability for each binary classifier – Leowan
i.e. '[('AAA','BBB',)]', 'firstResut = mlb.inverse_transform(np.array([bitarray[0]]))'. How do I get the probability of each label? – Leowan
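One possible way to keep the probabilities next to the label names is to threshold the probability matrix yourself rather than going through `inverse_transform`, which only returns the names. The sketch below uses NumPy only. `class_names` and `proba` are hypothetical values standing in for `mlb.classes_` and `clsf.predict_proba(X2)`, and `labels_with_proba` is a helper name invented for this example.

```python
import numpy as np

# Hypothetical values standing in for mlb.classes_ and clsf.predict_proba(X2).
class_names = np.array(["AAA", "AAC", "BBB"])
proba = np.array([
    [0.90, 0.05, 0.70],
    [0.10, 0.80, 0.20],
])


def labels_with_proba(row, names, threshold=0.5):
    """Return (label, probability) pairs for labels above the threshold."""
    keep = np.flatnonzero(row > threshold)
    return [(str(names[j]), float(row[j])) for j in keep]


# One list of (label, probability) pairs per document.
results = [labels_with_proba(row, class_names) for row in proba]
for pairs in results:
    print(pairs)
```

This relies on the columns of `proba` being in the same order as `mlb.classes_`, which holds when the classifier was fit on the matrix returned by `mlb.transform`.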