Python sklearn multilabel classification: UserWarning: Label not 226 is present in all training examples

I am trying out a multilabel classification problem. My data looks like this:

DocID Content    Tags   
1  some text here... [70] 
2  some text here... [59] 
3  some text here... [183] 
4  some text here... [173] 
5  some text here... [71] 
6  some text here... [98] 
7  some text here... [211] 
8  some text here... [188] 
.  .............  ..... 
.  .............  ..... 
.  .............  .....

Here is my code:

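import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.cross_validation in older releases
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

traindf = pd.read_csv("mul.csv")
print("This is what our training data looks like:")
print(traindf)

t = TfidfVectorizer()

X = traindf["Content"]
y = traindf["Tags"]

print("Original Content")
print(X)
# Turn the raw text into a sparse TF-IDF feature matrix
X = t.fit_transform(X)
print("Content After transformation")
print(X)
print("Original Tags")
print(y)
# MultiLabelBinarizer expects each row to be an iterable of labels; if the CSV
# stores Tags as text such as "[70]", parse it into a list first
y = MultiLabelBinarizer().fit_transform(y)
print("Tags After transformation")
print(y)

print("Features extracted:")
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
print(t.get_feature_names())
print("Scores of features extracted")
idf = t.idf_
print(dict(zip(t.get_feature_names(), idf)))

print("Splitting into training and validation sets...")
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)

print("Training Set Content and Tags")
print(Xtrain)
print(ytrain)
print("Validation Set Content and Tags")
print(Xvalidate)
print(yvalidate)

print("Creating classifier")
# One binary logistic regression is fitted per label column
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))

clf.fit(Xtrain, ytrain)

predictions = clf.predict(Xvalidate)
print("Predicted Tags are:")
print(predictions)
print("Correct Tags on Validation Set are :")
print(yvalidate)
print("Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate))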

The code runs fine, but I keep getting these messages:

X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples. 
    str(classes[c])) 
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples. 
    str(classes[c])) 
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.

What does this mean? Does it indicate that my data is not diverse enough?

Source

2015-12-17 AbtPst

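Some data mining algorithms have problems when certain items occur in all or in many of the records. This is, for example, an issue when doing association rule mining with the Apriori algorithm.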
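To make the Apriori remark a bit more concrete, here is a minimal sketch of the usual support, confidence and lift measures on a toy set of transactions. The transactions and the helper names (support, confidence, lift) are made up for illustration and do not come from any particular library. When an item sits in every transaction, any rule pointing at it gets confidence 1.0 but lift 1.0, so the rule looks strong while saying nothing.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / float(len(transactions))


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical toy data: "bag" occurs in every transaction
transactions = [
    {"bread", "milk", "bag"},
    {"bread", "bag"},
    {"milk", "bag"},
    {"eggs", "bag"},
]

rule_confidence = confidence(transactions, {"bread"}, {"bag"})
rule_lift = lift(transactions, {"bread"}, {"bag"})
print(rule_confidence, rule_lift)
```

The same holds for any antecedent here, because adding "bag" to an itemset never changes its support.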

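Whether it is a problem or not depends on the classifier. I don't know the particular classifier you are using, but here is an example where it could matter: fitting a decision tree with a maximum depth.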

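Say you are fitting a decision tree with a maximum depth, using Hunt's algorithm and the GINI index to determine the best split (see here for an explanation, see slide 35). The first split could be on whether or not a record has label 288. If every record has this label, the GINI index will be optimal for such a split. This means that the first so many splits will be useless, because you are not actually splitting the training set (you are splitting it into an empty set, without 288, and the set itself, with 288). So the first so many levels of the tree are useless. If you then set a maximum depth, this could result in a low-accuracy decision tree.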
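The "not actually splitting" point can be checked with a small worked example. This is only a sketch with hypothetical toy classes and made-up helper names (gini, weighted_gini), not the code of any particular tree implementation. It computes the weighted Gini impurity of a split and compares a split where one side is empty with a split that really separates the classes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels (0.0 if empty)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity after a split: child impurities weighted by child size."""
    n = float(len(left) + len(right))
    return (len(left) * gini(left) + len(right) * gini(right)) / n


# Hypothetical toy target classes
classes = ["a", "a", "b", "b"]

before = gini(classes)
# Split on something every record has: one child is empty, the other is the whole set
after_constant = weighted_gini([], classes)
# Split on something informative: each child is pure
after_informative = weighted_gini(["a", "a"], ["b", "b"])

print(before, after_constant, after_informative)
```

The split with the empty child leaves the impurity exactly where it was, so a tree level spent on it does not bring the tree any closer to separating the classes.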

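In any case, the warning you get is not a problem with your code, at most with your data set. You should check whether the classifier you are using is sensitive to this kind of thing. If it is, it may give better results when you filter out the labels that occur everywhere.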
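A quick way to do that check is to look for label columns that take the same value in every training row. As far as I can tell from how the message is worded, "Label not 288 is present in all training examples" reads as the negative class "not 288" being the only one seen for that column, i.e. label 288 never occurs in the training split, so the sketch below reports all-zero columns as well as all-one columns. The helper names (constant_label_columns, drop_columns) and the toy matrix are made up for illustration.

```python
def constant_label_columns(rows):
    """Return {column_index: value} for columns holding one value in every row."""
    if not rows:
        return {}
    constant = {}
    for col in range(len(rows[0])):
        values = set(row[col] for row in rows)
        if len(values) == 1:
            constant[col] = values.pop()
    return constant


def drop_columns(rows, columns):
    """Return a copy of rows without the given column indices."""
    drop = set(columns)
    return [[v for i, v in enumerate(row) if i not in drop] for row in rows]


# Hypothetical toy indicator matrix standing in for ytrain
ytrain_toy = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
]

constant = constant_label_columns(ytrain_toy)   # column 1 is always 0, column 2 always 1
filtered = drop_columns(ytrain_toy, constant)   # only the informative column remains
print(constant)
print(filtered)
```

With a NumPy array such as ytrain, ytrain.tolist() gives the nested-list form these helpers expect. Whether dropping such columns actually helps depends on the classifier, as said above.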

Source

2015-12-17 20:01:13 Keelan
