Training a sklearn LogisticRegression classifier without all possible labels

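I'm trying to use scikit-learn 0.12.1 to:

1. train a LogisticRegression classifier
2. evaluate the classifier on held-out validation data
3. feed new data to this classifier and retrieve the 5 most probable labels for each observation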

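Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels, and some of them do not occur in the available training data.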

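This results in 2 problems:

1. The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels, but it exacerbates problem 2.
2. The output of the LogisticRegression classifier's predict_proba method is an [n_samples, n_classes] array, where n_classes covers only the classes seen in the training data. This means that running argsort on the predict_proba array no longer yields values that map directly back to the label vectorizer's vocabulary.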
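
To make problem 2 concrete, here is a minimal sketch. It assumes a recent scikit-learn release, where fitted classifiers expose `classes_` (0.12.1 does not), and it uses a made-up four-label set with toy one-dimensional features; none of these names or values come from the original post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical label set; "cat" never appears in the training data.
all_labels = ["bird", "cat", "dog", "fish"]
encoder = LabelEncoder().fit(all_labels)  # the encoder knows all 4 labels

X_train = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y_train = encoder.transform(["bird", "bird", "dog", "dog", "fish", "fish"])

clf = LogisticRegression().fit(X_train, y_train)
prob = clf.predict_proba(np.array([[0.1], [2.1]]))

# predict_proba has one column per class seen in training,
# not one column per label known to the encoder.
print(prob.shape)    # (2, 3)
print(clf.classes_)  # encoded ids of the seen classes: [0 2 3]

# Column j of prob belongs to clf.classes_[j], not to encoder.classes_[j].
best_col = prob.argmax(axis=1)
misaligned = encoder.inverse_transform(best_col)             # treats column index as label id
aligned = encoder.inverse_transform(clf.classes_[best_col])  # maps column -> label id first
```

Whenever the winning column sits after the missing class, the misaligned lookup returns a neighbouring label, which is exactly the mapping problem described in item 2.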

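My question is: what is the best way to force the classifier to recognize the full set of possible classes, even when some of them are not present in the training data? Obviously it cannot learn labels it has never seen data for, but probabilities of 0 are perfectly usable in my situation.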

2013-02-22 Alexander Measure

Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,

from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # pair each class with its probability in this row (not with the whole
    # prob array), then append the unseen classes with probability 0;
    # list() is needed because zip returns an iterator in Python 3
    prob_per_class = (list(zip(clf.classes_, row))
        + list(zip(classes_not_trained, repeat(0.))))

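This yields a list of (cls, prob) pairs.

2013-02-23 11:21:31

More elegant than the workaround I was using. Is the classes_ attribute present in all sklearn classifiers? In 0.12.1, LogisticRegression only has label_, but it looks like that changes in later versions. – 2013-02-23 16:09:15

@AlexanderMeasure: yes, 'classes_' should be present on all classifiers, but currently isn't - that's a known bug that is being fixed on a per-class basis. 0.13 has 'classes_' on LR; I forgot that 0.12.1 doesn't have it yet. – 2013-02-23 17:19:04

Oops, this doesn't work. clf.predict_proba returns an array of shape [n_samples, n_clf_classes]. Iterating over the array goes row by row, so zipping the classes with prob pairs each class with an n_clf_classes-length probability array for one test sample, which isn't particularly useful. However, if we zip the classes with each row instead, it works. – 2013-02-25 18:23:29

Building on larsman's excellent answer, I ended up with this: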

from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    # zip the classes with this row's probabilities, then add the unseen classes with 0
    prob_per_class = list(zip(clf.classes_, row)) + list(zip(classes_not_trained, repeat(0.)))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    # append a list (not a generator) so np.asarray builds a 2-D float array
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)

new_prob is an [n_samples, n_classes] array just like the output of predict_proba, except that it now includes probabilities of 0 for the previously unseen classes.

2013-02-25 19:24:06

If what you want is an array like the one returned by predict_proba, but with columns corresponding to the sorted all_classes, how about:

import numpy

# searchsorted below requires all_classes to be sorted
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
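
For the original goal of retrieving the most probable labels per observation, the padded array can be fed straight into argsort. The following is a minimal sketch under stated assumptions: `expand_proba` and `top_k_labels` are hypothetical helper names, and the small hand-written `prob` array and `seen_classes` list stand in for `clf.predict_proba(test_samples)` and `clf.classes_`.

```python
import numpy as np

def expand_proba(prob, seen_classes, all_classes):
    """Pad predict_proba-style output with zero columns for unseen classes."""
    all_classes = np.array(sorted(all_classes))
    full = np.zeros((prob.shape[0], all_classes.size))
    # seen_classes must be a subset of all_classes, otherwise searchsorted
    # returns insertion points rather than exact matches
    full[:, np.searchsorted(all_classes, seen_classes)] = prob
    return full, all_classes

def top_k_labels(full_prob, all_classes, k=5):
    """Return the k most probable labels per row, most probable first."""
    k = min(k, all_classes.size)
    top_cols = np.argsort(full_prob, axis=1)[:, ::-1][:, :k]
    return all_classes[top_cols]

# Stand-ins for clf.predict_proba(test_samples) and clf.classes_
prob = np.array([[0.1, 0.6, 0.3],
                 [0.7, 0.2, 0.1]])
seen_classes = ["a", "c", "d"]
all_classes = ["a", "b", "c", "d"]  # "b" was never seen in training

full_prob, ordered_classes = expand_proba(prob, seen_classes, all_classes)
top2 = top_k_labels(full_prob, ordered_classes, k=2)
print(top2)
```

Because every unseen class gets exactly 0, such classes tie with each other, and their relative order in the ranking is arbitrary; that only matters when k is larger than the number of classes seen in training.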

2013-03-02 13:56:10 joeln
