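Orange: possible bug in the class_distribution attribute of rules in Orange.classification.rules.RuleClassifier.rules?
According to the Orange documentation, a rule's class_distribution attribute is the "distribution of classes in the data instances covered by this rule". However, when I apply the rules to the data instances of the same dataset the rules were derived from, the number of instances that fire a rule r sometimes differs from the counts in r.class_distribution.
For example, if I use the adult_sample dataset that ships with the Orange package and the code below: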
import numpy as np
import Orange
data = Orange.data.Table(r"C:\Python27\Lib\site-packages\Orange\datasets/adult_sample")  # raw string so the backslashes are never read as escapes
cn2_learner = Orange.classification.rules.CN2UnorderedLearner()
#only want to learn rules for class0:
cn2_learner.target_class = 0
cn2_classifier = Orange.classification.rules.RuleLearner.__call__(cn2_learner, data, 0)
RS = cn2_classifier.rules #rule set
#Find what rules fire for each data instance
rulesFired = [[r(d) for r in RS] for d in data]
classV = np.array([d.get_class()==data.domain.class_var.values[1] for d in data]).astype(int)
ind0 = np.where(classV==0)[0] #indices of data with class 0
ind1 = np.where(classV==1)[0] #indices of data with class 1
rulesFired0=np.delete(rulesFired, ind1,0) #indicates what rules fired for each class 0 instance
rulesFired1=np.delete(rulesFired, ind0,0) #indicates what rules fired for each class 1 instance
ruleFreq0 = np.sum(rulesFired0,axis=0) #how many class0 instances fired for each rule
ruleFreq1 = np.sum(rulesFired1,axis=0) #how many class1 instances fired for each rule
#Check to see if instances that fired rules match up with r.class_distribution
for ind in range(len(RS)):
    r = RS[ind]
    if r.class_distribution[0] != ruleFreq0[ind] or r.class_distribution[1] != ruleFreq1[ind]:
        print(ind) #print indices of rules with mismatches
For 32 of the 82 rules, rule.class_distribution does not match ruleFreq as defined above.
Let's use RS[5] as an example:
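The per-class tally above can also be written with boolean masks instead of np.delete. The following is a minimal sketch on a made-up firing matrix, so it runs without Orange. The arrays `fired`, `class_v`, and `recorded`, and the helpers `fire_counts_by_class` and `mismatched_rules`, are hypothetical stand-ins for rulesFired, classV, and the rules' class_distribution values; none of them are Orange output.

```python
import numpy as np

# Hypothetical stand-in for rulesFired: rows are instances, columns are rules.
fired = np.array([
    [True,  False, True],
    [True,  True,  False],
    [False, True,  False],
    [True,  False, False],
])
# Hypothetical stand-in for classV: 0 or 1 per instance.
class_v = np.array([0, 1, 0, 0])


def fire_counts_by_class(fired, class_v):
    """Return (class-0 counts, class-1 counts), one entry per rule."""
    fired = np.asarray(fired, dtype=bool)
    class_v = np.asarray(class_v)
    freq0 = fired[class_v == 0].sum(axis=0)  # keep class-0 rows, sum per rule
    freq1 = fired[class_v == 1].sum(axis=0)  # keep class-1 rows, sum per rule
    return freq0, freq1


def mismatched_rules(freq0, freq1, recorded):
    """Indices of rules whose recorded (class0, class1) pair differs from the tallies."""
    return [i for i, (c0, c1) in enumerate(recorded)
            if c0 != freq0[i] or c1 != freq1[i]]


freq0, freq1 = fire_counts_by_class(fired, class_v)
# Hypothetical recorded distributions, one (class0, class1) pair per rule.
recorded = [(2, 1), (1, 1), (0, 0)]
print(mismatched_rules(freq0, freq1, recorded))
```

Selecting rows with a mask keeps the row order of the original data, which makes it easier to map a mismatch back to specific instances later.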
#IF education=['Prof-school'] AND age>31.0 THEN y=>50K<3.000, 0.000>
RS[5].class_distribution = <3.000, 0.000> .
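According to this, 3 instances from class 0 fire this rule. However, ruleFreq0[5] = 7: when I run the rule on all of the data, 7 instances from class 0 fire it. These 7 instances are indexed by ind0[np.where(rulesFired0[:,5])[0]]. Some examples are: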
#data[220]: [43.000000, 'Private', 350661.000000, 'Prof-school', 15.000000, 'Separated', 'Tech-support', 'Not-in-family', 'White', 'Male', 0.000000, 0.000000, 50.000000, 'Columbia', '>50K']
#data[240]: [43.000000, 'State-gov', 33331.000000, 'Prof-school', 15.000000, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0.000000, 1977.000000, 70.000000, 'United-States', '>50K']
#data[372]: [41.000000, 'Private', 130126.000000, 'Prof-school', 15.000000, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0.000000, 0.000000, 80.000000, 'United-States', '>50K']
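The index expression used to find these instances maps row positions in the class-0 submatrix back to positions in the original dataset. A minimal sketch on made-up arrays (hypothetical values, not taken from adult_sample; `original_indices` is a hypothetical helper), assuming the firing matrix is a 2-D array with one column per rule:

```python
import numpy as np

class_v = np.array([1, 0, 0, 1, 0])   # hypothetical class labels
ind0 = np.where(class_v == 0)[0]      # original indices of the class-0 rows

# Hypothetical firing matrix restricted to the class-0 rows (3 rows, 2 rules).
rules_fired0 = np.array([[1, 0],
                         [0, 1],
                         [1, 1]])


def original_indices(ind, fired_sub, rule_idx):
    """Original dataset indices of the rows in fired_sub that fire rule rule_idx."""
    rows = np.flatnonzero(fired_sub[:, rule_idx])  # positions within the submatrix
    return ind[rows]                               # translate back to dataset positions


print(original_indices(ind0, rules_fired0, 0))
```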
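Finally, here are my questions:
Is this a bug in the Orange code, or does the class_distribution attribute specify something other than the number of instances from each class (out of the entire dataset used to learn the rules) that fire the rule?
Is class_distribution used to compute the quality of a rule? If so, an error in computing class_distribution would lead to an error in the rule quality calculation.
Thanks! That makes sense. Perhaps the description in the documentation should reflect this. It looks like the rest of the rule attributes also only consider instances that were not covered by previously induced rules. – nmag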
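The behaviour described in this comment can be illustrated without Orange. The sketch below is a toy model, not Orange's implementation: it assumes a covering loop in which each rule's class counts are recorded only on the instances that no earlier rule covered, after which the instances it covers are removed. The functions `record_distributions` and `full_distribution`, the data, and the two rules are all hypothetical.

```python
from collections import Counter


def full_distribution(rule, data):
    """Class counts over every instance in data that the rule covers."""
    return Counter(label for features, label in data if rule(features))


def record_distributions(rules, data):
    """Toy covering loop (an assumption, not Orange's code): each rule is
    counted only on instances that no earlier rule covered, then the
    instances it covers are removed before the next rule is counted."""
    remaining = list(data)
    recorded = []
    for rule in rules:
        recorded.append(Counter(y for f, y in remaining if rule(f)))
        remaining = [(f, y) for f, y in remaining if not rule(f)]
    return recorded


# Hypothetical data: (features, class label) pairs.
data = [
    ({"age": 43, "education": "Prof-school"}, 0),
    ({"age": 41, "education": "Prof-school"}, 0),
    ({"age": 50, "education": "Prof-school"}, 0),
    ({"age": 25, "education": "Bachelors"}, 1),
]
# Hypothetical rules, listed in the order they were "induced".
rules = [
    lambda f: f["age"] > 42,
    lambda f: f["education"] == "Prof-school",
]

recorded = record_distributions(rules, data)
for i, rule in enumerate(rules):
    print(i, dict(recorded[i]), dict(full_distribution(rule, data)))
```

In this toy run the second rule is recorded with one class-0 instance, yet it fires on three class-0 instances when applied to the full data, which is the same kind of gap as the one between RS[5].class_distribution and ruleFreq0[5].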