Statistics from a Spark MLlib DecisionTree

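After training an MLlib DecisionTree model (http://spark.apache.org/docs/latest/mllib-decision-tree.html), how can I compute statistics for each node, such as its support (how many samples reach that subtree) and how many samples of each label reach that subtree?

If it is easier, I'm also happy to use any tool other than Spark that takes the debug string and computes these statistics. An example of the debug string: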

DecisionTreeModel classifier of depth 20 with 20031 nodes 
    If (feature 0 <= -35.0) 
    If (feature 24 <= 176.0) 
    If (feature 0 <= -200.0) 
    If (feature 29 <= 109.0) 
     If (feature 6 <= -156.0) 
     If (feature 9 <= 0.0) 
     If (feature 20 <= -116.0) 
     If (feature 16 <= 203.0) 
      If (feature 11 <= 163.0) 
      If (feature 5 <= 384.0) 
      If (feature 15 <= 325.0) 
      If (feature 13 <= -248.0) 
       If (feature 20 <= -146.0) 
       Predict: 0.0 
       Else (feature 20 > -146.0) 
       If (feature 19 <= -58.0) 
       Predict: 6.0 
       Else (feature 19 > -58.0) 
       Predict: 0.0 
      Else (feature 13 > -248.0) 
       If (feature 9 <= -26.0) 
       Predict: 0.0 
       Else (feature 9 > -26.0) 
       If (feature 10 <= 218.0) 
...

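I'm using MLlib because of its out-of-core learning, which I need because the data doesn't fit in memory. If you know a better option than MLlib, I'm happy to give it a try.

Source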

2016-06-07 DreamFlasher

I built my model with sklearn's algorithm and integrated it with Spark to produce output like this:

if (device_type_id <= 1) 
    39 Clicks - 0.61% 
    2135 Conversions - 33.32% 
else (device_type_id > 1) 
    if (country_id <= 216) 
     1097 Clicks - 17.12% 
    else (country_id > 216) 
     if (browser_id <= 2) 
      296 Clicks - 4.62% 
     else (browser_id > 2) 
      if (browser_id <= 4) 
       if (browser_id <= 3) 
        if (operating_system_id <= 2) 
         262 Clicks - 4.09%

Here is the code I use to display a tree like this:

import math
import numpy as np

def get_code(count_df, tree, feature_names, target_names, spacer_base=" "):
    """Return a fitted sklearn tree as indented if/else lines with per-leaf class counts.

    count_df is the total number of samples, used as the percentage denominator.
    """
    left = tree.tree_.children_left
    right = tree.tree_.children_right
    threshold = tree.tree_.threshold
    # Leaves have feature index -2, which must not be looked up in feature_names.
    features = [feature_names[i] if i != -2 else None for i in tree.tree_.feature]
    value = tree.tree_.value
    temp_list = []

    def recurse(node, depth):
        spacer = spacer_base * depth
        if left[node] != -1:  # sklearn marks a leaf with child index -1
            # Assumes integer-valued features, so "x <= 1.5" is shown as "x <= 1".
            bound = str(int(math.floor(threshold[node])))
            temp_list.append(spacer + "if (" + features[node] + " <= " + bound + ")")
            recurse(left[node], depth + 1)
            temp_list.append(spacer + "else (" + features[node] + " > " + bound + ")")
            recurse(right[node], depth + 1)
        else:
            # Assumes value[node] holds per-class sample counts, shape (1, n_classes).
            target = value[node]
            for i, v in zip(np.nonzero(target)[1], target[np.nonzero(target)]):
                target_count = int(v)
                # Float division, so the percentage is not truncated to 0 under Python 2.
                percent = round(100.0 * target_count / count_df, 2)
                temp_list.append(spacer + str(target_count) + " " + str(target_names[i])
                                 + " - " + str(percent) + "%")

    recurse(0, 0)
    return temp_list

Otherwise, see the answer I gave in the post here; it is written in Scala, though, and changes the way Spark generates the decision tree.

Source
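
For the Spark-free route the question mentions, here is a minimal sketch of one possible approach: parse the debug string into a small tree, then route labelled samples through it and count what reaches each node. It is not the method used in this answer. The names `Node`, `parse_debug_string` and `count_samples` are hypothetical helpers, and the sketch assumes the complete debug string is available (not a truncated one) and that every split has the continuous `feature i <= t` form shown in the question. Categorical splits are not handled and raise an error.

```python
import re
from collections import Counter

# Line formats as they appear in the debug string quoted in the question.
IF_RE = re.compile(r"If \(feature (\d+) <= ([^)]+)\)")
PREDICT_RE = re.compile(r"Predict: (\S+)")


class Node:
    """One tree node; a leaf has feature=None and carries a prediction."""

    def __init__(self, feature=None, threshold=None, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.prediction = prediction
        self.left = None              # branch for value <= threshold
        self.right = None             # branch for value > threshold
        self.label_counts = Counter() # label -> samples reaching this node

    @property
    def support(self):
        """Total number of samples that reached this node."""
        return sum(self.label_counts.values())


def parse_debug_string(text):
    """Build a Node tree from a complete debug string.

    Relies on the If / Else / Predict ordering rather than on indentation,
    so it still works if leading whitespace was mangled.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and lines[0].startswith("DecisionTreeModel"):
        lines = lines[1:]             # drop the header line
    pos = 0

    def parse_node():
        nonlocal pos
        line = lines[pos]
        pos += 1
        leaf = PREDICT_RE.match(line)
        if leaf:
            return Node(prediction=float(leaf.group(1)))
        split = IF_RE.match(line)
        if split is None:
            raise ValueError("unsupported line: " + line)
        node = Node(feature=int(split.group(1)), threshold=float(split.group(2)))
        node.left = parse_node()
        if not lines[pos].startswith("Else"):
            raise ValueError("expected an Else line, got: " + lines[pos])
        pos += 1                      # the Else line repeats the same split
        node.right = parse_node()
        return node

    return parse_node()


def count_samples(root, samples):
    """Update counts on every node along each sample's path.

    `samples` is any iterable of (features, label) pairs, so it can be a
    generator that streams data which does not fit in memory.
    """
    for features, label in samples:
        node = root
        while True:
            node.label_counts[label] += 1
            if node.feature is None:  # reached a leaf
                break
            if features[node.feature] <= node.threshold:
                node = node.left
            else:
                node = node.right


# Tiny made-up tree and data, for illustration only.
debug = """DecisionTreeModel classifier of depth 2 with 5 nodes
  If (feature 0 <= -35.0)
   If (feature 1 <= 176.0)
    Predict: 0.0
   Else (feature 1 > 176.0)
    Predict: 6.0
  Else (feature 0 > -35.0)
   Predict: 0.0
"""
tree = parse_debug_string(debug)
data = [([-40.0, 100.0], 0.0), ([-40.0, 200.0], 6.0),
        ([-40.0, 150.0], 6.0), ([10.0, 0.0], 0.0)]
count_samples(tree, data)
print(tree.left.support, dict(tree.left.label_counts))
```

Because `count_samples` only iterates once over its input, the samples can be read from disk in chunks. After the pass, each node's `support` and `label_counts` give the two statistics asked about.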

2016-06-07 09:57:04 RoyaumeIX

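I can't use sklearn decision trees because they don't support online/out-of-core training. But the output you get looks like it might be what I want (you have two labels, clicks and conversions, is that right?). Can you provide some code to get this output? Can I also get it from a Spark MLlib model? – DreamFlasher

I have updated my answer. – RoyaumeIX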
