6

I am building a decision tree using scikit-learn:

clf = tree.DecisionTreeClassifier() 
clf = clf.fit(X_train, Y_train) 

This all works fine and builds the decision tree. But how do I then explore the tree?

For example, how do I find which entries from X_train appear in a particular leaf?

+3

Ran into a similar problem. You may find my answer [here](http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468) (and the walkthrough mentioned there) helpful. It uses the method decision_path from version 0.18. Substitute X_train for X_test in a few spots if you're interested in seeing the training samples. – Kevin
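For a quick sketch of that decision_path approach (my illustration, not from the thread; it assumes the clf and X_train from the question and scikit-learn >= 0.18):

import numpy as np

# decision_path gives a sparse (n_samples, n_nodes) indicator matrix;
# apply gives the id of the leaf each sample ends up in
node_indicator = clf.decision_path(X_train)
leaf_ids = clf.apply(X_train)

# all training rows that land in the same leaf as sample 0
same_leaf = np.where(leaf_ids == leaf_ids[0])[0]
print(same_leaf)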

Answers

3

The code below should produce a plot of the ten most important features:

import numpy as np 
import matplotlib.pyplot as plt 

importances = clf.feature_importances_ 
indices = np.argsort(importances)[::-1] 

# Print the feature ranking 
print("Feature ranking:") 

for f in range(10): 
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]])) 

# Plot the ten most important features (a single tree has no ensemble of 
# estimators to take a std over, so the error bars of the original 
# forest example are dropped) 
plt.figure() 
plt.title("Feature importances") 
plt.bar(range(10), importances[indices[:10]], color="r", align="center") 
plt.xticks(range(10), indices[:10]) 
plt.xlim([-1, 10]) 
plt.show() 

Taken from here and slightly modified to fit the DecisionTreeClassifier.

This doesn't exactly help you explore the tree, but it does tell you about it.

+0

Thanks, but I'd like to see, for example, which training data falls into each leaf. Right now I have to draw the decision tree, write out the rules, and write a script to filter the data using those rules. That can't be the right way! – eleanora

+0

Is your data small enough to run those calculations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure. That said, I imagine there is some way to get the structure of the tree out of sci-kit. Here is the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –

+0

This isn't for a class! I have about 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to automatically extract the rules for each leaf. Is there a way? – eleanora

5

You need to use the predict method.

After training the tree, you feed in the X values to predict their output.

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(random_state=0) 
iris = load_iris() 
tree = clf.fit(iris.data, iris.target) 
tree.predict(iris.data) 

Output:

>>> tree.predict(iris.data) 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 

To get more details about the tree structure, we can use tree_.__getstate__().

The tree structure, rendered as an "ASCII art" picture:

   0 
     _____________ 
     1   2 
       ______________ 
       3   12 
      _______  _______ 
      4  7  13 16 
      ___ ______  _____ 
      5 6 8 9  14 15 
         _____ 
         10 11 

The tree structure as an array:

In [38]: tree.tree_.__getstate__()['nodes'] 
Out[38]: 
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0), 
     (-1, -1, -2, -2.0, 0.0, 50, 50.0), 
     (3, 12, 3, 1.75, 0.5, 100, 100.0), 
     (4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0), 
     (5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0), 
     (-1, -1, -2, -2.0, 0.0, 47, 47.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0), 
     (-1, -1, -2, -2.0, 0.0, 3, 3.0), 
     (10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0), 
     (14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (-1, -1, -2, -2.0, 0.0, 43, 43.0)], 
     dtype=[('left_child', '<i8'), ('right_child', '<i8'), 
      ('feature', '<i8'), ('threshold', '<f8'), 
      ('impurity', '<f8'), ('n_node_samples', '<i8'), 
      ('weighted_n_node_samples', '<f8')]) 

Where:

  • The first node [0] is the root node.
  • Internal nodes have left_child and right_child pointing to nodes with positive indices greater than the current node's.
  • Leaves have a value of -1 for the left and right child nodes.
  • Nodes 1, 5, 6, 8, 10, 11, 14, 15, 16 are leaves.
  • The node structure is built using a depth-first search algorithm.
  • The feature field tells us which of the iris.data features was used at the node to determine the path for a sample.
  • The threshold tells us the value used to evaluate the direction based on that feature.
  • The impurity reaches 0 at the leaves, since all samples there belong to the same class once you reach a leaf.
  • n_node_samples tells us how many samples each leaf holds.

Using this information we could trivially track each sample X to the leaf where it eventually lands, by following the classification rules and thresholds in a script. Additionally, n_node_samples would allow us to run unit tests ensuring that each node gets the right number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
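As a rough illustration (my sketch, not part of the original answer), one can walk those arrays directly; this assumes the fitted tree and the iris data from the snippet above:

import numpy as np

def leaf_for_sample(clf, x):
    '''Follow the split rules from the root down to a leaf for one sample.'''
    t = clf.tree_
    node = 0
    while t.children_left[node] != -1:              # -1 marks a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]            # go left
        else:
            node = t.children_right[node]           # go right
    return node

# leaf id for every training sample
leaf_ids = np.array([leaf_for_sample(tree, x) for x in iris.data])
print(leaf_ids[:5])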

+0

Thanks. This tells me the class, but not which leaf of the decision tree each item is in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora

+0

When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that's the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre

+0

For a given leaf, I'd like to see the training data that the decision tree would place at that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I'd like to see the subset of the data you get when you apply those rules. – eleanora

5

Note: this is not an answer, only a hint at possible solutions.

I encountered a similar problem recently in my project. My goal was to extract the corresponding decision chain for some specific samples. I think your problem is a subset of mine, since you just need to record the last step in the decision chain.

Up to now, it seems the only feasible solution is to write a custom predict method in Python to keep track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out-of-the-box (as far as I know). To make it worse, it is a wrapper over a C implementation, which is pretty hard to customize.

Customization is fine for my problem, since I'm dealing with an imbalanced dataset and the samples I care about (the positive ones) are rare. So I can first filter them out using the sklearn predict and then get the decision chain using my customization.

However, this may not work for you if you have a large dataset. Because if you parse the tree and do the prediction in Python, it will run at Python speed and will not (easily) scale. You may have to fall back to customizing the C implementation.
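A minimal sketch of what such a custom predict could look like (my assumption, for a fitted classifier clf and a single sample x):

def predict_with_chain(clf, x):
    '''Return (leaf_id, chain), where chain records every decision taken.'''
    t = clf.tree_
    node, chain = 0, []
    while t.children_left[node] != -1:       # -1 marks a leaf
        feat, thr = t.feature[node], t.threshold[node]
        went_left = x[feat] <= thr
        chain.append((node, feat, thr, 'left' if went_left else 'right'))
        node = t.children_left[node] if went_left else t.children_right[node]
    return node, chain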

+0

Partial answers that include as much of your research as possible are still acceptable. – Xufox

+0

Thanks. I haven't had the time to implement the idea. Hope someone with code shows up soon. – zaxliu

3

This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds in each column boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element of leaves, which holds the node IDs of the tree's leaves.

import numpy as np 

# get the nodes which are leaves 
leaves = clf.tree_.children_left == -1 
leaves = np.arange(0,clf.tree_.node_count)[leaves] 

# loop through each leaf and figure out the data in it 
leaf_observations = np.zeros((n,len(leaves)),dtype=bool) 
# build a simpler tree as a nested list: [split feature, split threshold, left node, right node] 
thistree = [clf.tree_.feature.tolist()] 
thistree.append(clf.tree_.threshold.tolist()) 
thistree.append(clf.tree_.children_left.tolist()) 
thistree.append(clf.tree_.children_right.tolist()) 
# get the decision rules for each leaf node & apply them 
for (ind,nod) in enumerate(leaves): 
    # get the decision rules in numeric list form 
    rules = [] 
    RevTraverseTree(thistree, nod, rules) 
    # convert & apply to the data by sequentially &ing the rules 
    thisnode = np.ones(n,dtype=bool) 
    for rule in rules: 
     if rule[1] == 1: 
      thisnode = np.logical_and(thisnode,X_train[:,rule[0]] > rule[2]) 
     else: 
      thisnode = np.logical_and(thisnode,X_train[:,rule[0]] <= rule[2]) 
    # get the observations that obey all the rules - they are the ones in this leaf node 
    leaf_observations[:,ind] = thisnode 

This requires the helper function defined here, which recursively traverses the tree starting from a specified node and builds up the decision rules:

def RevTraverseTree(tree, node, rules): 
    ''' 
    Traverase an skl decision tree from a node (presumably a leaf node) 
    up to the top, building the decision rules. The rules should be 
    input as an empty list, which will be modified in place. The result 
    is a nested list of tuples: (feature, direction (left=-1), threshold). 
    The "tree" is a nested list of simplified tree attributes: 
    [split feature, split threshold, left node, right node] 
    ''' 
    # now find the node as either a left or right child of something 
    # first try to find it as a left node 
    try: 
     prevnode = tree[2].index(node) 
     leftright = -1 
    except ValueError: 
     # failed, so find it as a right node - if this also causes an exception, something's really f'd up 
     prevnode = tree[3].index(node) 
     leftright = 1 
    # now let's get the rule that caused prevnode to -> node 
    rules.append((tree[0][prevnode],leftright,tree[1][prevnode])) 
    # if we've not yet reached the top, go up the tree one more step 
    if prevnode != 0: 
     RevTraverseTree(tree, prevnode, rules) 
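A hedged usage note (my addition, using the names from the answer above): once the loop has run, each column of leaf_observations is a boolean mask over X_train, so the observations in, say, the first leaf listed in leaves can be pulled out directly:

# rows of X_train that satisfy all the rules of the first leaf
rows_in_first_leaf = X_train[leaf_observations[:, 0]]
print(rows_in_first_leaf.shape)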

1

I think an easy option would be to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:

import numpy as np 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.datasets import load_iris 

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

# apply training data to decision tree 
leaf_indices = clf.apply(iris.data) 
lookup = {} 

# build lookup table 
for i, leaf_index in enumerate(leaf_indices): 
    try: 
     lookup[leaf_index].append(iris.data[i]) 
    except KeyError: 
     lookup[leaf_index] = [] 
     lookup[leaf_index].append(iris.data[i]) 

# test 
unknown_sample = [[4., 3.1, 6.1, 1.2]] 
index = clf.apply(unknown_sample) 
print(lookup[index[0]]) 

0

Have you tried dumping your DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]?

import numpy as np 
from sklearn.tree import DecisionTreeClassifier, export_graphviz 
from sklearn.datasets import load_iris 
from graph_tool.all import * 

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

export_graphviz(clf, out_file='tree.dot') 

#load graph with graph_tool and explore structure as you please 
g = load_graph('tree.dot') 

for v in g.vertices(): 
    for e in v.out_edges(): 
     print(e) 
    for w in v.out_neighbours(): 
     print(w) 

[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

[2] https://graph-tool.skewed.de/

+0

Can you make it as beautiful as http://scikit-learn.org/stable/_images/iris.svg? – eleanora

+0

Once you've written the output with export_graphviz, something like that can be achieved with dot -Tpng tree.dot -o tree.png. –

2

I've changed a bit what Dr. Drew posted.
The following code, given a data frame and a decision tree after being fitted, returns:

  • rules_list: a list of rules (one entry per leaf)
  • values_path: a list of entries (the class counts for the nodes crossed along each path)

    import numpy as np 
    import pandas as pd 
    from sklearn.tree import DecisionTreeClassifier 
    
    def get_rules(dtc, df): 
        rules_list = [] 
        values_path = [] 
        values = dtc.tree_.value 
    
        def RevTraverseTree(tree, node, rules, pathValues): 
         ''' 
         Traverase an skl decision tree from a node (presumably a leaf node) 
         up to the top, building the decision rules. The rules should be 
         input as an empty list, which will be modified in place. The result 
         is a nested list of tuples: (feature, direction (left=-1), threshold). 
         The "tree" is a nested list of simplified tree attributes: 
         [split feature, split threshold, left node, right node] 
         ''' 
         # now find the node as either a left or right child of something 
         # first try to find it as a left node    
    
         try: 
          prevnode = tree[2].index(node)   
          leftright = '<=' 
          pathValues.append(values[prevnode]) 
         except ValueError: 
          # failed, so find it as a right node - if this also causes an exception, something's really f'd up 
          prevnode = tree[3].index(node) 
          leftright = '>' 
          pathValues.append(values[prevnode]) 
    
         # now let's get the rule that caused prevnode to -> node 
         p1 = df.columns[tree[0][prevnode]]  
         p2 = tree[1][prevnode]  
         rules.append(str(p1) + ' ' + leftright + ' ' + str(p2)) 
    
         # if we've not yet reached the top, go up the tree one more step 
         if prevnode != 0: 
          RevTraverseTree(tree, prevnode, rules, pathValues) 
    
        # get the nodes which are leaves 
        leaves = dtc.tree_.children_left == -1 
        leaves = np.arange(0,dtc.tree_.node_count)[leaves] 
    
        # build a simpler tree as a nested list: [split feature, split threshold, left node, right node] 
        thistree = [dtc.tree_.feature.tolist()] 
        thistree.append(dtc.tree_.threshold.tolist()) 
        thistree.append(dtc.tree_.children_left.tolist()) 
        thistree.append(dtc.tree_.children_right.tolist()) 
    
        # get the decision rules for each leaf node & apply them 
        for (ind,nod) in enumerate(leaves): 
    
         # get the decision rules 
         rules = [] 
         pathValues = [] 
         RevTraverseTree(thistree, nod, rules, pathValues) 
    
         pathValues.insert(0, values[nod])  
         pathValues = list(reversed(pathValues)) 
    
         rules = list(reversed(rules)) 
    
         rules_list.append(rules) 
         values_path.append(pathValues) 
    
        return (rules_list, values_path) 
    

An example follows:

from sklearn.model_selection import train_test_split 

df = pd.read_csv('df.csv') 

X = df[df.columns[:-1]] 
y = df['classification'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

dtc = DecisionTreeClassifier(max_depth=2) 
dtc.fit(X_train, y_train) 

Fitting the decision tree generated the following tree: Decision Tree with width 2

At this point, just call the function:

get_rules(dtc, df) 

And this is what the function returns:

rules = [ 
    ['first <= 63.5', 'first <= 43.5'], 
    ['first <= 63.5', 'first > 43.5'], 
    ['first > 63.5', 'second <= 19.700000762939453'], 
    ['first > 63.5', 'second > 19.700000762939453'] 
] 

values = [ 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])], 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])] 
] 

Obviously, in values, for each path, there are also the leaf values.
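As a small follow-up (my addition, assuming the values output shown above): the class predicted at each leaf is just the argmax of the last value array on its path.

import numpy as np

# one class label per leaf, from the class counts at the end of each path
leaf_classes = [int(np.argmax(path[-1])) for path in values]
print(leaf_classes)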