6

I am building a decision tree using scikit-learn:

clf = tree.DecisionTreeClassifier() 
clf = clf.fit(X_train, Y_train) 

This all works fine and builds the decision tree. But how do I then explore the tree?

For example, how do I find which entries from X_train appear in a particular leaf?

+3

Ran into a similar problem. You may find my answer [here](http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468) (and the walkthrough mentioned there) helpful. It uses the method decision_path from version 0.18. Substitute X_train for X_test in a few spots if you're interested in seeing the training samples. – Kevin
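For a quick sketch of that decision_path approach (my illustration, not from the thread; it assumes the clf and X_train from the question and scikit-learn >= 0.18):

import numpy as np

# decision_path gives a sparse (n_samples, n_nodes) indicator matrix;
# apply gives the id of the leaf each sample ends up in
node_indicator = clf.decision_path(X_train)
leaf_ids = clf.apply(X_train)

# all training rows that land in the same leaf as sample 0
same_leaf = np.where(leaf_ids == leaf_ids[0])[0]
print(same_leaf)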

Answers

3

The code below should produce a plot of the ten most important features:

import numpy as np 
import matplotlib.pyplot as plt 

importances = clf.feature_importances_ 
indices = np.argsort(importances)[::-1] 

# Print the feature ranking 
print("Feature ranking:") 

for f in range(10): 
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]])) 

# Plot the ten most important features (a single tree has no ensemble of 
# estimators to take a std over, so the error bars of the original 
# forest example are dropped) 
plt.figure() 
plt.title("Feature importances") 
plt.bar(range(10), importances[indices[:10]], color="r", align="center") 
plt.xticks(range(10), indices[:10]) 
plt.xlim([-1, 10]) 
plt.show() 

Taken from here and slightly modified to fit the DecisionTreeClassifier.

This doesn't exactly help you explore the tree, but it does tell you about it.

+0

Thanks, but I'd like to see, for example, which training data falls into each leaf. Right now I have to draw the decision tree, write out the rules, and write a script to filter the data using those rules. That can't be the right way! – eleanora

+0

Is your data small enough to run those calculations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure. That said, I imagine there is some way to get the structure of the tree out of sci-kit. Here is the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –

+0

This isn't for a class! I have about 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to automatically extract the rules for each leaf. Is there a way? – eleanora

5

You need to use the predict method.

After training the tree, you feed in the X values to predict their output.

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(random_state=0) 
iris = load_iris() 
tree = clf.fit(iris.data, iris.target) 
tree.predict(iris.data) 

Output:

>>> tree.predict(iris.data) 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 

To get more details about the tree structure, we can use tree_.__getstate__().

The tree structure, rendered as an "ASCII art" picture:

   0 
     _____________ 
     1   2 
       ______________ 
       3   12 
      _______  _______ 
      4  7  13 16 
      ___ ______  _____ 
      5 6 8 9  14 15 
         _____ 
         10 11 

The tree structure as an array:

In [38]: tree.tree_.__getstate__()['nodes'] 
Out[38]: 
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0), 
     (-1, -1, -2, -2.0, 0.0, 50, 50.0), 
     (3, 12, 3, 1.75, 0.5, 100, 100.0), 
     (4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0), 
     (5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0), 
     (-1, -1, -2, -2.0, 0.0, 47, 47.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0), 
     (-1, -1, -2, -2.0, 0.0, 3, 3.0), 
     (10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0), 
     (14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (-1, -1, -2, -2.0, 0.0, 43, 43.0)], 
     dtype=[('left_child', '<i8'), ('right_child', '<i8'), 
      ('feature', '<i8'), ('threshold', '<f8'), 
      ('impurity', '<f8'), ('n_node_samples', '<i8'), 
      ('weighted_n_node_samples', '<f8')]) 

Where:

  • The first node [0] is the root node.
  • Internal nodes have left_child and right_child pointing to nodes with positive indices greater than the current node's.
  • Leaves have a value of -1 for the left and right child nodes.
  • Nodes 1, 5, 6, 8, 10, 11, 14, 15, 16 are leaves.
  • The node structure is built using a depth-first search algorithm.
  • The feature field tells us which of the iris.data features was used at the node to determine the path for a sample.
  • The threshold tells us the value used to evaluate the direction based on that feature.
  • The impurity reaches 0 at the leaves, since all samples there belong to the same class once you reach a leaf.
  • n_node_samples tells us how many samples each leaf holds.

Using this information we could trivially track each sample X to the leaf where it eventually lands, by following the classification rules and thresholds in a script. Additionally, n_node_samples would allow us to run unit tests ensuring that each node gets the right number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
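As a rough illustration (my sketch, not part of the original answer), one can walk those arrays directly; this assumes the fitted tree and the iris data from the snippet above:

import numpy as np

def leaf_for_sample(clf, x):
    '''Follow the split rules from the root down to a leaf for one sample.'''
    t = clf.tree_
    node = 0
    while t.children_left[node] != -1:              # -1 marks a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]            # go left
        else:
            node = t.children_right[node]           # go right
    return node

# leaf id for every training sample
leaf_ids = np.array([leaf_for_sample(tree, x) for x in iris.data])
print(leaf_ids[:5])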

+0

Thanks. This tells me the class, but not which leaf of the decision tree each item is in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora

+0

When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that's the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre

+0

For a given leaf, I'd like to see the training data that the decision tree would place at that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I'd like to see the subset of the data you get when you apply those rules. – eleanora

5

Note: this is not an answer, only a hint at possible solutions.

I encountered a similar problem recently in my project. My goal was to extract the corresponding decision chain for some specific samples. I think your problem is a subset of mine, since you just need to record the last step in the decision chain.

Up to now, it seems the only feasible solution is to write a custom predict method in Python to keep track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out-of-the-box (as far as I know). To make it worse, it is a wrapper over a C implementation, which is pretty hard to customize.

Customization is fine for my problem, since I'm dealing with an imbalanced dataset and the samples I care about (the positive ones) are rare. So I can first filter them out using the sklearn predict and then get the decision chain using my customization.

However, this may not work for you if you have a large dataset. Because if you parse the tree and do the prediction in Python, it will run at Python speed and will not (easily) scale. You may have to fall back to customizing the C implementation.
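A minimal sketch of what such a custom predict could look like (my assumption, for a fitted classifier clf and a single sample x):

def predict_with_chain(clf, x):
    '''Return (leaf_id, chain), where chain records every decision taken.'''
    t = clf.tree_
    node, chain = 0, []
    while t.children_left[node] != -1:       # -1 marks a leaf
        feat, thr = t.feature[node], t.threshold[node]
        went_left = x[feat] <= thr
        chain.append((node, feat, thr, 'left' if went_left else 'right'))
        node = t.children_left[node] if went_left else t.children_right[node]
    return node, chain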

+0

Partial answers that include as much of your research as possible are still acceptable. – Xufox

+0

Thanks. I haven't had the time to implement the idea. Hope someone with code shows up soon. – zaxliu

3

This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds in each column boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element of leaves, which holds the node IDs of the tree's leaves.

import numpy as np 

# get the nodes which are leaves 
leaves = clf.tree_.children_left == -1 
leaves = np.arange(0,clf.tree_.node_count)[leaves] 

# loop through each leaf and figure out the data in it 
leaf_observations = np.zeros((n,len(leaves)),dtype=bool) 
# build a simpler tree as a nested list: [split feature, split threshold, left node, right node] 
thistree = [clf.tree_.feature.tolist()] 
thistree.append(clf.tree_.threshold.tolist()) 
thistree.append(clf.tree_.children_left.tolist()) 
thistree.append(clf.tree_.children_right.tolist()) 
# get the decision rules for each leaf node & apply them 
for (ind,nod) in enumerate(leaves): 
    # get the decision rules in numeric list form 
    rules = [] 
    RevTraverseTree(thistree, nod, rules) 
    # convert & apply to the data by sequentially &ing the rules 
    thisnode = np.ones(n,dtype=bool) 
    for rule in rules: 
     if rule[1] == 1: 
      thisnode = np.logical_and(thisnode,X_train[:,rule[0]] > rule[2]) 
     else: 
      thisnode = np.logical_and(thisnode,X_train[:,rule[0]] <= rule[2]) 
    # get the observations that obey all the rules - they are the ones in this leaf node 
    leaf_observations[:,ind] = thisnode 

This requires the helper function defined here, which recursively traverses the tree starting from a specified node and builds up the decision rules:

def RevTraverseTree(tree, node, rules): 
    ''' 
    Traverase an skl decision tree from a node (presumably a leaf node) 
    up to the top, building the decision rules. The rules should be 
    input as an empty list, which will be modified in place. The result 
    is a nested list of tuples: (feature, direction (left=-1), threshold). 
    The "tree" is a nested list of simplified tree attributes: 
    [split feature, split threshold, left node, right node] 
    ''' 
    # now find the node as either a left or right child of something 
    # first try to find it as a left node 
    try: 
     prevnode = tree[2].index(node) 
     leftright = -1 
    except ValueError: 
     # failed, so find it as a right node - if this also causes an exception, something's really f'd up 
     prevnode = tree[3].index(node) 
     leftright = 1 
    # now let's get the rule that caused prevnode to -> node 
    rules.append((tree[0][prevnode],leftright,tree[1][prevnode])) 
    # if we've not yet reached the top, go up the tree one more step 
    if prevnode != 0: 
     RevTraverseTree(tree, prevnode, rules) 
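A hedged usage note (my addition, using the names from the answer above): once the loop has run, each column of leaf_observations is a boolean mask over X_train, so the observations in, say, the first leaf listed in leaves can be pulled out directly:

# rows of X_train that satisfy all the rules of the first leaf
rows_in_first_leaf = X_train[leaf_observations[:, 0]]
print(rows_in_first_leaf.shape)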

1

I think an easy option would be to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:

import numpy as np 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.datasets import load_iris 

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

# apply training data to decision tree 
leaf_indices = clf.apply(iris.data) 
lookup = {} 

# build lookup table 
for i, leaf_index in enumerate(leaf_indices): 
    try: 
     lookup[leaf_index].append(iris.data[i]) 
    except KeyError: 
     lookup[leaf_index] = [] 
     lookup[leaf_index].append(iris.data[i]) 

# test 
unknown_sample = [[4., 3.1, 6.1, 1.2]] 
index = clf.apply(unknown_sample) 
print(lookup[index[0]]) 

0

Have you tried dumping your DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]?

import numpy as np 
from sklearn.tree import DecisionTreeClassifier, export_graphviz 
from sklearn.datasets import load_iris 
from graph_tool.all import * 

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

export_graphviz(clf, out_file='tree.dot') 

#load graph with graph_tool and explore structure as you please 
g = load_graph('tree.dot') 

for v in g.vertices(): 
    for e in v.out_edges(): 
     print(e) 
    for w in v.out_neighbours(): 
     print(w) 

[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

[2] https://graph-tool.skewed.de/

+0

Can you make it as beautiful as http://scikit-learn.org/stable/_images/iris.svg? – eleanora

+0

Once you've written the output with export_graphviz, something like that can be achieved with dot -Tpng tree.dot -o tree.png. –

2

I've changed a bit what Dr. Drew posted.
The following code, given a data frame and a decision tree after being fitted, returns:

  • rules_list: a list of rules (one entry per leaf)
  • values_path: a list of entries (the class counts for the nodes crossed along each path)

    import numpy as np 
    import pandas as pd 
    from sklearn.tree import DecisionTreeClassifier 
    
    def get_rules(dtc, df): 
        rules_list = [] 
        values_path = [] 
        values = dtc.tree_.value 
    
        def RevTraverseTree(tree, node, rules, pathValues): 
         ''' 
         Traverase an skl decision tree from a node (presumably a leaf node) 
         up to the top, building the decision rules. The rules should be 
         input as an empty list, which will be modified in place. The result 
         is a nested list of tuples: (feature, direction (left=-1), threshold). 
         The "tree" is a nested list of simplified tree attributes: 
         [split feature, split threshold, left node, right node] 
         ''' 
         # now find the node as either a left or right child of something 
         # first try to find it as a left node    
    
         try: 
          prevnode = tree[2].index(node)   
          leftright = '<=' 
          pathValues.append(values[prevnode]) 
         except ValueError: 
          # failed, so find it as a right node - if this also causes an exception, something's really f'd up 
          prevnode = tree[3].index(node) 
          leftright = '>' 
          pathValues.append(values[prevnode]) 
    
         # now let's get the rule that caused prevnode to -> node 
         p1 = df.columns[tree[0][prevnode]]  
         p2 = tree[1][prevnode]  
         rules.append(str(p1) + ' ' + leftright + ' ' + str(p2)) 
    
         # if we've not yet reached the top, go up the tree one more step 
         if prevnode != 0: 
          RevTraverseTree(tree, prevnode, rules, pathValues) 
    
        # get the nodes which are leaves 
        leaves = dtc.tree_.children_left == -1 
        leaves = np.arange(0,dtc.tree_.node_count)[leaves] 
    
        # build a simpler tree as a nested list: [split feature, split threshold, left node, right node] 
        thistree = [dtc.tree_.feature.tolist()] 
        thistree.append(dtc.tree_.threshold.tolist()) 
        thistree.append(dtc.tree_.children_left.tolist()) 
        thistree.append(dtc.tree_.children_right.tolist()) 
    
        # get the decision rules for each leaf node & apply them 
        for (ind,nod) in enumerate(leaves): 
    
         # get the decision rules 
         rules = [] 
         pathValues = [] 
         RevTraverseTree(thistree, nod, rules, pathValues) 
    
         pathValues.insert(0, values[nod])  
         pathValues = list(reversed(pathValues)) 
    
         rules = list(reversed(rules)) 
    
         rules_list.append(rules) 
         values_path.append(pathValues) 
    
        return (rules_list, values_path) 
    

An example follows:

from sklearn.model_selection import train_test_split 

df = pd.read_csv('df.csv') 

X = df[df.columns[:-1]] 
y = df['classification'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

dtc = DecisionTreeClassifier(max_depth=2) 
dtc.fit(X_train, y_train) 

Fitting the decision tree generated the following tree: Decision Tree with width 2

At this point, just call the function:

get_rules(dtc, df) 

And this is what the function returns:

rules = [ 
    ['first <= 63.5', 'first <= 43.5'], 
    ['first <= 63.5', 'first > 43.5'], 
    ['first > 63.5', 'second <= 19.700000762939453'], 
    ['first > 63.5', 'second > 19.700000762939453'] 
] 

values = [ 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])], 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])] 
] 

Obviously, in values, for each path, there are also the leaf values.
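As a small follow-up (my addition, assuming the values output shown above): the class predicted at each leaf is just the argmax of the last value array on its path.

import numpy as np

# one class label per leaf, from the class counts at the end of each path
leaf_classes = [int(np.argmax(path[-1])) for path in values]
print(leaf_classes)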