clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
This all works fine for building the decision tree. But how can I explore it?
For example, how can I find which entries from X_train appear in a particular leaf?
The code below should produce a plot of the ten most important features:
import numpy as np
import matplotlib.pyplot as plt
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances
# (a single tree has no ensemble of estimators, so unlike the original
# forest example there is no standard deviation to draw as error bars)
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices][:10], color="r", align="center")
plt.xticks(range(10), indices[:10])
plt.xlim([-1, 10])
plt.show()
Taken from here and slightly modified to fit a DecisionTreeClassifier.
This doesn't exactly help you explore the tree, but it does tell you about it.
Thanks, but I'd like to see which training data fall into each leaf, for example. Right now I have to draw the decision tree, write down the rules, and write a script to filter the data with those rules. That can't be the right way to do it! – eleanora
Is your data small enough that you could run these calculations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure. That said, I imagine there is some way to get the structure of the tree out of sci-kit. Here is the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –
This isn't for a class! I have around 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to automatically extract the rules for each leaf. Is there a way? – eleanora
You need to use the predict method.
After training the tree, you feed in X values to predict their output.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
tree = clf.fit(iris.data, iris.target)
tree.predict(iris.data)
Output:
>>> tree.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
To get more detailed information about the tree structure, we can use tree_.__getstate__().
The tree structure, converted into an "ASCII art" picture:
                0
        ________|________
        1               2
                ________|________
                3               12
            ____|____       ____|____
            4       7      13       16
          __|__   __|__   __|__
          5   6   8   9  14  15
                    __|__
                   10  11
The tree structure as an array:
In [38]: tree.tree_.__getstate__()['nodes']
Out[38]:
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0),
(-1, -1, -2, -2.0, 0.0, 50, 50.0),
(3, 12, 3, 1.75, 0.5, 100, 100.0),
(4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0),
(5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0),
(-1, -1, -2, -2.0, 0.0, 47, 47.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0),
(-1, -1, -2, -2.0, 0.0, 3, 3.0),
(10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0),
(-1, -1, -2, -2.0, 0.0, 2, 2.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0),
(14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0),
(-1, -1, -2, -2.0, 0.0, 2, 2.0),
(-1, -1, -2, -2.0, 0.0, 1, 1.0),
(-1, -1, -2, -2.0, 0.0, 43, 43.0)],
dtype=[('left_child', '<i8'), ('right_child', '<i8'),
('feature', '<i8'), ('threshold', '<f8'),
('impurity', '<f8'), ('n_node_samples', '<i8'),
('weighted_n_node_samples', '<f8')])
where:
left_child is the id of the node's left child (-1 for a leaf),
right_child is the id of the node's right child (-1 for a leaf),
feature is the index of the feature used to split the node (-2 for a leaf),
threshold is the split threshold (-2.0 for a leaf),
impurity is the impurity measured at the node,
n_node_samples is the number of training samples reaching the node, and
weighted_n_node_samples is the weighted count of those samples.
Using this information, we could trivially track each sample X to the leaf where it eventually lands by following the classification rules and thresholds in a script. Additionally, n_node_samples would allow us to perform unit tests ensuring that each node gets the right number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
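As a minimal sketch of that tracking, reusing the iris tree fitted above (the helper name trace_to_leaf is illustrative, not part of scikit-learn):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
nodes = clf.tree_.__getstate__()['nodes']

def trace_to_leaf(x, nodes):
    # walk down from the root; left_child == -1 marks a leaf
    node_id = 0
    while nodes[node_id]['left_child'] != -1:
        if x[nodes[node_id]['feature']] <= nodes[node_id]['threshold']:
            node_id = nodes[node_id]['left_child']
        else:
            node_id = nodes[node_id]['right_child']
    return node_id

print(trace_to_leaf(iris.data[0], nodes))  # leaf node id for the first sample

The result can be cross-checked against clf.apply, which returns the leaf index for each sample.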
Thanks. This tells me the class, but not which leaf of the decision tree each item ends up in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora
When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that's the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre
For a given leaf, I'd like to see the training data that the decision tree would place in that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I'd like to see the subset of the data you get if you apply those rules. – eleanora
Note: this is not an answer, only a hint at possible solutions.
I recently ran into a similar problem in my project. My goal was to extract the corresponding decision chain for some specific samples. I think your problem is a subset of mine, since you only need to record the last step of the decision chain.
So far, the only viable solution seems to be writing a custom predict method in Python that tracks the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out of the box (as far as I know). To make it worse, it is a wrapper around a C implementation, which is quite hard to customize.
Customization was fine for my problem, since I was dealing with an unbalanced dataset, and the samples I cared about (the positives) were rare. So I could first filter them out using the sklearn predict and then get the decision chains with my customization.
However, this may not work for you if you have a large dataset. Because if you parse the tree and do the predictions in Python, it will run at Python speed and will not scale (easily). You may have to fall back to customizing the C implementation.
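As a rough illustration of such a custom predict (a sketch on the iris example used elsewhere in this thread, not the author's actual code), one can walk the tree_ arrays in pure Python and record each decision on the way down:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
t = clf.tree_

def predict_with_chain(x):
    # follow the splits from the root; children_left == -1 marks a leaf
    node, chain = 0, []
    while t.children_left[node] != -1:
        go_left = x[t.feature[node]] <= t.threshold[node]
        chain.append((node, t.feature[node], t.threshold[node],
                      'left' if go_left else 'right'))
        node = t.children_left[node] if go_left else t.children_right[node]
    # the predicted class is the majority class stored at the leaf
    return np.argmax(t.value[node]), node, chain

pred, leaf, chain = predict_with_chain(iris.data[50])

As noted above, this runs at Python speed, so it is only practical for a modest number of samples.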
This code will do exactly what you ask. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds, in each column, boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to one element of leaves, which holds the node IDs of the leaves.
# get the nodes which are leaves
leaves = clf.tree_.children_left == -1
leaves = np.arange(0,clf.tree_.node_count)[leaves]

# loop through each leaf and figure out the data in it
leaf_observations = np.zeros((n,len(leaves)),dtype=bool)

# build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
thistree = [clf.tree_.feature.tolist()]
thistree.append(clf.tree_.threshold.tolist())
thistree.append(clf.tree_.children_left.tolist())
thistree.append(clf.tree_.children_right.tolist())

# get the decision rules for each leaf node & apply them
for (ind,nod) in enumerate(leaves):
    # get the decision rules in numeric list form
    rules = []
    RevTraverseTree(thistree, nod, rules)
    # convert & apply to the data by sequentially &ing the rules
    thisnode = np.ones(n,dtype=bool)
    for rule in rules:
        if rule[1] == 1:
            thisnode = np.logical_and(thisnode, X_train[:,rule[0]] > rule[2])
        else:
            thisnode = np.logical_and(thisnode, X_train[:,rule[0]] <= rule[2])
    # get the observations that obey all the rules - they are the ones in this leaf node
    leaf_observations[:,ind] = thisnode
This requires the helper function defined below, which recursively traverses the tree starting from the specified node and builds up the decision rules on the way to the root.
def RevTraverseTree(tree, node, rules):
    '''
    Traverse an skl decision tree from a node (presumably a leaf node)
    up to the top, building the decision rules. The rules should be
    input as an empty list, which will be modified in place. The result
    is a nested list of tuples: (feature, direction (left=-1), threshold).
    The "tree" is a nested list of simplified tree attributes:
    [split feature, split threshold, left node, right node]
    '''
    # now find the node as either a left or right child of something
    # first try to find it as a left node
    try:
        prevnode = tree[2].index(node)
        leftright = -1
    except ValueError:
        # failed, so find it as a right node - if this also causes an exception, something's really f'd up
        prevnode = tree[3].index(node)
        leftright = 1
    # now let's get the rule that caused prevnode to -> node
    rules.append((tree[0][prevnode], leftright, tree[1][prevnode]))
    # if we've not yet reached the top, go up the tree one more step
    if prevnode != 0:
        RevTraverseTree(tree, prevnode, rules)
I think an easy option would be to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# apply training data to decision tree
leaf_indices = clf.apply(iris.data)
lookup = {}

# build lookup table
for i, leaf_index in enumerate(leaf_indices):
    try:
        lookup[leaf_index].append(iris.data[i])
    except KeyError:
        lookup[leaf_index] = []
        lookup[leaf_index].append(iris.data[i])

# test
unknown_sample = [[4., 3.1, 6.1, 1.2]]
index = clf.apply(unknown_sample)
print(lookup[index[0]])
Have you tried dumping your DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]?
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import load_iris
from graph_tool.all import *

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
export_graphviz(clf, out_file='tree.dot')

# load graph with graph_tool and explore structure as you please
g = load_graph('tree.dot')
for v in g.vertices():
    for e in v.out_edges():
        print(e)
    for w in v.out_neighbours():
        print(w)
[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
Can you make it as beautiful as this: http://scikit-learn.org/stable/_images/iris.svg? – eleanora
Once you output the .dot file with export_graphviz, something like that can be achieved with dot -Tpng tree.dot -o tree.png. –
I have changed a bit what Dr. Drew posted.
The following code, given a data frame and a decision tree after it has been fitted, returns:
rules_list: a list of the decision rules leading to each leaf
values_path: a list of the class counts encountered along each path
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def get_rules(dtc, df):
    rules_list = []
    values_path = []
    values = dtc.tree_.value

    def RevTraverseTree(tree, node, rules, pathValues):
        '''
        Traverse an skl decision tree from a node (presumably a leaf node)
        up to the top, building the decision rules. The rules should be
        input as an empty list, which will be modified in place. The result
        is a nested list of tuples: (feature, direction (left=-1), threshold).
        The "tree" is a nested list of simplified tree attributes:
        [split feature, split threshold, left node, right node]
        '''
        # now find the node as either a left or right child of something
        # first try to find it as a left node
        try:
            prevnode = tree[2].index(node)
            leftright = '<='
            pathValues.append(values[prevnode])
        except ValueError:
            # failed, so find it as a right node - if this also causes an exception, something's really f'd up
            prevnode = tree[3].index(node)
            leftright = '>'
            pathValues.append(values[prevnode])
        # now let's get the rule that caused prevnode to -> node
        p1 = df.columns[tree[0][prevnode]]
        p2 = tree[1][prevnode]
        rules.append(str(p1) + ' ' + leftright + ' ' + str(p2))
        # if we've not yet reached the top, go up the tree one more step
        if prevnode != 0:
            RevTraverseTree(tree, prevnode, rules, pathValues)

    # get the nodes which are leaves
    leaves = dtc.tree_.children_left == -1
    leaves = np.arange(0, dtc.tree_.node_count)[leaves]

    # build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
    thistree = [dtc.tree_.feature.tolist()]
    thistree.append(dtc.tree_.threshold.tolist())
    thistree.append(dtc.tree_.children_left.tolist())
    thistree.append(dtc.tree_.children_right.tolist())

    # get the decision rules for each leaf node & apply them
    for (ind, nod) in enumerate(leaves):
        # get the decision rules
        rules = []
        pathValues = []
        RevTraverseTree(thistree, nod, rules, pathValues)
        pathValues.insert(0, values[nod])
        pathValues = list(reversed(pathValues))
        rules = list(reversed(rules))
        rules_list.append(rules)
        values_path.append(pathValues)

    return (rules_list, values_path)
An example follows:
from sklearn.model_selection import train_test_split

df = pd.read_csv('df.csv')
X = df[df.columns[:-1]]
y = df['classification']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

dtc = DecisionTreeClassifier(max_depth=2)
dtc.fit(X_train, y_train)
Fitting the decision tree generated the following tree: Decision Tree with width 2
At this point, just call the function:
get_rules(dtc, df)
This is what the function returns:
rules = [
['first <= 63.5', 'first <= 43.5'],
['first <= 63.5', 'first > 43.5'],
['first > 63.5', 'second <= 19.700000762939453'],
['first > 63.5', 'second > 19.700000762939453']
]
values = [
[array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])],
[array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])],
[array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])],
[array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])]
]
Obviously, for each path, the values list also contains the values at the leaf itself.
I ran into a similar problem. You may find my answer here (and the walkthrough mentioned there) helpful: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468. It uses the method decision_path from the 0.18 release. Substitute X_train for X_test in a few spots if you are interested in seeing the training samples. – Kevin
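For reference, a minimal sketch of the decision_path approach mentioned in that comment (on the iris example from earlier answers; decision_path and apply are the actual scikit-learn methods, the surrounding code is illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# sparse (n_samples, n_nodes) indicator matrix: entry (i, j) is nonzero
# if sample i passes through node j
node_indicator = clf.decision_path(iris.data)
leaf_ids = clf.apply(iris.data)  # the leaf each training sample lands in

# nodes visited by the first training sample, ending at its leaf
sample_id = 0
visited = node_indicator.indices[
    node_indicator.indptr[sample_id]:node_indicator.indptr[sample_id + 1]]
print(visited, leaf_ids[sample_id])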