Decision tree with Spark

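I am reading the decision tree classification section of the Spark documentation: http://spark.apache.org/docs/latest/mllib-decision-tree.html

I ran the provided example code on my laptop and tried to understand its output, but there is one point I cannot work out. The code is below, and sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt

Please look at the output and let me know whether my reading of it is correct. Here is how I understand it.

Does the Test Error mean the model is about 95% correct, based on the training data?

(The one I am most curious about) If feature 434 is greater than 0.0, will the prediction be 1, based on Gini impurity? For example, if a row contains the entry 434:178, would it be classified as 1?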

from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
    # Load the LIBSVM file as an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
    # Hold out 30% of the rows for testing.
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # An empty categoricalFeaturesInfo treats every feature as continuous.
    model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

    # Evaluate on the held-out test rows only.
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # Index the pair instead of unpacking it in the lambda signature, which Python 3 rejects.
    testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())

    print('Test Error = ' + str(testErr))
    print('Learned classification tree model:')
    print(model.toDebugString())

// =====Below is my output===== 
Test Error = 0.0454545454545 
Learned classification tree model: 
DecisionTreeModel classifier of depth 1 with 3 nodes 
If (feature 434 <= 0.0) 
    Predict: 0.0 
Else (feature 434 > 0.0) 
    Predict: 1.0
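
The printed tree reduces to a single threshold test, which can be sketched in plain Python. This is only an illustration, not Spark's own code: parse_libsvm_line and predict are hypothetical helper names, and the two sample rows are made up. One assumption worth checking against your Spark version: LIBSVM files use one-based feature indices, and MLUtils.loadLibSVMFile is documented as converting them to zero-based, so "feature 434" in the tree would correspond to an entry written 435:... in the file rather than 434:...

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        # Assumption: indices are one-based in the file, zero-based in the model.
        features[int(idx) - 1] = float(val)
    return label, features


def predict(features, split_feature=434, threshold=0.0):
    """Apply the depth-1 rule from the output above (hypothetical helper)."""
    # Entries missing from a sparse row count as 0.0.
    value = features.get(split_feature, 0.0)
    return 0.0 if value <= threshold else 1.0


# Made-up rows, not taken from sample_libsvm_data.txt.
_, row_a = parse_libsvm_line("1 435:178 436:20")
_, row_b = parse_libsvm_line("0 100:55 434:178")

print(predict(row_a))  # 1.0: zero-based feature 434 is 178.0, which is > 0.0
print(predict(row_b))  # 0.0: zero-based feature 434 is absent, so it counts as 0.0
```

Note that Gini impurity is only used during training to choose the split; at prediction time the row is simply routed by the threshold comparison.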


2016-03-21 Jin Park

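I believe you are right. Yes, your error rate is about 5%, so the model is correct on about 95% of the 30% of the data that you held out for testing (not of the training data). According to your output (I am assuming it is correct; I have not run the code myself), yes, the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0.0 the prediction is 0, otherwise it is 1.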
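
The relation between the test error and the accuracy can be sketched as follows. The pairs below are made up: the source does not say how many rows ended up in the test split, and 22 rows with one mistake is just one combination that reproduces the printed 0.0454545. error_rate is a hypothetical helper that mirrors the filter/count expression in the question.

```python
def error_rate(labels_and_predictions):
    """Fraction of (label, prediction) pairs that disagree."""
    wrong = sum(1 for label, pred in labels_and_predictions if label != pred)
    return wrong / float(len(labels_and_predictions))


# Hypothetical held-out set: 21 correct predictions and 1 mistake.
pairs = [(1.0, 1.0)] * 12 + [(0.0, 0.0)] * 9 + [(1.0, 0.0)]

err = error_rate(pairs)
accuracy = 1.0 - err

print(round(err, 4))       # 0.0455
print(round(accuracy, 4))  # 0.9545
```

Because randomSplit is random, the size of the test set and therefore the exact error will vary from run to run unless a seed is passed.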


2016-03-21 12:48:07

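Why, in Spark ML, are minInfoGain or a minimum number of instances per node not used to control tree growth when training a decision tree model? It is very easy to overgrow the tree.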
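
To make the two stopping criteria mentioned above concrete, here is a minimal plain-Python sketch of how a minimum information gain and a minimum instance count per node could veto a candidate split. It is not Spark's implementation; gini, info_gain and should_split are hypothetical helpers. As far as I can tell, DecisionTree.trainClassifier in pyspark.mllib.tree does list minInstancesPerNode and minInfoGain keyword arguments, so it is worth checking the API documentation for your release.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / float(n)) ** 2 for c in Counter(labels).values())


def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = float(len(parent))
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def should_split(parent, left, right, min_info_gain=0.0, min_instances_per_node=1):
    """Accept a split only if both children are big enough and the gain suffices."""
    if min(len(left), len(right)) < min_instances_per_node:
        return False
    return info_gain(parent, left, right) >= min_info_gain


parent = [0, 0, 0, 1, 1, 1]
print(info_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5, a perfect split
print(should_split(parent, [0], [0, 0, 1, 1, 1], min_instances_per_node=2))  # False
```

Raising either threshold makes the tree stop earlier, which is the usual way to limit overgrowth alongside maxDepth.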


2016-04-15 01:53:51 Jimmy
