How do I run this decision tree with MLlib?


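I come from running ML algorithms with scikit-learn, so MLlib is fairly new to me. That said, I did use Cloudera's GitHub in a recent presentation, and I am left with one question: how do I run this decision tree with MLlib?

Say I am using a decision tree for binary classification. I want to predict whether an object is an apple or an orange. The two values that go into the features are a list [x (float), y (binary)]. x is the weight of the object, and y is 0 or 1 (smooth or bumpy).

Then I have a list of labels that is also binary (0 = apple, 1 = orange). With scikit-learn, I would store them like this: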

features_list = [[140, 0], [150, 0], [160, 1], [170, 1]] 
labels = [0, 0, 1, 1]

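Here, each label 0 or 1 corresponds to one item of features_list. So the first 0 is the label for the features [140, 0], and so on.

Now, when I train my model, my code looks like this: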

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_list, labels)

When I make a prediction, the code looks like this:

print(clf.predict([[180, 1]]))  # predict expects a 2D array: one row per sample

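Looking at the MLlib documentation, the parameters appear to be "labelCol" and "featuresCol". I tried passing my feature list and my labels to those parameters, and it raised an error.

My question is: is there any way to run an ML algorithm with MLlib using these two lists, the way I do with scikit-learn? Any help would be great!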

Source

2017-07-30 rmahesh

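Have you seen this: https://spark.apache.org/docs/latest/mllib-decision-tree.html#examples

@VivekKumar I have seen it, but I am confused about whether I can pass list names instead of column names as the features/labels. – rmahesh

Why not try it and find out?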

You should use ML (i.e. the DataFrame-based API) rather than MLlib, since the latter is headed for deprecation.

spark.version 
# u'2.2.0' 

from pyspark.ml.linalg import Vectors 
from pyspark.ml.classification import DecisionTreeClassifier 

features_list = [[140, 0], [150, 0], [160, 1], [170, 1]] 
labels = [0, 0, 1, 1] 

dd = [(labels[i], Vectors.dense(features_list[i])) for i in range(len(labels))] 
dd 
# [(0, DenseVector([140.0, 0.0])), 
# (0, DenseVector([150.0, 0.0])), 
# (1, DenseVector([160.0, 1.0])), 
# (1, DenseVector([170.0, 1.0]))] 

df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"]) 

dt = DecisionTreeClassifier(maxDepth=2, labelCol="label") 
model = dt.fit(df) 

# predict on the training set 
model.transform(df).show() # 'transform' instead of 'predict' in Spark ML 
# +-----+-----------+-------------+-----------+----------+
# |label|   features|rawPrediction|probability|prediction|
# +-----+-----------+-------------+-----------+----------+
# |    0|[140.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|
# |    0|[150.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|
# |    1|[160.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|
# |    1|[170.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|
# +-----+-----------+-------------+-----------+----------+

# predict on a test set: 
test = spark.createDataFrame([(Vectors.dense(180, 1),)], ["features"]) 
model.transform(test).show() 
# +-----------+-------------+-----------+----------+
# |   features|rawPrediction|probability|prediction|
# +-----------+-------------+-----------+----------+
# |[180.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|
# +-----------+-------------+-----------+----------+
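
To get closer to the scikit-learn habit of passing two plain lists, the steps above could be wrapped in a small helper. This is only a sketch: lists_to_rows and fit_tree_from_lists are hypothetical names made up for illustration (they are not part of Spark), and the second function assumes an already initialized SparkSession is passed in as spark.

```python
def lists_to_rows(features_list, labels):
    """Pair each label with its feature list as (label, features) tuples."""
    if len(features_list) != len(labels):
        raise ValueError("features_list and labels must have the same length")
    return [(int(label), [float(x) for x in feats])
            for label, feats in zip(labels, features_list)]


def fit_tree_from_lists(spark, features_list, labels, max_depth=2):
    """Hypothetical helper: build a DataFrame from two lists and fit a tree.

    Assumes `spark` is an existing SparkSession.
    """
    # pyspark is imported here so that lists_to_rows works without Spark installed
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import DecisionTreeClassifier

    # Spark ML wants one vector column for the features, not a Python list
    rows = [(label, Vectors.dense(feats))
            for label, feats in lists_to_rows(features_list, labels)]
    df = spark.createDataFrame(rows, schema=["label", "features"])

    dt = DecisionTreeClassifier(maxDepth=max_depth,
                                labelCol="label", featuresCol="features")
    return dt.fit(df)
```

With a running session, model = fit_tree_from_lists(spark, features_list, labels) followed by model.transform(test).select("prediction").first()[0] should return the single predicted value, which is about as close to clf.predict as the DataFrame API gets.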

EDIT: here is how to initialize Spark:

from pyspark import SparkContext, SparkConf 
from pyspark.sql import SparkSession 
conf = SparkConf() 
sc = SparkContext(conf=conf) 
spark = SparkSession.builder.config(conf=conf).getOrCreate()

Source

2017-08-01 19:54:15 desertnaut

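Thanks. I get an error saying "NameError: name 'spark' is not defined". I am very new to Spark; how do I define this? – rmahesh

@rmahesh Please see the edit for a minimal initialization example. Let me know if you get further errors; I work with these things in the Databricks cloud, where Spark is already initialized, so I may have forgotten something... – desertnaut

Thanks for the edit! I believe it almost completely solves my problem. However, I get this error at the 'df = spark.createDataFrame' statement: AttributeError: 'Builder' object has no attribute 'createDataFrame'. Is there something I should have initialized beforehand? – rmahesh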
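
A likely cause of that last error (an assumption, since the comment does not show the code that was run) is that spark ended up bound to the builder itself, for example spark = SparkSession.builder.config(conf=conf) without the final .getOrCreate(). The builder only configures a session; createDataFrame lives on the SparkSession that getOrCreate() returns. The guard below is a sketch of that distinction; ensure_session is a hypothetical helper name, not a Spark function, and it relies only on the two method names mentioned here.

```python
def ensure_session(obj):
    """Return an object that has createDataFrame.

    If `obj` is already a session, return it unchanged; if it is a
    builder (has getOrCreate but no createDataFrame), finish building it.
    """
    if hasattr(obj, "createDataFrame"):
        return obj                    # already a SparkSession
    if hasattr(obj, "getOrCreate"):
        return obj.getOrCreate()      # a Builder: the call that was missing
    raise TypeError("expected a SparkSession or a SparkSession builder")
```

In practice the simpler fix is to make sure the initialization line ends in .getOrCreate(), as in the edit above.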
