PySpark random forest feature importances: how to map from feature numbers back to column names. I am using a standard (StringIndexer + OneHotEncoder + RandomForest) Spark ML pipeline, as shown below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, IndexToString
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator

# Index the label column
labelIndexer = StringIndexer(inputCol=class_label_name, outputCol="indexedLabel").fit(data)

# Index each categorical column, then one-hot encode the indexed values
string_feature_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
    for x in char_col_toUse_names
]
onehot_encoder = [
    OneHotEncoder(inputCol="int_" + x, outputCol="onehot_{0}".format(x))
    for x in char_col_toUse_names
]

# Assemble numeric, boolean and one-hot columns into a single feature vector
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_" + x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=all_columns, outputCol="features")

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)
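The paramGrid and evaluator passed to CrossValidator are not defined in the snippet above; a minimal sketch of what they might look like (the grid values and metric here are illustrative assumptions, not the originals):

from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Illustrative only: tune the forest depth and score the folds with F1
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [5, 10])
             .build())
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="f1")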
Now, after fitting, I can use cvModel.bestModel.stages[-2].featureImportances to get the random forest feature importances, but this does not give me the feature/column names, only the feature numbers.
This is what I get:
print(cvModel.bestModel.stages[-2].featureImportances)
(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.0815822887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388896956,0.0924779490403,0.163835022713,0.118987129392,0.107373548367,3.35577640585e-05,0.000229569946193])
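The printed value is a SparseVector of length 1446: the first list holds the feature indices with non-zero importance, the second the corresponding importance values. A small sketch (variable names illustrative) to view it as (index, importance) pairs sorted by importance:

fi = cvModel.bestModel.stages[-2].featureImportances  # SparseVector
pairs = sorted(zip(fi.indices, fi.values), key=lambda p: -p[1])
for idx, importance in pairs:
    print(idx, importance)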
How can I map these back to column names, or to a column name + importance format? Basically, I want to get the random forest feature importances together with the column names.
Yes, but you are missing the point that the column names change after the StringIndexer/OneHotEncoder stages. It is the columns combined by the assembler that I want to map back to. I can certainly do it by hand, but I am more interested in whether Spark (ML) has some shorter way, like scikit-learn does :) – Abhishek
Ah okay, my bad. Still, your longer approach works. I don't think a shorter solution exists at the moment; the Spark ML API is not as rich as scikit-learn's. –
Yes, I know :), just want to keep this question open for suggestions :). Thanks, Dat – Abhishek
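For reference, the "long way" mentioned in the comments can be done by reading the ML attribute metadata that VectorAssembler attaches to the features column; it already contains the expanded names produced by the StringIndexer/OneHotEncoder stages. A sketch, assuming a DataFrame transformed with the fitted best pipeline (variable names here are illustrative):

transformed = cvModel.bestModel.transform(trainingData)

# VectorAssembler records each vector slot's name in the column metadata under "ml_attr"
attrs = transformed.schema["features"].metadata["ml_attr"]["attrs"]
name_by_index = {}
for attr_group in attrs.values():   # groups such as "numeric", "binary", "nominal"
    for attr in attr_group:
        name_by_index[attr["idx"]] = attr["name"]

fi = cvModel.bestModel.stages[-2].featureImportances
ranked = sorted(zip(fi.indices, fi.values), key=lambda p: -p[1])
for idx, importance in ranked:
    print(name_by_index.get(int(idx), "feature_{0}".format(idx)), importance)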