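I am trying to train a random forest classifier and evaluate the model with cross-validation. I am working in PySpark, and the input CSV file is loaded as a Spark DataFrame. However, I run into a problem when building the model: pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'
Here is the code.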
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
sc = SparkContext()
sqlContext = SQLContext(sc)
trainingData = (sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/PATH/CSVFile"))
numFolds = 10
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="V5409", featuresCol="features", seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy")
# An empty grid yields a single parameter map, so no hyperparameters are tuned.
paramGrid = ParamGridBuilder().build()
pipeline = Pipeline(stages=[rf])
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
model = crossval.fit(trainingData)
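# Accuracy of the best model on the training data (not the cross-validated estimate).
accuracy = evaluator.evaluate(model.transform(trainingData))
print(accuracy)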
I get the following error:
Traceback (most recent call last):
File "SparkDF.py", line 41, in <module>
model = crossval.fit(trainingData)
File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark-2.1.1/python/pyspark/ml/tuning.py", line 236, in _fit
model = est.fit(train, epm[j])
File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark-2.1.1/python/pyspark/ml/pipeline.py", line 108, in _fit
model = stage.fit(dataset)
File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark-2.1.1/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'
Please help me solve this problem in PySpark. Thank you.
Here are the details of the dataset. No, I do not have a dedicated features column. Below is the output of trainingData.take(5), which shows the first 5 rows of the dataset.
[Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0),
Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V5409=0),
Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869 V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0),
Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)]
Here V433 to V524 are the features, and V5409 is the class label.
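We need an idea of your data - please update the post with the output of 'trainingData.show()'. Is there a column named 'features' in your csv file? – desertnaut
No, there is no features column. My data only has the attribute names. –
I have updated my question. Thanks –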
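The error message and the exchange above point at the same thing: the classifier is told to read featuresCol="features", but the DataFrame only has the individual V... columns. A minimal sketch of one way to bridge that gap, assuming every column other than V5409 is a numeric feature, is to assemble them into a single vector column with VectorAssembler before the classifier. The helper names feature_columns and build_pipeline are hypothetical and only used for illustration; the classifier settings are copied from the question.

```python
LABEL_COL = "V5409"  # class label column named in the question


def feature_columns(all_columns, label_col=LABEL_COL):
    """Return every column name except the label (assumed to be the features)."""
    return [c for c in all_columns if c != label_col]


def build_pipeline(all_columns, label_col=LABEL_COL):
    """Build a Pipeline that first creates the 'features' vector column."""
    # Imported here so feature_columns() can be used without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    # Combine the individual numeric columns into one vector column.
    assembler = VectorAssembler(
        inputCols=feature_columns(all_columns, label_col),
        outputCol="features")
    # Same settings as in the question; the classifier reads the new column.
    rf = RandomForestClassifier(
        numTrees=100, maxDepth=5, maxBins=5,
        labelCol=label_col, featuresCol="features", seed=42)
    return Pipeline(stages=[assembler, rf])
```

With this in place, the pipeline passed to CrossValidator would be build_pipeline(trainingData.columns) instead of Pipeline(stages=[rf]). Keeping the assembler inside the pipeline means it is applied to each training and validation fold in the same way.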