-2

I am trying to run a Random Forest classifier with cross-validation and evaluate the model. I am working in pySpark. The input CSV file is loaded as a Spark DataFrame. But I ran into a problem while building the model: pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'

Here is the code:

from pyspark import SparkContext 
from pyspark.sql import SQLContext 
from pyspark.ml import Pipeline 
from pyspark.ml.classification import RandomForestClassifier 
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator 
from pyspark.mllib.evaluation import BinaryClassificationMetrics 
sc = SparkContext() 
sqlContext = SQLContext(sc) 
trainingData = (sqlContext.read 
     .format("com.databricks.spark.csv") 
     .option("header", "true") 
     .option("inferSchema", "true") 
     .load("/PATH/CSVFile")) 
numFolds = 10 
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="V5409",featuresCol="features",seed=42) 
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy") 
paramGrid = ParamGridBuilder().build() 

pipeline = Pipeline(stages=[rf]) 
crossval = CrossValidator(
    estimator=pipeline, 
    estimatorParamMaps=paramGrid, 
    evaluator=evaluator, 
    numFolds=numFolds) 
model = crossval.fit(trainingData) 
print accuracy 

I get the following error:

Traceback (most recent call last): 
    File "SparkDF.py", line 41, in <module> 
    model = crossval.fit(trainingData) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit 
    return self._fit(dataset) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/tuning.py", line 236, in _fit 
    model = est.fit(train, epm[j]) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit 
    return self._fit(dataset) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/pipeline.py", line 108, in _fit 
    model = stage.fit(dataset) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit 
    return self._fit(dataset) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 236, in _fit 
    java_model = self._fit_java(dataset) 
    File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 233, in _fit_java 
    return self._java_obj.fit(dataset._jdf) 
    File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__ 
    answer, self.gateway_client, self.target_id, self.name) 
    File "/usr/local/spark-2.1.1/python/pyspark/sql/utils.py", line 79, in deco 
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) 
pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.' 

Please help me solve this problem in pySpark. Thank you.

I am showing the dataset details here. No, I do not have a dedicated 'features' column. Below is the output of trainingData.take(5), which shows the first 5 rows of the dataset:

[Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)]

Here V4366 through V524 are the features (numeric), and V5409 is the class label.

+3

We need an idea of your data; please update the post with the output of 'trainingData.show()'. Is there a column named 'features' in your csv file? – desertnaut

+0

No, a 'features' column does not exist. I have attribute names in the data. –

+0

I have updated my question. Thanks –

Answer

0

Spark dataframes are not used like that in Spark ML; all your features need to be vectors in a single column, usually named features. Here is how you can do it using the 5 rows you have provided as an example:

spark.version 
# u'2.2.0' 

from pyspark.sql import Row 
from pyspark.ml.linalg import Vectors 

# your sample data: 
temp_df = spark.createDataFrame([
    Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0),
    Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0),
    Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0),
    Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0),
    Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)])

trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"]) 
trainingData.show() 
# +--------------------+-----+ 
# |            features|label| 
# +--------------------+-----+ 
# |[-0.104,0.005,-0....|    0| 
# |[-0.137,0.001,-0....|    0| 
# |[-0.155,-0.006,-0...|    0| 
# |[-0.108,0.005,-0....|    0| 
# |[-0.139,0.003,-0....|    0| 
# +--------------------+-----+ 
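
Equivalently, a VectorAssembler-based sketch (column names assumed from the sample above) builds the same features column without the RDD round-trip:

from pyspark.ml.feature import VectorAssembler

# every column except the label V5409 goes into the 'features' vector
feature_cols = [c for c in temp_df.columns if c != "V5409"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
trainingData = (assembler.transform(temp_df)
                .withColumnRenamed("V5409", "label")
                .select("features", "label"))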

After this, your pipeline should run fine (I assume you do indeed have multi-class classification, since your sample contains only 0 as the label), with only the label column changed in your rf and evaluator as follows:

rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="label",featuresCol="features",seed=42) 
evaluator = MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy") 

Finally, print accuracy will not work; you need model.avgMetrics instead.
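
As a rough sketch, using the variable names from the code above, fitting and reporting the cross-validated accuracy would look like this:

model = crossval.fit(trainingData)
# avgMetrics averages the evaluator metric (here accuracy) over the
# numFolds folds, one entry per parameter map in paramGrid:
print(model.avgMetrics)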

+0

When I call spark.createDataFrame() as above, it shows NameError: name 'spark' is not defined. How can I solve this? Thank you for your answer; it is very useful. –

+0

@LokeswariVenkataramana Probably you are using an older version of Spark (1.x). You don't need that command; just read your initial csv file, as you already do in your code, into a dataframe named temp_df, and then proceed with my instructions to define trainingData. – desertnaut
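
In other words, something like the following minimal sketch, which assumes (as in the sample rows) that the label V5409 is the last column of the CSV:

temp_df = (sqlContext.read
     .format("com.databricks.spark.csv")
     .option("header", "true")
     .option("inferSchema", "true")
     .load("/PATH/CSVFile"))
# V5409 (the label) is assumed to be the last column of the file
trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])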

+1

Adding the code below solved the NameError: name 'spark' is not defined. from pyspark.context import SparkContext; from pyspark.sql.session import SparkSession; sc = SparkContext('local'); spark = SparkSession(sc) –