Error when creating a DataFrame from an RDD
I have the following code, from which I want to create a DataFrame from a `PipelinedRDD`:
print type(simulation)
sqlContext.createDataFrame(simulation)
The print statement prints this:
<class 'pyspark.rdd.PipelinedRDD'>
However, on the next line I get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
The error has this trace:
---> 13 sqlContext.createDataFrame(simulation)
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
308 """
309 if schema is None or isinstance(schema, (list, tuple)):
--> 310 struct = self._inferSchema(rdd, samplingRatio)
I get 'NameError: global name "StructType" is not defined'. Do I need to import a library? – octavian
Yes. You need this: from pyspark.sql.types import StructType, StructField, StringType, IntegerType – Sorin
Have you tried specifying only samplingRatio? – Sorin
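To make the suggestion in the comments concrete, here is a minimal sketch of passing an explicit schema so that createDataFrame does not go through the _inferSchema branch shown in the trace. The column names ("name", "value"), their types, and the helper names are assumptions for illustration only, since the question does not show what simulation contains. The row check is plain Python and is only a rough way to spot records that might not match the schema.

```python
def build_schema():
    # Imported lazily so this sketch loads even where PySpark is not installed.
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    # Hypothetical columns: replace with the real fields of your records.
    return StructType([
        StructField("name", StringType(), True),
        StructField("value", IntegerType(), True),
    ])


def find_mismatched_rows(rows, expected_types):
    """Return the indices of rows whose length or field types differ
    from expected_types. None is accepted for any field."""
    bad = []
    for i, row in enumerate(rows):
        if len(row) != len(expected_types):
            bad.append(i)
            continue
        for value, expected in zip(row, expected_types):
            if value is not None and not isinstance(value, expected):
                bad.append(i)
                break
    return bad


def to_dataframe(sqlContext, rdd):
    # With a StructType passed as schema, the "schema is None or list/tuple"
    # branch from the trace (and therefore _inferSchema) is not taken.
    return sqlContext.createDataFrame(rdd, build_schema())
```

A possible way to use it: run find_mismatched_rows(simulation.take(100), (str, int)) on a small sample first, then call to_dataframe(sqlContext, simulation). If you would rather keep inference, sqlContext.createDataFrame(simulation, samplingRatio=0.2) is the other option mentioned above; the 0.2 is an arbitrary value.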