Schema definition for a Spark RDD

I am new to Python Spark. Below I have a Spark DataFrame and a JSON object.

df = sqlContext.read.load("result.json", format="json")

JSON object:

df.collect() 

[Row(Dorothy=[u'CA', u'F', u'1910', u'220'], Frances=[u'CA', u'F', u'1910', u'134'], Helen=[u'CA', u'F', u'1910', u'239'], Margaret=[u'CA', u'F', u'1910', u'163'], Mary=[u'CA', u'F', u'1910', u'295'])]

When I try to assign field names to the values above:

df.select(Row("Name" =["State","Gender","Year","Count"])).write.save("result.json",format = 'json')

I get the error below. Can you help me with how to define the schema of the DataFrame?

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist

Source

2016-05-16 ytasfeb15

After loading the JSON file as above you already have a schema, which you can inspect with df.printSchema(), so you don't need to use the Row class.

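So, provided the schema has columns with these names, you should be able to do something like

df.select(df['State'], df['Gender'], df['Year'], df['Count'])

or

df.select('State', 'Gender', 'Year', 'Count')

When you use the Row class, you pass it key-value pairs as named arguments, e.g.

rows = [Row(name='John', age=10)]

which is used to build a DataFrame from a list of rows, e.g.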

df = sqlContext.createDataFrame(rows)
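
Note that in the collect() output shown in the question, the column names are the people's names (Dorothy, Frances, ...) and each value is an array, so columns called State, Gender, Year and Count would have to be created first. Here is a minimal sketch of that reshaping in plain Python. It assumes every array holds exactly four entries in the order state, gender, year, count. The names `FIELD_NAMES` and `to_records` are hypothetical, not part of Spark.

```python
# Hypothetical column names, taken from the field names the question wants.
FIELD_NAMES = ["Name", "State", "Gender", "Year", "Count"]


def to_records(row_dict):
    """Turn {name: [state, gender, year, count]} into flat tuples."""
    records = []
    for name, values in sorted(row_dict.items()):
        # Assumption: each array has exactly these four entries, in this order.
        state, gender, year, count = values
        # Year and Count arrive as strings in the JSON, so cast them to int.
        records.append((name, state, gender, int(year), int(count)))
    return records


# The single row from the question, written out as a plain dict
# (roughly what df.collect()[0].asDict() would give).
row_dict = {
    "Dorothy": ["CA", "F", "1910", "220"],
    "Frances": ["CA", "F", "1910", "134"],
    "Helen": ["CA", "F", "1910", "239"],
    "Margaret": ["CA", "F", "1910", "163"],
    "Mary": ["CA", "F", "1910", "295"],
}

records = to_records(row_dict)
for record in records:
    print(record)
```

The flat tuples could then be passed to sqlContext.createDataFrame(records, FIELD_NAMES), which accepts a list of column names as its second argument. After that, a select on 'State', 'Gender', 'Year' and 'Count' has real columns to work with.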

Source

2016-05-16 02:36:33
