How do I load part of a JSON file into a DataFrame?
I have a file with content like this:
a {"field1":{"field2":"val","field3":"val"...}}
b {"field1":{"field2":"val","field3":"val"...}}
...
I am able to load the file into a table like this:
╔════╦═══════════════════════════════════════════════╗
║ ID ║ JSON                                          ║
╠════╬═══════════════════════════════════════════════╣
║ a  ║ {"field1":{"field2":"val","field3":"val"...}} ║
║ b  ║ {"field1":{"field2":"val","field3":"val"...}} ║
╚════╩═══════════════════════════════════════════════╝
How can I turn it into this?
╔════╦════════╦════════╦═════╦═════╗
║ ID ║ field2 ║ field3 ║ ... ║ ... ║
╠════╬════════╬════════╬═════╬═════╣
║ a  ║ val    ║ val    ║ ... ║ ... ║
║ b  ║ val    ║ val    ║ ... ║ ... ║
╚════╩════════╩════════╩═════╩═════╝
Since each line is only partially JSON, I can't simply use read.json.
I saw this post, convert lines of json in RDD to dataframe in apache Spark, but my JSON string is a nested JSON that is very long, so I don't want to list out all the fields manually. I also tried
#solr_data is the data frame made from the file, and json is the column with the json string, session is a SparkSession
json_table = solr_data.select(solr_data["json"]).rdd.map(lambda x:session.read.json(x))
but it doesn't work well: I can't call show() or collect() on the result, and createDataFrame() doesn't work for this either.
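For what it's worth, one common approach is to split each line into the ID and the JSON substring, parse the JSON, and flatten the nested object into one record per line; in Spark, the extracted JSON strings could then be fed to spark.read.json on an RDD of strings (rather than calling read.json inside a map, which fails because the SparkSession isn't available on executors). Below is a minimal pure-Python sketch of just the split-and-flatten step, using made-up sample values (lines, parse_line, and the "v2a"-style values are all hypothetical):

```python
import json

# Hypothetical sample lines in the same "ID <json>" shape as the file above.
lines = [
    'a {"field1":{"field2":"v2a","field3":"v3a"}}',
    'b {"field1":{"field2":"v2b","field3":"v3b"}}',
]

def parse_line(line):
    # Split off the ID; everything after the first space is the JSON string.
    row_id, json_str = line.split(" ", 1)
    record = json.loads(json_str)
    # Flatten the single level of nesting so each inner field becomes a column.
    flat = {"ID": row_id}
    for inner in record.values():  # e.g. the "field1" object
        flat.update(inner)
    return flat

rows = [parse_line(l) for l in lines]
# rows[0] == {"ID": "a", "field2": "v2a", "field3": "v3a"}
```

In PySpark, the same idea would be roughly `spark.read.json(solr_data.rdd.map(lambda r: r["json"]))`, which lets Spark infer the nested schema without listing every field.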
What exactly does the content look like? Can you post a sample? What are the "long strings", and how do they relate to the vals? – philantrovert