I want to read and transform a CSV file that has both JSON and non-JSON columns. I managed to read the file and put it into a dataframe. The schema looks like this:
root
|-- 'id': string (nullable = true)
|-- 'score': string (nullable = true)
If I do df.take(2), I get these results:
[Row('id'=u"'AF03DCAB-EE3F-493A-ACD9-4B98F548E6F3'", 'score'=u"{'topSpeed':15.00000,'averageSpeed':5.00000,'harshBraking':0,'harshAcceleration':0,'driverRating':null,'idlingScore':70,'speedingScore':70,'brakingScore':70,'accelerationScore':70,'totalEcoScore':70 }"), Row('id'=u"'1938A2B9-5EF2-413C-A7A3-C5F324FD4089'", 'score'=u"{'topSpeed':106.00000,'averageSpeed':71.00000,'harshBraking':0,'harshAcceleration':0,'driverRating':9,'idlingScore':76,'speedingScore':87,'brakingScore':86,'accelerationScore':82,'totalEcoScore':83 }")]
The id column is a "normal" column, and the score column contains data in JSON format. I want to break the JSON content out into separate columns, but I also need the id column alongside the rest of the data. A piece of code that works, but only for the score column:
df = rawdata.select("'score'")
df1 = df.rdd # Convert to rdd
df2 = df1.flatMap(lambda x: x) # Flatten rows
dfJsonScore = sqlContext.read.json(df2)
dfJsonScore.printSchema()
dfJsonScore.take(3)
This gives me:
root
|-- accelerationScore: long (nullable = true)
|-- averageSpeed: double (nullable = true)
|-- brakingScore: long (nullable = true)
|-- driverRating: long (nullable = true)
|-- harshAcceleration: long (nullable = true)
|-- harshBraking: long (nullable = true)
|-- idlingScore: long (nullable = true)
|-- speedingScore: long (nullable = true)
|-- topSpeed: double (nullable = true)
|-- totalEcoScore: long (nullable = true)
[Row(accelerationScore=70, averageSpeed=5.0, brakingScore=70, driverRating=None, harshAcceleration=0, harshBraking=0, idlingScore=70, speedingScore=70, topSpeed=15.0, totalEcoScore=70),
Row(accelerationScore=82, averageSpeed=71.0, brakingScore=86, driverRating=9, harshAcceleration=0, harshBraking=0, idlingScore=76, speedingScore=87, topSpeed=106.0, totalEcoScore=83),
Row(accelerationScore=81, averageSpeed=74.0, brakingScore=85, driverRating=9, harshAcceleration=0, harshBraking=0, idlingScore=75, speedingScore=87, topSpeed=102.0, totalEcoScore=82)]
But I cannot get it to work in combination with the id column.
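One way to keep id together with the parsed score fields, sketched for pyspark 2.0 (where from_json is not yet available): map over the rows, parse the score string with Python's json module, and merge the result with the id value. The parsing helper below is plain Python; the quote-swapping trick assumes, as in the sample above, that the score payloads use single quotes and contain no apostrophes inside values.

```python
import json

def parse_score(row):
    """Merge the JSON 'score' payload with the plain 'id' column.

    `row` is a plain dict, e.g. the result of Row.asDict().
    """
    # The sample score strings use single quotes; strict JSON requires
    # double quotes, so swap them. This is safe only because the values
    # contain no apostrophes -- an assumption about this data.
    fields = json.loads(row['score'].replace("'", '"'))
    fields['id'] = row['id']
    return fields

# Applied to the DataFrame (hypothetical usage, assuming the columns
# are actually named id and score after cleaning up the quoting):
# combined = rawdata.rdd.map(lambda r: parse_score(r.asDict()))
# dfCombined = sqlContext.createDataFrame(combined)
```

Note that JSON's null comes back as Python's None, so the nullable driverRating field survives the round trip. On Spark 2.1+ this whole dance can be replaced by pyspark.sql.functions.from_json with an explicit schema, which keeps everything inside the DataFrame API.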
Thank you for your answer. Unfortunately we are still on pyspark 2.0, so I will have to look for an alternative solution. – Chantal
Please see the updated answer. – Mariusz
Thanks for the update. This works very well for the case I'm working on. – Chantal