-1

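How can I convert an RDD of JSON into a DataFrame?

I have an RDD that was created from some JSON; each record in the RDD contains key/value pairs. My RDD looks like this: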

myRdd.foreach(println) 
       {"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1], 
       {"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1], 
       {"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1], 
       {"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1], 
       {"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1], 
       {"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}] 

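I want to convert each record into a row of a Spark DataFrame. The nested fields in trackingInfo should each get their own column, and the type list should be a column of its own as well.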

So far I have tried to break it apart using a case class:

case class Event(
    sequence: String, 
    id: String, 
    trackingInfo:String, 
    location:String, 
    row:String, 
    trackId: String, 
    listrequestId: String, 
    videoId:String, 
    rank: String, 
    requestId: String, 
    `type`:String, 
    time: String) 

val dataframeRdd = myRdd.map(line => line.split(",")). 
    map(array => Event(
     array(0).split(":")(1), 
     array(1).split(":")(1), 
     array(2).split(":")(1), 
     array(3).split(":")(1), 
     array(4).split(":")(1), 
     array(5).split(":")(1), 
     array(6).split(":")(1), 
     array(7).split(":")(1), 
     array(8).split(":")(1), 
     array(9).split(":")(1), 
     array(10).split(":")(1), 
     array(11).split(":")(1) 
     )) 

But I keep getting a java.lang.ArrayIndexOutOfBoundsException: 1 error.
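The exception can be reproduced without Spark. The following is a minimal sketch, assuming a shortened, hypothetical record with the same shape as the ones above: splitting on "," ignores nesting, so the elements of the type array become separate tokens that contain no ":" at all.

```scala
// Hypothetical, shortened record with the same shape as the ones in the question.
val line =
  """{"sequence":89,"trackingInfo":{"location":"Browse","row":0},"type":["Play","Action","Session"],"time":527636408955}"""

// Splitting on "," ignores nesting, so the "type" array is cut into separate tokens.
val tokens: Array[String] = line.split(",")

// A token without ":" yields a one-element array after split(":"),
// so indexing it with (1) throws ArrayIndexOutOfBoundsException: 1.
val tokensWithoutColon: Array[String] = tokens.filter(_.split(":").length < 2)

// Prints "Action" and "Session"] (including the quotes and the bracket).
tokensWithoutColon.foreach(println)
```

Even for tokens that do contain a ":", the extracted value can be wrong: for the trackingInfo token, split(":")(1) returns {"location" rather than a usable value.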

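What is the best way to do this? As you can see, record number 5 has some of its attributes in a slightly different order. Is it possible to parse based on attribute names instead of splitting on "," and so on?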

I am using Spark 1.6.x.

Thanks

Answers

1

The records in your RDD do not appear to be valid JSON. You need to convert them into valid JSON first:

val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}")) 
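Note that the first replace call above turns the trailing `,1],` into `,`, so each cleaned record still ends with a comma. A stricter variant is sketched below; `toValidJson` is a hypothetical helper name, and the two sample strings are shortened stand-ins for the real records.

```scala
// Hypothetical helper: strips the trailing ",1]," / ",1]" / "]" residue from one record.
def toValidJson(raw: String): String =
  raw.trim
    .replaceAll(",1\\],?$", "") // {...},1],  becomes  {...}
    .replaceAll("\\]$", "")     // {...}]     becomes  {...}

// Shortened stand-ins for a middle record and the last record.
val samples = Seq(
  """{"sequence":89,"time":527636408955},1], """,
  """{"sequence":130,"time":527637394083}] """
)

val cleaned = samples.map(toValidJson)
cleaned.foreach(println)
```

Under the same assumptions, this could be applied to the RDD as `myRdd.map(toValidJson)`.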

You can then use sqlContext to read the RDD of valid JSON strings into a DataFrame:

val df = sqlContext.read.json(validJsonRdd) 

This should give you the following DataFrame (I used the invalid JSON you provided in the question):

+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ 
|id    |sequence|time  |trackingInfo                                |type     | 
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ 
|8697344444103393|89  |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]| 
|8697389197662617|153  |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]| 
|8697389381205360|155  |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]| 
|8697374208897843|136  |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]| 
|8697413135394406|189  |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830]             |[Play, Action, Session]| 
|8697373887446384|130  |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]| 
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ 

and the schema of the DataFrame is

root
 |-- id: long (nullable = true)
 |-- sequence: long (nullable = true)
 |-- time: long (nullable = true)
 |-- trackingInfo: struct (nullable = true)
 |    |-- listId: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- rank: long (nullable = true)
 |    |-- requestId: string (nullable = true)
 |    |-- row: long (nullable = true)
 |    |-- trackId: long (nullable = true)
 |    |-- videoId: long (nullable = true)
 |-- type: array (nullable = true)
 |    |-- element: string (containsNull = true)
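In this schema trackingInfo is still a single struct column, whereas the question asks for its nested fields as separate columns. One way to get there is to select each nested field by its dotted path. This is only a sketch: `flattenTrackingInfo` is a hypothetical helper name, and the field names are taken from the schema printed above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: promote every trackingInfo field to a top-level column.
// Assumes the DataFrame has the schema shown above.
def flattenTrackingInfo(df: DataFrame): DataFrame =
  df.select(
    col("id"),
    col("sequence"),
    col("time"),
    col("trackingInfo.listId").as("listId"),
    col("trackingInfo.location").as("location"),
    col("trackingInfo.rank").as("rank"),
    col("trackingInfo.requestId").as("requestId"),
    col("trackingInfo.row").as("row"),
    col("trackingInfo.trackId").as("trackId"),
    col("trackingInfo.videoId").as("videoId"),
    col("type") // stays a single array<string> column
  )
```

Calling `flattenTrackingInfo(df).printSchema()` should then list the seven trackingInfo fields at the top level; a record that lacks listId, as record 5 does, keeps a null in that column.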

I hope the answer is helpful.

0

You can use sqlContext.read.json(myRDD.map(_._2)) to read the JSON into a DataFrame.
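This one-liner only applies if the RDD holds pairs whose second element is the JSON string, which the question does not confirm. A minimal sketch under that assumption follows; `pairValuesToDataFrame` is a hypothetical helper name. If the JSON text is the first element of each pair instead, the projection would be `_._1`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// Assumption: each element is a (key, jsonString) pair. Neither the key type
// nor the tuple layout is confirmed by the question.
def pairValuesToDataFrame[K](sqlContext: SQLContext, pairs: RDD[(K, String)]): DataFrame =
  sqlContext.read.json(pairs.map(_._2)) // keep only the JSON text, drop the key
```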
