2016-10-16

How do I manipulate my DataFrame in Spark?

I have a stream of nested JSON RDDs coming from a Kafka topic. The data looks like this:

{ 
    "time":"sometext1","host":"somehost1","event": 
    {"category":"sometext2","computerName":"somecomputer1"} 
} 

I turn this into a DataFrame, and its schema looks like:

root 
|-- event: struct (nullable = true) 
| |-- category: string (nullable = true) 
| |-- computerName: string (nullable = true) 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 

I am trying to save it to a Hive table on HDFS with a schema like this:
category:string 
computerName:string 
time:string 
host:string 

This is my first time using Spark and Scala. I would appreciate it if someone could help me. Thanks.

Answer

// Creating an RDD containing the raw JSON string
val vals = sc.parallelize(
    """{"time":"sometext1","host":"somehost1","event": {"category":"sometext2","computerName":"somecomputer1"}}""" ::
    Nil)

// Creating the schema (these imports are needed for the type definitions)
import org.apache.spark.sql.types.{StructType, StringType}

val schema = (new StructType)
    .add("time", StringType)
    .add("host", StringType)
    .add("event", (new StructType)
        .add("category", StringType)
        .add("computerName", StringType))

import sqlContext.implicits._ 
val jsonDF = sqlContext.read.schema(schema).json(vals) 

jsonDF.printSchema

root 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 
|-- event: struct (nullable = true) 
| |-- category: string (nullable = true) 
| |-- computerName: string (nullable = true) 

// Flattening the nested event struct into top-level columns
val df = jsonDF.select($"event.*", $"time", $"host")

df.printSchema

root 
|-- category: string (nullable = true) 
|-- computerName: string (nullable = true) 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 

df.show

+---------+-------------+---------+---------+
| category| computerName|     time|     host|
+---------+-------------+---------+---------+
|sometext2|somecomputer1|sometext1|somehost1|
+---------+-------------+---------+---------+
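The question also asks about saving the result to a Hive table on HDFS. One way to do that from the flattened DataFrame is a sketch like the following, assuming a Hive-enabled context (a HiveContext in Spark 1.x, or a SparkSession built with `enableHiveSupport()` in 2.x) and a hypothetical table name `my_events`:

```scala
// Persist the flattened DataFrame as a Hive table on HDFS.
// "my_events" is a hypothetical table name chosen for this example.
df.write
  .mode("overwrite")        // replace the table if it already exists
  .saveAsTable("my_events")

// Read it back through SQL to verify the columns match the target schema
sqlContext.sql("SELECT category, computerName, time, host FROM my_events").show
```

`saveAsTable` writes the data as a managed table in the Hive warehouse directory on HDFS; if you need the table to live at a specific path or use a specific format, `df.write.format("orc")` (or `"parquet"`) with an explicit `option("path", ...)` is another option.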