2016-10-16

How do I manipulate my DataFrame in Spark?

I have a stream of nested JSON RDDs coming from a Kafka topic. The data looks like this:

{ 
    "time":"sometext1","host":"somehost1","event": 
    {"category":"sometext2","computerName":"somecomputer1"} 
} 

I turn this into a DataFrame, and its schema looks like:

root 
|-- event: struct (nullable = true) 
| |-- category: string (nullable = true) 
| |-- computerName: string (nullable = true) 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 

I am trying to save it to a Hive table on HDFS with a schema like this:
category:string 
computerName:string 
time:string 
host:string 

This is my first time using Spark and Scala. I would appreciate it if someone could help me. Thanks.

Answer

// Creating an RDD containing the raw JSON string
val vals = sc.parallelize(
    """{"time":"sometext1","host":"somehost1","event": {"category":"sometext2","computerName":"somecomputer1"}}""" ::
    Nil)

// Creating the schema (these imports are needed for the type definitions)
import org.apache.spark.sql.types.{StructType, StringType}

val schema = (new StructType)
    .add("time", StringType)
    .add("host", StringType)
    .add("event", (new StructType)
        .add("category", StringType)
        .add("computerName", StringType))

import sqlContext.implicits._ 
val jsonDF = sqlContext.read.schema(schema).json(vals) 

jsonDF.printSchema

root 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 
|-- event: struct (nullable = true) 
| |-- category: string (nullable = true) 
| |-- computerName: string (nullable = true) 

// Flattening the nested event struct into top-level columns
val df = jsonDF.select($"event.*", $"time", $"host")

df.printSchema

root 
|-- category: string (nullable = true) 
|-- computerName: string (nullable = true) 
|-- time: string (nullable = true) 
|-- host: string (nullable = true) 

df.show

+---------+-------------+---------+---------+
| category| computerName|     time|     host|
+---------+-------------+---------+---------+
|sometext2|somecomputer1|sometext1|somehost1|
+---------+-------------+---------+---------+
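The question also asks about saving the result to a Hive table on HDFS. One way to do that from the flattened DataFrame is a sketch like the following, assuming a Hive-enabled context (a HiveContext in Spark 1.x, or a SparkSession built with `enableHiveSupport()` in 2.x) and a hypothetical table name `my_events`:

```scala
// Persist the flattened DataFrame as a Hive table on HDFS.
// "my_events" is a hypothetical table name chosen for this example.
df.write
  .mode("overwrite")        // replace the table if it already exists
  .saveAsTable("my_events")

// Read it back through SQL to verify the columns match the target schema
sqlContext.sql("SELECT category, computerName, time, host FROM my_events").show
```

`saveAsTable` writes the data as a managed table in the Hive warehouse directory on HDFS; if you need the table to live at a specific path or use a specific format, `df.write.format("orc")` (or `"parquet"`) with an explicit `option("path", ...)` is another option.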