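Splitting the contents of a DataFrame column for a nested gz file using Spark 1.4.1

I am having trouble splitting the contents of a DataFrame column for a nested gz file using Spark 1.4.1. I use the map function to map the attributes of the gz file.

The data is in the following format: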

"id": "tag:1234,89898", 
"actor": 
{ 
    "objectType": "person", 
    "id": "id:1234", 
    "link": "http:\wwww.1234.com/" 
}, 
"body",

I use the following code to read the data file and split it into columns.

val dataframe= sc.textFile(("filename.dat.gz") 
       .toString()) 
       .map(_.split(",")) 
       .map(r => {(r(0), r(1),r(2))}) 
       .toDF() 

dataframe.printSchema()

But the result looks like this:

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)

This is not the correct format. I want the schema to look like this:

----- id 
----- actor 
     ---objectType 
     ---id 
     ---link 
-----body

Am I doing something wrong? I need to use this code to do some processing on my dataset and apply some transformations.

2016-03-25 user2122466

This data looks like JSON. Fortunately, Spark SQL makes it easy to ingest JSON data. From the Spark documentation:

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file.

Here is an example based on the documentation:

val sqlContext = new org.apache.spark.sql.SQLContext(sc) 

val myData = sc.textFile("myPath").map(s => makeValidJSON(s)) 
val myNewData = sqlContext.read.json(myData) 

// The inferred schema can be visualized using the printSchema() method. 
myNewData.printSchema()

For the makeValidJSON function, you only need to work out a string parsing/manipulation strategy that turns each record into valid JSON.

Hope this helps.
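As a rough illustration of what such a helper could look like, here is a minimal sketch. It is not from the original answer. It assumes (hypothetically) that each input line holds one complete record that is only missing its enclosing braces and may end with a trailing comma. The object name JsonRepair is made up for this example, and the real data may need more repair than this, for example a key such as "body" that has no value.

```scala
// Hypothetical helper; the name JsonRepair and the input assumptions are
// illustrative only and are not part of Spark or of the original answer.
object JsonRepair {

  // Assumes one complete record per line that may lack the outer braces
  // and may end with a trailing comma.
  def makeValidJSON(line: String): String = {
    // Drop surrounding whitespace and a single trailing comma, if present.
    val trimmed = line.trim.stripSuffix(",").trim

    // Leave records that are already wrapped in braces untouched;
    // otherwise wrap the key/value pairs in an object.
    if (trimmed.startsWith("{") && trimmed.endsWith("}")) trimmed
    else "{" + trimmed + "}"
  }
}
```

With this in scope, the map step in the code above could be written as map(s => JsonRepair.makeValidJSON(s)). Regarding the gz input: sc.textFile normally decompresses files with a .gz extension transparently, so the lines reaching the helper should already be plain text. If a record spans several lines, the lines would first have to be joined into one string per record, which this sketch does not attempt.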

2016-03-25 00:51:25

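Hi Brian, it is a gz file. I have to load the gz file, and Spark has no support for it. The data looks like JSON, but it isn't. – user2122466

@user2122466 Hmm, a possible hack would be to read it in, map it to an 'RDD[String]' with one unit per element (one object per record), and then pass the new RDD to the 'DataFrameReader' 'json' function. Here is the link to the DataFrameReader documentation: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameReader.html –

Hi Brian, thanks for the information. Could you give me a code example? I have never used it before. – user2122466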
