
1) I have to write code that reads a JSON file in Spark. I am using spark.read.json("sample.json"), but the read fails even on a simple JSON file like the one below:

{ 
    {"id" : "1201", "name" : "satish", "age" : "25"} 
    {"id" : "1202", "name" : "krishna", "age" : "28"} 
    {"id" : "1203", "name" : "amith", "age" : "39"} 
    {"id" : "1204", "name" : "javed", "age" : "23"} 
    {"id" : "1205", "name" : "prudvi", "age" : "23"} 
} 

I get this incorrect result:

+---------------+----+----+-------+
|_corrupt_record| age|  id|   name|
+---------------+----+----+-------+
|              {|null|null|   null|
|           null|  25|1201| satish|
|           null|  28|1202|krishna|
|           null|  39|1203|  amith|
|           null|  23|1204|  javed|
|           null|  23|1205| prudvi|
|              }|null|null|   null|
+---------------+----+----+-------+

I found the example above here.

2) Also, I don't know how to handle JSON files that are not formatted that way, like the one below:

{
    "title": "Person",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

I find it difficult to work with these files. Is there any coherent way of dealing with such JSON files from Spark in Java/Scala?

Please help, thanks!

Answer


Spark's JSON source expects one complete JSON object per line (the JSON Lines format), so your file should look like this:

{"id" : "1201", "name" : "satish", "age" : "25"} 
{"id" : "1202", "name" : "krishna", "age" : "28"} 
{"id" : "1203", "name" : "amith", "age" : "39"} 
{"id" : "1204", "name" : "javed", "age" : "23"} 
{"id" : "1205", "name" : "prudvi", "age" : "23"} 

and the code is:

%spark.pyspark

# Zeppelin's PySpark interpreter provides sqlContext out of the box
sqlc = sqlContext

# input: one JSON object per line
file_json = "hdfs://mycluster/user/test/test.json"

df = sqlc.read.json(file_json)
df.registerTempTable("myfile")

df2 = sqlc.sql("SELECT * FROM myfile")

df2.show()

Output:

+---+----+-------+ 
|age|  id|   name|
+---+----+-------+ 
| 25|1201| satish| 
| 28|1202|krishna| 
| 39|1203| amith| 
| 23|1204| javed| 
| 23|1205| prudvi| 
+---+----+-------+
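
As for your second file: it is actually valid JSON, it just spans multiple lines, while Spark by default expects one complete JSON document per line. Since Spark 2.2 the reader can parse such files via the multiLine option. Here is a minimal sketch under that assumption (the HDFS path is hypothetical):

%spark.pyspark

# Spark 2.2+: parse one JSON document that spans multiple lines
file_json = "hdfs://mycluster/user/test/person_schema.json"  # hypothetical path

df = sqlContext.read.option("multiLine", "true").json(file_json)
df.printSchema()
df.show(truncate=False)

# On Spark < 2.2, a common workaround is to read the whole file as a
# single string with wholeTextFiles and let Spark parse the resulting RDD:
raw = sc.wholeTextFiles(file_json).map(lambda kv: kv[1])
df_old = sqlContext.read.json(raw)

Each top-level key (title, type, properties, required) then becomes a column, with properties inferred as a nested struct.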