
1) I have to write code that reads a JSON file in Spark. I am using spark.read.json("sample.json"), but the read fails even on a simple JSON file like the one below:

{ 
    {"id" : "1201", "name" : "satish", "age" : "25"} 
    {"id" : "1202", "name" : "krishna", "age" : "28"} 
    {"id" : "1203", "name" : "amith", "age" : "39"} 
    {"id" : "1204", "name" : "javed", "age" : "23"} 
    {"id" : "1205", "name" : "prudvi", "age" : "23"} 
} 

I get this incorrect result:

+---------------+----+----+-------+
|_corrupt_record| age|  id|   name|
+---------------+----+----+-------+
|              {|null|null|   null|
|           null|  25|1201| satish|
|           null|  28|1202|krishna|
|           null|  39|1203|  amith|
|           null|  23|1204|  javed|
|           null|  23|1205| prudvi|
|              }|null|null|   null|
+---------------+----+----+-------+

I found the example above here.

2) Also, I don't know how to handle JSON files that are not formatted that way, like the one below:

{
    "title": "Person",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

I find it difficult to work with these files. Is there any coherent way of dealing with such JSON files from Spark in Java/Scala?

Please help, thanks!

Answer


Spark's JSON source expects one complete JSON object per line (the JSON Lines format), so your file should look like this:

{"id" : "1201", "name" : "satish", "age" : "25"} 
{"id" : "1202", "name" : "krishna", "age" : "28"} 
{"id" : "1203", "name" : "amith", "age" : "39"} 
{"id" : "1204", "name" : "javed", "age" : "23"} 
{"id" : "1205", "name" : "prudvi", "age" : "23"} 

and the code is:

%spark.pyspark

# Zeppelin's PySpark interpreter provides sqlContext out of the box
sqlc = sqlContext

# input: one JSON object per line
file_json = "hdfs://mycluster/user/test/test.json"

df = sqlc.read.json(file_json)
df.registerTempTable("myfile")

df2 = sqlc.sql("SELECT * FROM myfile")

df2.show()

Output:

+---+----+-------+ 
|age|  id|   name|
+---+----+-------+ 
| 25|1201| satish| 
| 28|1202|krishna| 
| 39|1203| amith| 
| 23|1204| javed| 
| 23|1205| prudvi| 
+---+----+-------+
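
As for your second file: it is actually valid JSON, it just spans multiple lines, while Spark by default expects one complete JSON document per line. Since Spark 2.2 the reader can parse such files via the multiLine option. Here is a minimal sketch under that assumption (the HDFS path is hypothetical):

%spark.pyspark

# Spark 2.2+: parse one JSON document that spans multiple lines
file_json = "hdfs://mycluster/user/test/person_schema.json"  # hypothetical path

df = sqlContext.read.option("multiLine", "true").json(file_json)
df.printSchema()
df.show(truncate=False)

# On Spark < 2.2, a common workaround is to read the whole file as a
# single string with wholeTextFiles and let Spark parse the resulting RDD:
raw = sc.wholeTextFiles(file_json).map(lambda kv: kv[1])
df_old = sqlContext.read.json(raw)

Each top-level key (title, type, properties, required) then becomes a column, with properties inferred as a nested struct.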