2013-01-03 127 views
4

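Hive/ElasticMapreduce: How to make JsonSerDe ignore malformed JSON?

I'm fairly new to Hive and Elastic MapReduce, and I'm currently stuck on a specific problem. When I run a Hive statement on a table with billions of rows of JSON objects, the MapReduce job crashes as soon as just one of those rows contains invalid/malformed JSON.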

Exception:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {"ip":"39488130","cdate":"2012-08-09","cdate_ts":"2012-08-09 17:06:41","country":"SA","city":"Riyadh","mid":"6666276582211270592","osversion":"5.1.1 
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {"ip":"39488130","cdate":"2012-08-09","cdate_ts":"2012-08-09 17:06:41","country":"SA","city":"Riyadh","mid":"6666276582211270592","osversion":"5.1.1 
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524) 
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143) 
... 8 more 
Caused by: com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated string near 
at com.google.gson.Streams.parse(Streams.java:51) 
at com.google.gson.JsonParser.parse(JsonParser.java:83) 
at com.google.gson.JsonParser.parse(JsonParser.java:58) 
at com.google.gson.JsonParser.parse(JsonParser.java:44) 
at com.amazon.elasticmapreduce.JsonSerde.deserialize(Unknown Source) 
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:510) 
... 9 more 
Caused by: com.google.gson.stream.MalformedJsonException: Unterminated string near 
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110) 
at com.google.gson.stream.JsonReader.nextString(JsonReader.java:967) 
at com.google.gson.stream.JsonReader.nextValue(JsonReader.java:802) 
at com.google.gson.stream.JsonReader.objectValue(JsonReader.java:782) 
at com.google.gson.stream.JsonReader.quickPeek(JsonReader.java:377) 
at com.google.gson.stream.JsonReader.peek(JsonReader.java:340) 
at com.google.gson.Streams.parseRecursive(Streams.java:60) 
at com.google.gson.Streams.parseRecursive(Streams.java:83) 
at com.google.gson.Streams.parse(Streams.java:40) 
... 14 more 

I created my table like this:

CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
column1 string, 
column2 string 
) 
PARTITIONED BY (year string, month string) 
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde' 
WITH SERDEPROPERTIES ('paths'='c1, c2') 
LOCATION 's3://mybucket/table1'; 

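What can I do to prevent the crash? Simply ignoring malformed JSON objects/strings would be fine, since they are only a tiny fraction of the billions of rows.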

Thanks in advance for your help. Best, Sascha

Answers

3

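By changing the class used in the row format and adding the "ignore.malformed.json" property, you can make your CREATE TABLE statement work with malformed JSON: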

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true") 
LOCATION ... 

Make the JAR available to Hive either through the 'hive.aux.jars.path' property in 'hive-site.xml' or with the Hive 'ADD JAR' command. You can find the JAR here, or compile it from this source.
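
Putting the pieces of this answer together with the table from the question, the full statement might look like the sketch below. The JAR path is a placeholder, not a real location. Two assumptions worth checking against the SerDe's documentation: the 'paths' property from the question belongs to the Amazon SerDe, so it is dropped here, and the OpenX SerDe matches columns to JSON keys by name, so the column names may need to match the keys in your data.

```sql
-- Placeholder path: point this at wherever you stored the OpenX SerDe JAR.
ADD JAR /path/to/json-serde.jar;

-- Note: because of IF NOT EXISTS, this does nothing if table1 already exists
-- with the old SerDe. Use a new table name or drop the old definition first
-- (dropping an EXTERNAL table does not delete the data in S3).
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
  column1 string,
  column2 string
)
PARTITIONED BY (year string, month string)
-- Swap the Amazon SerDe for the OpenX one.
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- Ask the SerDe to tolerate rows it cannot parse instead of failing the job.
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket/table1';
```

With this setting, malformed rows are expected to come back with NULL columns rather than raising an exception, so you may still want to filter them out in your queries.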

-2

Basically, the error above occurs because the JSON string is invalid. Try to fix that.

Otherwise, to keep your application from crashing, catch the exception in a try/catch block and proceed, so that your application does not crash.
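
In a Hive query there is no try/catch to wrap around the SerDe, so the closest equivalent is a different technique: read each line as a plain string and parse it with Hive's built-in get_json_object function, which returns NULL when the input is not valid JSON. The following is only a sketch under assumptions: the table name table1_raw is made up for illustration, the JSON keys c1 and c2 are taken from the 'paths' property in the question, and the lines are assumed not to contain Hive's default field delimiter (Ctrl-A).

```sql
-- Hypothetical raw table over the same S3 location: each line lands in one string column.
CREATE EXTERNAL TABLE IF NOT EXISTS table1_raw (
  line string
)
PARTITIONED BY (year string, month string)
LOCATION 's3://mybucket/table1';

-- get_json_object yields NULL for lines that fail to parse, so the job keeps running.
SELECT
  get_json_object(line, '$.c1') AS column1,
  get_json_object(line, '$.c2') AS column2
FROM table1_raw
-- Caveat: this filter also drops valid rows that simply have no c1 key.
WHERE get_json_object(line, '$.c1') IS NOT NULL;
```

The partitions still have to be registered on the new table before the query returns any rows.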

0

The JsonSerDe from Apache seems to ignore malformed JSON strings... http://code.google.com/p/hive-json-serde/

+0

That project does not seem to support Hive versions greater than 0.10: https://code.google.com/p/hive-json-serde/issues/detail?id=15&colspec=ID%20Stars%20Type%20Status%20Priority%20Milestone%20Owner%20Summary