Hive error when querying an external table over a Flume stream (Twitter analysis demo on CDH 5.4)

On CDH 5.4, I am trying to build the Twitter analysis demo:
- Flume captures tweets into an HDFS directory
- Hive queries the tweets via a Hive SerDe
Step 1 works. I can see the tweets being captured and routed correctly into the target HDFS directory. I observe that a temporary file is created first and then renamed to a permanent file:
-rw-r--r-- 3 root hadoop 7548 2015-10-06 06:39 /user/flume/tweets/FlumeData.1444127932782
-rw-r--r-- 3 root hadoop 10034 2015-10-06 06:39 /user/flume/tweets/FlumeData.1444127932783.tmp
I use the following table definition:
CREATE EXTERNAL TABLE tweets(
id bigint,
created_at string,
lang string,
source string,
favorited boolean,
retweet_count int,
retweeted_status
struct<text:string,user:struct<screen_name:string,name:string>>,
entities struct<urls:array<struct<expanded_url:string>>,
user_mentions:array<struct<screen_name:string,name:string>>,
hashtags:array<struct<text:string>>>,
text string,
user struct<location:string,geo_enabled:string,screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string>,
in_reply_to_screen_name string)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://master.ds.com:8020/user/flume/tweets';
However, when I query this table, I get the following error:
hive> select count(*) from tweets;
Ended Job = job_1443526273848_0140 with errors
...
Diagnostic Messages for this Task:
Error: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreation
... 11 more
Caused by: java.io.FileNotFoundException: File does not exist: /user/flume/tweets/FlumeData.1444128601078.tmp
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 1.19 sec HDFS Read: 10492 HDFS Write: 0 FAIL
I believe the problem is the temporary file: the MapReduce job spawned by the Hive query tries to read a `.tmp` file that Flume is still writing (and then renames), so the file no longer exists by the time the mapper opens it. Is there a workaround or configuration change to handle this cleanly?
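One workaround I am considering: Hadoop's `FileInputFormat` silently skips paths whose names start with `_` or `.`, so configuring the Flume HDFS sink to prefix in-progress files accordingly should hide them from the Hive-launched MapReduce job until they are renamed. A minimal sketch of the relevant sink properties, assuming an agent named `TwitterAgent` with a sink named `HDFS` (these names are illustrative, not from my actual config):

```properties
# Hypothetical agent/sink names; adjust to match the real Flume config.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://master.ds.com:8020/user/flume/tweets
# Prefix in-progress files with "_" so MapReduce's hidden-file filter ignores them
TwitterAgent.sinks.HDFS.hdfs.inUsePrefix = _
# The default in-use suffix is ".tmp"; it can stay once the prefix hides the file
TwitterAgent.sinks.HDFS.hdfs.inUseSuffix = .tmp
```

Would this be the right approach, or is there a better Hive-side setting?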