Hive multi-insert gives wrong results with SELECT DISTINCT statements

I came across code like this in "Hadoop: The Definitive Guide":
FROM (SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
But in my own test, a multi-insert that uses several DISTINCTs does not return the correct results.
My HiveQL is as follows:
CREATE TABLE IF NOT EXISTS a (logindate int, id int);
Then I load a local file into this table...
CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Then, if I run the inserts as separate statements:
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;
the results are correct.
But if I use the following multi-insert HQL instead:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;
the results are not correct: both partitions end up with the same number of records, as if each insert had run SELECT DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121.
So is this a bug, or did I get the syntax wrong?
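For reference, this is roughly how the mismatch can be checked. It is only a sketch, assuming the tables a and user defined above and the two dates used in the inserts; the column aliases distinct_ids and row_count are just illustrative names. If the multi-insert behaved correctly, the two queries should report the same count for each logindate.

```sql
-- Distinct ids per day in the source table (what each partition should contain).
SELECT logindate, COUNT(DISTINCT id) AS distinct_ids
FROM a
WHERE logindate = 20130120 OR logindate = 20130121
GROUP BY logindate;

-- Rows actually written to each partition of the target table.
-- Note: on newer Hive versions, where user is a reserved word,
-- the table name may need to be quoted with backticks.
SELECT logindate, COUNT(1) AS row_count
FROM user
WHERE logindate = 20130120 OR logindate = 20130121
GROUP BY logindate;
```

With the separate INSERT statements the two result sets agree; with the multi-insert, row_count comes out the same for both partitions.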
I found a workaround to get my job done: 'set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; INSERT OVERWRITE TABLE user PARTITION(logindate) SELECT DISTINCT(id), logindate FROM a DISTRIBUTE BY logindate', which uses dynamic partitioning to load the data from the initial table. – yee 2013-03-04 03:16:26