
Why doesn't increasing the number of instances speed up my Hive query?

I created a table with Hive in Amazon Elastic MapReduce, imported data into it, and partitioned it. Now I run a query that counts the most common words in one of the table's fields.

When I ran that query with 1 master and 2 core instances, it took 180 seconds. Then I reconfigured the cluster to 1 master and 10 core instances, and it still took 180 seconds. Why isn't it faster?

The output is almost identical whether I run on 2 cores or on 10 cores:

Total MapReduce jobs = 2 
Launching Job 1 out of 2 

Number of reduce tasks not specified. Estimated from input data size: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201208251929_0003, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0003 
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.120.250.34:9001 -kill job_201208251929_0003 
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 
2012-08-25 19:38:47,399 Stage-1 map = 0%, reduce = 0% 
2012-08-25 19:39:00,482 Stage-1 map = 3%, reduce = 0% 
2012-08-25 19:39:03,503 Stage-1 map = 5%, reduce = 0% 
2012-08-25 19:39:06,523 Stage-1 map = 10%, reduce = 0% 
2012-08-25 19:39:09,544 Stage-1 map = 18%, reduce = 0% 
2012-08-25 19:39:12,563 Stage-1 map = 24%, reduce = 0% 
2012-08-25 19:39:15,583 Stage-1 map = 35%, reduce = 0% 
2012-08-25 19:39:18,610 Stage-1 map = 45%, reduce = 0% 
2012-08-25 19:39:21,631 Stage-1 map = 53%, reduce = 0% 
2012-08-25 19:39:24,652 Stage-1 map = 67%, reduce = 0% 
2012-08-25 19:39:27,672 Stage-1 map = 75%, reduce = 0% 
2012-08-25 19:39:30,692 Stage-1 map = 89%, reduce = 0% 
2012-08-25 19:39:33,715 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:34,723 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:35,730 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:36,802 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:37,810 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:38,819 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:39,827 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:40,835 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:41,845 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:42,856 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:43,865 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:44,873 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:45,882 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:46,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:47,900 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:48,908 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:49,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:50,924 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:51,934 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:52,942 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:53,950 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:54,958 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:55,967 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:56,976 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:57,990 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:59,001 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:40:00,011 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:40:01,022 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:02,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:03,041 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:04,051 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:05,060 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:06,070 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:07,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
MapReduce Total cumulative CPU time: 1 minutes 12 seconds 860 msec 
Ended Job = job_201208251929_0003 
Counters: 
Launching Job 2 out of 2 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201208251929_0004, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0004 
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.120.250.34:9001 -kill job_201208251929_0004 
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1 
2012-08-25 19:40:30,147 Stage-2 map = 0%, reduce = 0% 
2012-08-25 19:40:43,241 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:44,254 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:45,262 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:46,272 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:47,282 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:48,290 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:49,298 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:50,306 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:51,315 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:52,323 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:53,331 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:54,339 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:55,347 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:56,357 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:57,365 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:58,374 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:40:59,384 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:00,393 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:01,407 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:02,420 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:03,431 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:04,443 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
MapReduce Total cumulative CPU time: 10 seconds 850 msec 
Ended Job = job_201208251929_0004 
Counters: 
MapReduce Jobs Launched: 
Job 0: Map: 2 Reduce: 1 Accumulative CPU: 72.86 sec HDFS Read: 4920 HDFS Write: 8371374 SUCCESS 
Job 1: Map: 1 Reduce: 1 Accumulative CPU: 10.85 sec HDFS Read: 8371850 HDFS Write: 456 SUCCESS 
Total MapReduce CPU Time Spent: 1 minutes 23 seconds 710 msec 

Answers

1

You have only one reducer, and it is doing most of the work. I think that is the reason.
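
The log above shows why: "Number of reduce tasks not specified. Estimated from input data size: 1". Hive derives the reducer count from the input size and the per-reducer byte threshold, so one way to raise it is to lower that threshold. A minimal sketch (the 128 MB value below is illustrative, not tuned for this job):

    -- Hive estimates reducers roughly as input_size / bytes.per.reducer,
    -- so a lower threshold yields more reducers (128 MB is illustrative)
    set hive.exec.reducers.bytes.per.reducer=134217728;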

+0

I tried again and configured 1 **large** master instance and 2 **large** core instances; the job took 120 seconds, 60 seconds less than with the small instances. – keepkimi

+0

You should compare not 120 vs 180, but roughly (120-60) vs (180-60), where 60 seconds is the job startup time. That is 60 s vs 120 s, so you actually got a 2x speedup. –

+0

Can you post the query? Things like "order by" in Hive always go through a single reducer, so they should be avoided when the result set is large. –
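
To illustrate that point, compare the two forms below (word_counts is a hypothetical table name, not taken from the question). The first produces one globally ordered result but forces everything through a single reducer; the second sorts within each reducer, so it parallelizes but the output is only ordered per reducer:

    -- forces a single reducer: one global ordering of the result set
    SELECT word, cnt FROM word_counts ORDER BY cnt DESC;

    -- sorts within each reducer only; runs in parallel, but the output
    -- is ordered per reducer, not globally
    SELECT word, cnt FROM word_counts SORT BY cnt DESC;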

0

I think you should increase the number of reducers for your query. That is done with the following setting:

set mapred.reduce.tasks=n; 

where n is the number of reducers.

Then use a DISTRIBUTE BY or CLUSTER BY clause (not to be confused with CLUSTERED BY) to spread the dataset across the reducers as evenly as possible; see the sketch after the quote below. If you don't need sorting, prefer DISTRIBUTE BY, since

Cluster By is a short-cut for both Distribute By and Sort By
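
Putting the two together, a minimal sketch of the distribution step (word_counts and its columns are assumed names, just for illustration):

    set mapred.reduce.tasks=10;

    -- send all rows with the same word to the same reducer, spreading
    -- the keys across all 10 reducers; no ordering is imposed
    SELECT word, cnt
    FROM word_counts
    DISTRIBUTE BY word;

    -- CLUSTER BY word would additionally sort by word within each reducer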

Here is a link to the hive manual.