2011-08-03
1

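Using sorted tables in Hive

Summary: I think my system is ignoring the fact that a table is pre-sorted. I expected to save time in the sort step because I am using pre-sorted data, but the query plan seems to show an intermediate sort step.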

The gory details are below:

The Setup =======

I have set the following flags: =============

set hive.enforce.bucketing = true; 
set mapred.reduce.tasks=8; 
set mapred.map.tasks=8; 

Here I create a table to hold a temporary copy of the data on disk ========

CREATE TABLE trades
    (symbol STRING, exchange STRING, price FLOAT, volume INT,
     cond INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

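Here I copy the data from disk into the table. By the way, the data here is already clustered by symbol and sorted by time. I can't seem to get Hive to take advantage of this, i.e. to avoid sorting again.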

LOAD DATA LOCAL INPATH '%(dir)s2010-05-07' 
INTO TABLE trades 
partition (dt='2010-05-07'); 

I enforce bucketing and impose a sort order with the following final table ===========

CREATE TABLE alltrades
    (symbol STRING, exchange STRING, price FLOAT, volume INT,
     cond INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

The data is loaded from the Hive table ==========

insert overwrite table alltrades 
select symbol, exchange, price, volume, cond, bid, ask, time 
from trades 
distribute by symbol sort by symbol, time; 

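It is disappointing to see that any query which needs the (symbol, time) ordering sorts the data again. Is there a way around this? Also, is there a way to make this whole process work in 1 query step instead of 2?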

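Why sorting doesn't seem to work =======

Note that the table was created and populated with the sort clauses. I am afraid that dropping them would make future reducers behave as if no sorting were needed.

Here is a query whose plan, in my opinion, should not involve a sort, but actually does. ========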

hive> explain select symbol, time, price from alltrades sort by symbol, time; 
OK 
ABSTRACT SYNTAX TREE: 
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades))) 
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL 
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY 
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol)) 
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time))))) 

STAGE DEPENDENCIES: 
Stage-1 is a root stage 
Stage-0 is a root stage 

STAGE PLANS: 
Stage: Stage-1 
    Map Reduce 
    Alias -> Map Operator Tree: 
     alltrades 
     TableScan 
      alias: alltrades 
      Select Operator 
      expressions: 
        expr: symbol 
        type: string 
        expr: time 
        type: string 
        expr: price 
        type: float 
      outputColumnNames: _col0, _col1, _col2 
      Reduce Output Operator 
       key expressions: 
        expr: _col0 
        type: string 
        expr: _col1 
        type: string 
       sort order: ++ 
       tag: -1 
       value expressions: 
        expr: _col0 
        type: string 
        expr: _col1 
        type: string 
        expr: _col2 
        type: float 
    Reduce Operator Tree: 
     Extract 
     File Output Operator 
      compressed: false 
      GlobalTableId: 0 
      table: 
       input format: org.apache.hadoop.mapred.TextInputFormat 
       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0 
    Fetch Operator 
    limit: -1 

Answers

3

Have you checked the effect of set hive.enforce.bucketing=true? See also the hive.enforce.sorting property, from http://svn.apache.org/repos/asf/hive/branches/branch-0.7/conf/hive-default.xml

<property> 
    <name>hive.enforce.sorting</name> 
    <value>false</value> 
    <description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description> 
</property> 
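To make the property above concrete, here is a minimal sketch of how the two enforcement flags could be combined when populating the bucketed table. It reuses the trades and alltrades tables from the question and assumes a Hive version that still has these properties (they appear in the 0.7 hive-default.xml quoted here). As I understand the property descriptions, with both flags on Hive plans the reducers and the per-bucket sort from the table definition, so the explicit DISTRIBUTE BY / SORT BY clauses should not be needed. I have not run this against the asker's data.

```sql
-- Assumption: both properties exist in the Hive version in use.
SET hive.enforce.bucketing = true;  -- reducer count follows the table's bucket count
SET hive.enforce.sorting = true;    -- rows are sorted within each bucket on insert

-- Plain insert; bucketing and sorting come from the target table's
-- CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS clause.
INSERT OVERWRITE TABLE alltrades
SELECT symbol, exchange, price, volume, cond, bid, ask, time
FROM trades;
```

This only affects how the data is written. It does not by itself tell later queries to skip a sort.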

You may also find it useful to read the implementation of org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer#genBucketingSortingDest:

http://svn.apache.org/repos/asf/hive/branches/branch-0.7/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

+0

Thanks, but I've given up on Hive. I figure I want something lighter, like Python Disco. – fodon

4

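hive.enforce.bucketing does not do a global sort of the data set. Instead, it writes data that is sorted within each bucket (8 buckets per partition in your case). A global sort step is therefore still needed to satisfy the query you are running.

Hope this helps, Nat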
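The difference between ordering within a reducer and ordering across the whole result can be seen by comparing SORT BY with ORDER BY, as described in the Hive language manual. A small sketch against the alltrades table from the question, for illustration only:

```sql
-- SORT BY: each reducer sorts the rows it receives, so every output file
-- is ordered by (symbol, time), but there is no total order across files.
SELECT symbol, time, price
FROM alltrades
SORT BY symbol, time;

-- ORDER BY: guarantees a total order over the whole result,
-- which Hive achieves by sending all rows through a single reducer.
SELECT symbol, time, price
FROM alltrades
ORDER BY symbol, time;
```

Running EXPLAIN on either statement, as in the question, shows a Reduce Output Operator with the sort keys, so both plans include a shuffle-and-sort stage even though the table is declared SORTED BY (symbol, time).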

0

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

The CLUSTERED BY and SORTED BY creation commands do not affect how 
data is inserted into a table – only how it is read. This means that 
users must be careful to insert data correctly by specifying the 
number of reducers to be equal to the number of buckets, and using 
CLUSTER BY and SORT BY commands in their query. 

Also take a look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy