2014-01-22

I'm trying to run the following query. The size should be within the database's limits, since tables of a similar size are working. I'm getting a spool error in Teradata.

I know there is a way to partition the query using the HASHAMP, HASHBUCKET and HASHROW functions, but I don't know how to do it.

The query is simple; I'm just checking whether the main_acct_product_id variable is present in table b.
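For reference, the kind of hash-based slicing I mean (this is only my guess at how it might look, not something I have working) would be roughly:

```sql
-- Hypothetical sketch: process the join in four slices by hashing the
-- redistribution column, so each pass only needs a fraction of the spool.
-- Repeat with MOD 4 = 1, 2, 3 for the remaining slices.
INSERT INTO graph_total_final
SELECT a.*, COALESCE(b.main_acct_product_id, 'NO MOV') AS producto_destino
FROM graph_total_3 a
LEFT JOIN producto b
  ON a.access_destino = b.access_method_id
WHERE HASHBUCKET(HASHROW(a.access_destino)) MOD 4 = 0;
```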

Some information about the tables in the query:

sel count(*) from graph_total_3 
678,336,354 

top 5 of graph_total_3 
id_phone destino WEIGHT DIR access_method_id access_destino operador producto operador_destino 
2615071884 2615628271 0,42800 0,417000 T2615071884 T2615628271 A aa II 
1150421872 1159393065 343,200 0,424000 T1150421872 T1159393065 B bb LI 
2914076292 2914735291 0,16500 1,003,000 T2914076292 T2914735291 C ar OJ 
2914735291 2914076292 0,16500 -0,003000 T2914735291 T2914076292 A tm JA 
2804535124 2804454795 0,39600 1,000,000 T2804535124 T2804454795 B ma UE 

primary key(id_phone, destino); 

sel count(*) from producto 
26,473,287 

top 5 of producto 
    Access_Method_Id Main_Acct_Product_Id 
    T2974002818   PR_PPAL_AHORRO 
    T3875943432   PR_PPAL_ACTIVA 
    T2616294339   PR_PPAL_ACTIVA 
    T3516468805   PR_PPAL_ACTIVA 
    T2616818855   PR_PPAL_ACTIVA 

primary key(Access_Method_Id); 

SHOW TABLE

show table producto 

CREATE MULTISET VOLATILE TABLE MARBEL.producto ,NO FALLBACK , 
    CHECKSUM = DEFAULT, 
    LOG 
    (
     Access_Method_Id VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC, 
     Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC) 
PRIMARY INDEX (Access_Method_Id) 
ON COMMIT PRESERVE ROWS; 

show table graph_total_3 

CREATE MULTISET VOLATILE TABLE MARBEL.graph_total_3 ,NO FALLBACK , 
    CHECKSUM = DEFAULT, 
    LOG 
    (
     id_phone VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC, 
     destino VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC, 
     WEIGHT DECIMAL(10,5), 
     DIR DECIMAL(7,6), 
     access_method_id VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC, 
     access_destino VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC, 
     operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC, 
     producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC, 
     operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC) 
PRIMARY INDEX (id_phone ,destino) 
ON COMMIT PRESERVE ROWS; 

QUERY

create multiset volatile table graph_total_final as 
(
select a.* , coalesce(b.main_acct_product_id,'NO MOV') as producto_destino 
from graph_total_3 a 
left join producto b on a.access_destino=b.access_method_id 
) 
with data primary index (id_phone, destino) 
on commit preserve rows; 

EXPLAIN

 This query is optimized using type 1 profile bootstrap, profileid -/. 
     1) First, we create the table header. 
     2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an 
     all-rows scan with no residual conditions into Spool 2 (all_amps), 
     which is redistributed by the hash code of (
     MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
     Spool 2 by row hash. The result spool file will not be cached in 
     memory. The size of Spool 2 is estimated with high confidence to 
     be 678,343,248 rows (55,624,146,336 bytes). The estimated time 
     for this step is 2 minutes and 41 seconds. 
     3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a 
     RowHash match scan, which is joined to MARBEL.b by way of a 
     RowHash match scan. Spool 2 and MARBEL.b are left outer joined 
     using a merge join, with condition(s) used for non-matching on 
     left table ("NOT (access_destino IS NULL)"), with a join condition 
     of ("access_destino = MARBEL.b.Access_Method_Id"). The result 
     goes into Spool 1 (all_amps), which is redistributed by the hash 
     code of (MARBEL.a.id_phone, MARBEL.a.destino) to all AMPs. Then 
     we do a SORT to order Spool 1 by row hash. The result spool file 
     will not be cached in memory. The size of Spool 1 is estimated 
     with index join confidence to be 25,085,452,093 rows (
     2,232,605,236,277 bytes). The estimated time for this step is 1 
     hour and 45 minutes. 
     4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1 
     (Last Use). 
     5) Finally, we send out an END TRANSACTION step to all AMPs involved 
     in processing the request. 
     -> No rows are returned to the user as the result of statement 1. 

EXPLAIN 2

After running:

DIAGNOSTIC HELPSTATS ON FOR SESSION; 
EXPLAIN 
create multiset volatile table graph_total_final as 
(
select a.* , coalesce(b.main_acct_product_id,'NO MOVISTAR') as producto_destino 
from graph_total_3 a 
left join producto b on a.access_destino=b.access_method_id 
) 
with data primary index (id_phone, destino, access_destino) 
on commit preserve rows; 

This query is optimized using type 1 profile bootstrap, profileid -/. 
    1) First, we create the table header. 
    2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an 
    all-rows scan with no residual conditions into Spool 2 (all_amps), 
    which is redistributed by the hash code of (
    MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
    Spool 2 by row hash. The result spool file will not be cached in 
    memory. The size of Spool 2 is estimated with high confidence to 
    be 678,343,248 rows (55,624,146,336 bytes). The estimated time 
    for this step is 2 minutes and 41 seconds. 
    3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a 
    RowHash match scan, which is joined to MARBEL.b by way of a 
    RowHash match scan. Spool 2 and MARBEL.b are left outer joined 
    using a merge join, with condition(s) used for non-matching on 
    left table ("NOT (access_destino IS NULL)"), with a join condition 
    of ("access_destino = MARBEL.b.Access_Method_Id"). The result 
    goes into Spool 1 (all_amps), which is redistributed by the hash 
    code of (MARBEL.a.id_phone, MARBEL.a.destino, 
    MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
    Spool 1 by row hash. The result spool file will not be cached in 
    memory. The size of Spool 1 is estimated with index join 
    confidence to be 25,085,452,093 rows (2,232,605,236,277 bytes). 
    The estimated time for this step is 1 hour and 45 minutes. 
    4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1 
    (Last Use). 
    5) Finally, we send out an END TRANSACTION step to all AMPs involved 
    in processing the request. 
    -> No rows are returned to the user as the result of statement 1. 
    BEGIN RECOMMENDED STATS -> 
    6) "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID". 
    (HighConf) 
    7) "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO". 
    (HighConf) 
    <- END RECOMMENDED STATS 
Can you confirm that you have successfully run the "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID" and "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO" statements? – 2014-01-22 13:39:59

Of course I have! It gives the same explain; it's in **EXPLAIN 2**. – marbel

I see. Could you post the 'show table' output for both tables? – 2014-01-22 13:42:09

Answers

These are volatile tables, which means you created them in your current session and you have full control over their definition.

When you change the primary index of MARBEL.graph_total_3 to access_destino, you'll get a direct AMP-local join without any preparation. (You don't need to collect statistics, as that won't change the plan; it only brings the estimated numbers closer to reality.)

The table may be skewed due to the new PI, but when you look at the Explain you'll see that the spool would otherwise get a PI on access_destino anyway.

If MARBEL.producto.Access_Method_Id is actually unique, you should define the PI as unique, too. This will also improve the estimates.
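A minimal sketch of those two changes (the `_pi`/`_upi` table names here are just placeholders; the column definitions come from the SHOW TABLE output above):

```sql
-- Sketch: copy graph_total_3 with the join column as PI, so the join to
-- producto needs no redistribution, and give producto a unique PI.
CREATE MULTISET VOLATILE TABLE graph_total_3_pi AS
 ( SELECT * FROM graph_total_3 )
WITH DATA
PRIMARY INDEX (access_destino)
ON COMMIT PRESERVE ROWS;

CREATE MULTISET VOLATILE TABLE producto_upi AS
 ( SELECT * FROM producto )
WITH DATA
UNIQUE PRIMARY INDEX (Access_Method_Id)
ON COMMIT PRESERVE ROWS;
```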

+1 for the info about volatile tables – 2014-01-22 17:32:12

+1 for the info. So the suggestion is to change the MARBEL.graph_total_3 PI to the one I'll be joining the other table on. The Access_Method_Id field is unique. Thanks! – marbel

Yes, the fastest joins in Teradata are always based on matching primary indexes (and matching partitioning). – dnoeth


Two things strike me as odd right off the bat.

I'd suggest avoiding select a.*, ... unless you really need all the columns from table A. This will reduce the amount of data that needs to be held in spool.

The second thing that looks suspicious is this sentence in step #3: "The size of Spool 1 is estimated with index join confidence to be 25,085,452,093 rows". Are you sure table B is unique on the access_method_id column? If not, you may be inadvertently creating a Cartesian product. (25 billion rows! Really!)

Also, please tell us the demographics of your A & B tables (i.e. the primary indexes, and whether the tables are partitioned).

Update (after seeing more information): The only other thing I can think of (especially if your Teradata environment doesn't have a lot of disk space, particularly spool) is to make sure your data is compressed as well as possible. This saves space (even when the data sits in spool) and reduces the amount of spool space needed.

Here is a compression candidate in table B:

Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('PR_PPAL_AHORRO', 'PR_PPAL_ACTIVA', <continue with list for about the 200 most frequently occurring main acct product ids>). 

By doing this, each 16-byte string can be compressed down to a few bits without adding CPU time.

Similarly, do the same for the following columns in table A:

 operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC compress('A','B', 'C', <other more frequently occurring operador ids>), 
     producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC compress('aa','bb', 'ar', <other more frequently occurring producto ids>), 
     operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC compress('II','LI', 'OJ', <other more frequently occurring operador_destino ids>) 

Consider storing id_phone & destino as INT or BIGINT (if INT isn't big enough). A BIGINT takes 8 bytes, whereas stored as VARCHAR they consume up to 10-12 bytes. When you have 100+ million rows, every byte helps. You can also compress the WEIGHT and DIR columns; for example, if 0.0000 is the most frequently occurring weight/dir value, you can specify COMPRESS (0.0000) and gain space. All COMPRESS clauses must be specified at table creation time.

access_method_id and access_destino appear to be just id_phone with a 'T' prefix. See whether you can strip the first letter and store them as integers. All of this adds up to considerable space savings and will hopefully reduce the spool space needed to execute your query.
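For instance, a sketch of that stripping and casting (assuming the prefix really is always a single 'T' followed by digits):

```sql
-- Sketch: drop the leading 'T' and keep the rest as an 8-byte BIGINT
-- instead of a VARCHAR(22) of up to ~23 bytes.
SELECT CAST(SUBSTRING(access_destino FROM 2) AS BIGINT) AS access_destino_num
FROM graph_total_3;
```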

Finally, I'm not aware of partitioning a query with hashamp/bucket/row (I partition tables, not queries); Teradata should be executing all queries in parallel anyway.

It is unique. I'll add more info. – marbel

@MartínBel, in that case the problem is probably with the A table. Notice from the explain output that table A is being redistributed by 'access_destino'. Do you know if statistics have been collected on this column? If not, collect them and run the explain again. – 2014-01-22 13:29:38

OK. I did that and it doesn't seem to have changed much. – marbel