2014-01-22

I'm trying to run the following query. The size should be within the database's limits, since tables of a similar size are working. I'm getting a spool error in Teradata.

I know there is a way to partition the query using the HASHAMP, HASHBUCKET and HASHROW functions, but I don't know how to do it.

The query is simple; I'm just checking whether the main_acct_product_id variable is present in table b.
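For reference, the kind of hash-based slicing I mean (this is only my guess at how it might look, not something I have working) would be roughly:

```sql
-- Hypothetical sketch: process the join in four slices by hashing the
-- redistribution column, so each pass only needs a fraction of the spool.
-- Repeat with MOD 4 = 1, 2, 3 for the remaining slices.
INSERT INTO graph_total_final
SELECT a.*, COALESCE(b.main_acct_product_id, 'NO MOV') AS producto_destino
FROM graph_total_3 a
LEFT JOIN producto b
  ON a.access_destino = b.access_method_id
WHERE HASHBUCKET(HASHROW(a.access_destino)) MOD 4 = 0;
```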

Some information about the tables in the query:

sel count(*) from graph_total_3 
678,336,354 

top 5 of graph_total_3 
id_phone destino WEIGHT DIR access_method_id access_destino operador producto operador_destino 
2615071884 2615628271 0,42800 0,417000 T2615071884 T2615628271 A aa II 
1150421872 1159393065 343,200 0,424000 T1150421872 T1159393065 B bb LI 
2914076292 2914735291 0,16500 1,003,000 T2914076292 T2914735291 C ar OJ 
2914735291 2914076292 0,16500 -0,003000 T2914735291 T2914076292 A tm JA 
2804535124 2804454795 0,39600 1,000,000 T2804535124 T2804454795 B ma UE 

primary key(id_phone, destino); 

sel count(*) from producto 
26,473,287 

top 5 of producto 
    Access_Method_Id Main_Acct_Product_Id 
    T2974002818   PR_PPAL_AHORRO 
    T3875943432   PR_PPAL_ACTIVA 
    T2616294339   PR_PPAL_ACTIVA 
    T3516468805   PR_PPAL_ACTIVA 
    T2616818855   PR_PPAL_ACTIVA 

primary key(Access_Method_Id); 

SHOW TABLE

show table producto 

CREATE MULTISET VOLATILE TABLE MARBEL.producto ,NO FALLBACK , 
    CHECKSUM = DEFAULT, 
    LOG 
    (
     Access_Method_Id VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC, 
     Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC) 
PRIMARY INDEX (Access_Method_Id) 
ON COMMIT PRESERVE ROWS; 

show table graph_total_3 

CREATE MULTISET VOLATILE TABLE MARBEL.graph_total_3 ,NO FALLBACK , 
    CHECKSUM = DEFAULT, 
    LOG 
    (
     id_phone VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC, 
     destino VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC, 
     WEIGHT DECIMAL(10,5), 
     DIR DECIMAL(7,6), 
     access_method_id VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC, 
     access_destino VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC, 
     operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC, 
     producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC, 
     operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC) 
PRIMARY INDEX (id_phone ,destino) 
ON COMMIT PRESERVE ROWS; 

QUERY

create multiset volatile table graph_total_final as 
(
select a.* , coalesce(b.main_acct_product_id,'NO MOV') as producto_destino 
from graph_total_3 a 
left join producto b on a.access_destino=b.access_method_id 
) 
with data primary index (id_phone, destino) 
on commit preserve rows; 

EXPLAIN

 This query is optimized using type 1 profile bootstrap, profileid -/. 
     1) First, we create the table header. 
     2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an 
     all-rows scan with no residual conditions into Spool 2 (all_amps), 
     which is redistributed by the hash code of (
     MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
     Spool 2 by row hash. The result spool file will not be cached in 
     memory. The size of Spool 2 is estimated with high confidence to 
     be 678,343,248 rows (55,624,146,336 bytes). The estimated time 
     for this step is 2 minutes and 41 seconds. 
     3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a 
     RowHash match scan, which is joined to MARBEL.b by way of a 
     RowHash match scan. Spool 2 and MARBEL.b are left outer joined 
     using a merge join, with condition(s) used for non-matching on 
     left table ("NOT (access_destino IS NULL)"), with a join condition 
     of ("access_destino = MARBEL.b.Access_Method_Id"). The result 
     goes into Spool 1 (all_amps), which is redistributed by the hash 
     code of (MARBEL.a.id_phone, MARBEL.a.destino) to all AMPs. Then 
     we do a SORT to order Spool 1 by row hash. The result spool file 
     will not be cached in memory. The size of Spool 1 is estimated 
     with index join confidence to be 25,085,452,093 rows (
     2,232,605,236,277 bytes). The estimated time for this step is 1 
     hour and 45 minutes. 
     4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1 
     (Last Use). 
     5) Finally, we send out an END TRANSACTION step to all AMPs involved 
     in processing the request. 
     -> No rows are returned to the user as the result of statement 1. 

EXPLAIN 2

After running:

DIAGNOSTIC HELPSTATS ON FOR SESSION; 
EXPLAIN 
create multiset volatile table graph_total_final as 
(
select a.* , coalesce(b.main_acct_product_id,'NO MOVISTAR') as producto_destino 
from graph_total_3 a 
left join producto b on a.access_destino=b.access_method_id 
) 
with data primary index (id_phone, destino, access_destino) 
on commit preserve rows; 

This query is optimized using type 1 profile bootstrap, profileid -/. 
    1) First, we create the table header. 
    2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an 
    all-rows scan with no residual conditions into Spool 2 (all_amps), 
    which is redistributed by the hash code of (
    MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
    Spool 2 by row hash. The result spool file will not be cached in 
    memory. The size of Spool 2 is estimated with high confidence to 
    be 678,343,248 rows (55,624,146,336 bytes). The estimated time 
    for this step is 2 minutes and 41 seconds. 
    3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a 
    RowHash match scan, which is joined to MARBEL.b by way of a 
    RowHash match scan. Spool 2 and MARBEL.b are left outer joined 
    using a merge join, with condition(s) used for non-matching on 
    left table ("NOT (access_destino IS NULL)"), with a join condition 
    of ("access_destino = MARBEL.b.Access_Method_Id"). The result 
    goes into Spool 1 (all_amps), which is redistributed by the hash 
    code of (MARBEL.a.id_phone, MARBEL.a.destino, 
    MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order 
    Spool 1 by row hash. The result spool file will not be cached in 
    memory. The size of Spool 1 is estimated with index join 
    confidence to be 25,085,452,093 rows (2,232,605,236,277 bytes). 
    The estimated time for this step is 1 hour and 45 minutes. 
    4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1 
    (Last Use). 
    5) Finally, we send out an END TRANSACTION step to all AMPs involved 
    in processing the request. 
    -> No rows are returned to the user as the result of statement 1. 
    BEGIN RECOMMENDED STATS -> 
    6) "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID". 
    (HighConf) 
    7) "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO". 
    (HighConf) 
    <- END RECOMMENDED STATS 
Can you confirm that you have successfully run the "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID" and "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO" statements? – 2014-01-22 13:39:59

Of course I have! It gives the same explain; it's in **EXPLAIN 2**. – marbel

I see. Could you post the 'show table' output for both tables? – 2014-01-22 13:42:09

Answers

These are volatile tables, which means you created them in your current session and you have full control over their definition.

When you change the primary index of MARBEL.graph_total_3 to access_destino, you'll get a direct AMP-local join without any preparation. (You don't need to collect statistics, as that won't change the plan; it only brings the estimated numbers closer to reality.)

The table may be skewed due to the new PI, but when you look at the Explain you'll see that the spool would otherwise get a PI on access_destino anyway.

If MARBEL.producto.Access_Method_Id is actually unique, you should define the PI as unique, too. This will also improve the estimates.
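A minimal sketch of those two changes (the `_pi`/`_upi` table names here are just placeholders; the column definitions come from the SHOW TABLE output above):

```sql
-- Sketch: copy graph_total_3 with the join column as PI, so the join to
-- producto needs no redistribution, and give producto a unique PI.
CREATE MULTISET VOLATILE TABLE graph_total_3_pi AS
 ( SELECT * FROM graph_total_3 )
WITH DATA
PRIMARY INDEX (access_destino)
ON COMMIT PRESERVE ROWS;

CREATE MULTISET VOLATILE TABLE producto_upi AS
 ( SELECT * FROM producto )
WITH DATA
UNIQUE PRIMARY INDEX (Access_Method_Id)
ON COMMIT PRESERVE ROWS;
```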

+1 for the info about volatile tables – 2014-01-22 17:32:12

+1 for the info. So the suggestion is to change the MARBEL.graph_total_3 PI to the one I'll be joining the other table on. The Access_Method_Id field is unique. Thanks! – marbel

Yes, the fastest joins in Teradata are always based on matching primary indexes (and matching partitioning). – dnoeth


Two things strike me as odd right off the bat.

I'd suggest avoiding select a.*, ... unless you really need all the columns from table A. This will reduce the amount of data that needs to be held in spool.

The second thing that looks suspicious is this sentence in step #3: "The size of Spool 1 is estimated with index join confidence to be 25,085,452,093 rows". Are you sure table B is unique on the access_method_id column? If not, you may be inadvertently creating a Cartesian product. (25 billion rows! Really!)

Also, please tell us the demographics of your A & B tables (i.e. the primary indexes, and whether the tables are partitioned).

Update (after seeing more information): The only other thing I can think of (especially if your Teradata environment doesn't have a lot of disk space, particularly spool) is to make sure your data is compressed as well as possible. This saves space (even when the data sits in spool) and reduces the amount of spool space needed.

Here is a compression candidate in table B:

Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('PR_PPAL_AHORRO', 'PR_PPAL_ACTIVA', <continue with list for about the 200 most frequently occurring main acct product ids>). 

By doing this, each 16-byte string can be compressed down to a few bits without adding CPU time.

Similarly, do the same for the following columns in table A:

 operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC compress('A','B', 'C', <other more frequently occurring operador ids>), 
     producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC compress('aa','bb', 'ar', <other more frequently occurring producto ids>), 
     operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC compress('II','LI', 'OJ', <other more frequently occurring operador_destino ids>) 

Consider storing id_phone & destino as INT or BIGINT (if INT isn't big enough). A BIGINT takes 8 bytes, whereas stored as VARCHAR they consume up to 10-12 bytes. When you have 100+ million rows, every byte helps. You can also compress the WEIGHT and DIR columns; for example, if 0.0000 is the most frequently occurring weight/dir value, you can specify COMPRESS (0.0000) and gain space. All COMPRESS clauses must be specified at table creation time.

access_method_id and access_destino appear to be just id_phone with a 'T' prefix. See whether you can strip the first letter and store them as integers. All of this adds up to considerable space savings and will hopefully reduce the spool space needed to execute your query.
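For instance, a sketch of that stripping and casting (assuming the prefix really is always a single 'T' followed by digits):

```sql
-- Sketch: drop the leading 'T' and keep the rest as an 8-byte BIGINT
-- instead of a VARCHAR(22) of up to ~23 bytes.
SELECT CAST(SUBSTRING(access_destino FROM 2) AS BIGINT) AS access_destino_num
FROM graph_total_3;
```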

Finally, I'm not aware of partitioning a query with hashamp/bucket/row (I partition tables, not queries); Teradata should be executing all queries in parallel anyway.

It is unique. I'll add more info. – marbel

@MartínBel, in that case the problem is probably with the A table. Notice from the explain output that table A is being redistributed by 'access_destino'. Do you know if statistics have been collected on this column? If not, collect them and run the explain again. – 2014-01-22 13:29:38

OK. I did that and it doesn't seem to have changed much. – marbel