2014-01-07

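Aggregating weekly data in Hive

I want to aggregate account counts by week over the past 3 months, using the criteria specified in the query below. What is the most efficient way to get this data into a table with num_of_accounts and week as columns?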

SELECT COUNT(DISTINCT a.account_id) AS num_accounts,
       WEEKOFYEAR(a.ds) AS week
FROM
    (SELECT
        CAST(account_id AS BIGINT) AS account_id,  -- alias so a.account_id resolves
        ds                                         -- needed by WEEKOFYEAR(a.ds)
    FROM
        tableA
    WHERE ds = '2013-12-28') a
JOIN
    tableB b
ON a.account_id = b.account_id AND
    b.ds = '2013-12-28'
WHERE
    b.invoice_date BETWEEN '2013-12-22' AND '2013-12-28' AND
    -- a row cannot be both 'failed' and 'unbilled', so match either status
    b.payment_status IN ('failed', 'unbilled')
GROUP BY WEEKOFYEAR(a.ds)

Answer

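You are trying to do a distinct count over a large set. A scalable approach is to use a probabilistic data structure such as HyperLogLog or a KMV sketch set, like those provided in Brickhouse (http://github.com/klout/brickhouse). There is a blog post describing a situation much like yours at http://brickhouseconfessions.wordpress.com/2013/12/11/using-sketch_set-for-reach-estimation/. This should give you a fairly close estimate without having to re-sort all of your data.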
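
The Brickhouse functions have to be registered in the Hive session before they can be called. Here is a minimal sketch of that step. The jar path is hypothetical, and the class names are an assumption based on the brickhouse.hql script shipped in the repository, so check them against the version you build:

```sql
-- Hypothetical location of the locally built Brickhouse jar.
ADD JAR /path/to/brickhouse.jar;

-- Class names assumed from the repository's brickhouse.hql; verify before use.
CREATE TEMPORARY FUNCTION sketch_set AS 'brickhouse.udf.sketch.SketchSetUDAF';
CREATE TEMPORARY FUNCTION estimated_reach AS 'brickhouse.udf.sketch.EstimatedReachUDF';
```

Temporary functions last only for the current session, so these statements would normally go at the top of the script that runs the aggregation.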

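If I understand you correctly, you just want to aggregate by week. Hive has a UDF, WEEKOFYEAR, which returns the week number from a date string. Just use the sketch_set UDAF from Brickhouse: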

SELECT WEEKOFYEAR(ds), estimated_reach(sketch_set(account_id)) as num_account_est 
    FROM myquery 
GROUP BY WEEKOFYEAR(ds); 

Here, myquery is a view representing the business logic you expressed above.
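
For illustration, the myquery view could look something like the sketch below. This is only a guess at the intended business logic, built from the table and column names in the question. It assumes that the weekly buckets should follow invoice_date (exposed here as ds), that the two payment statuses are alternatives, and that "the past 3 months" means the 13 weeks ending 2013-12-28:

```sql
-- Hypothetical view; adjust the filters to the real business rules.
CREATE VIEW IF NOT EXISTS myquery AS
SELECT
    CAST(a.account_id AS BIGINT) AS account_id,
    b.invoice_date               AS ds   -- assumption: bucket weeks by invoice date
FROM tableA a
JOIN tableB b
  ON CAST(a.account_id AS BIGINT) = b.account_id
WHERE a.ds = '2013-12-28'                -- snapshot partition used in the question
  AND b.ds = '2013-12-28'
  AND b.invoice_date BETWEEN '2013-09-28' AND '2013-12-28'  -- assumed 13-week window
  AND b.payment_status IN ('failed', 'unbilled');
```

Keeping the filters in a view leaves the aggregation query short, and the same view can feed an exact COUNT(DISTINCT account_id) if you want to compare it with the estimate. Note that WEEKOFYEAR returns only the week number, so a window that crosses a year boundary would also need the year in the GROUP BY.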