2014-09-06 90 views

Get percentage from count in Hive

I have a table as follows:

COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110620   Technology and Computing  Internet Technology
e-12  1101  201408110705   Technology and Computing  Antivirus Software
e-12  1107  201408110510   Business                  Advertising
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1109  201408110505   Technology and Computing  Web Search

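Ignoring COL1 (since its values are all the same), each COL2 has several combinations of the remaining fields. I managed to count how many times each combination repeats, which produces the following: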

COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2            COUNT
e-12  1101  201408110525   Arts and Entertainment    Television           3
e-12  1101  201408110620   Technology and Computing  Internet Technology  1
e-12  1101  201408110705   Technology and Computing  Antivirus Software   1
e-12  1107  201408110510   Business                  Advertising          1
e-12  1107  201408110520   Business                  Marketing            7
e-12  1109  201408110505   Technology and Computing  Web Search           1

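How do I convert each count into a percentage of all combinations for its COL2?

I'm sorry I can't phrase this better, but the output should look like this: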

COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2            COUNT  PERCENTAGE
e-12  1101  201408110525   Arts and Entertainment    Television           3      60%
e-12  1101  201408110620   Technology and Computing  Internet Technology  1      20%
e-12  1101  201408110705   Technology and Computing  Antivirus Software   1      20%
e-12  1107  201408110510   Business                  Advertising          1      12.5%
e-12  1107  201408110520   Business                  Marketing            7      87.5%
e-12  1109  201408110505   Technology and Computing  Web Search           1      100%

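Note: at this point, the COUNT column is no longer needed.

Is this even possible in Hive? How can I modify my count query (below) to output the last table?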

SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*)
FROM temp_table
GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1
SORT BY COL2;

Thanks.


You can count COL2 and the category groups separately using two SELECT statements, and then combine them in the main SELECT statement – 2014-09-06 18:30:17

Answers

1

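I can think of a couple of ways to do this. You can compute the denominator of your percentage (the total row count per col2), join it back to the original data, and then divide each combination's count by that total. Alternatively, if you have access to windowing functions in Hive (I believe they are available from 0.11 onward), you can use an OVER (PARTITION BY ...) clause in the SELECT to avoid the join described in the first approach.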

#1:

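-- Approach 1: compute the per-col2 total in a subquery and join it back.
select col2, cat1, cat2, datetimestamp
    ,(COUNT(cat2)/MAX(total_)) as perc  -- fraction of the col2 total; multiply by 100 for a percentage
from (
    select n.col2, cat1, cat2, datetimestamp, x.total_
    from some_table as n
    JOIN (
        -- denominator: number of rows for each col2
        select col2, COUNT(col2) as total_
        from some_table
        group by col2
        ) x
    ON x.col2 = n.col2
    ) y
group by cat1, cat2, col2, datetimestamp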
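
As a sanity check on approach #1, here is a minimal sketch that runs the same join-then-group query against the question's sample rows. It uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made purely so the example is self-contained. The table and column names (`some_table`, `cat1`, `cat2`, `total_`) follow the answer's placeholders. One difference to note: SQLite divides integers as integers, so the sketch multiplies by `100.0` first, whereas Hive's `/` operator already returns a double.

```python
import sqlite3

# Sample rows from the question: (col2, datetimestamp, cat1, cat2).
rows = (
    [("1101", "201408110525", "Arts and Entertainment", "Television")] * 3
    + [
        ("1101", "201408110620", "Technology and Computing", "Internet Technology"),
        ("1101", "201408110705", "Technology and Computing", "Antivirus Software"),
        ("1107", "201408110510", "Business", "Advertising"),
    ]
    + [("1107", "201408110520", "Business", "Marketing")] * 7
    + [("1109", "201408110505", "Technology and Computing", "Web Search")]
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE some_table (col2 TEXT, datetimestamp TEXT, cat1 TEXT, cat2 TEXT)"
)
conn.executemany("INSERT INTO some_table VALUES (?, ?, ?, ?)", rows)

# Approach #1: join every row to its per-col2 total, then group by combination.
# 100.0 forces floating-point division in SQLite (not needed in Hive).
JOIN_QUERY = """
SELECT n.col2, n.datetimestamp, n.cat1, n.cat2,
       COUNT(*) * 100.0 / MAX(x.total_) AS perc
FROM some_table AS n
JOIN (SELECT col2, COUNT(*) AS total_ FROM some_table GROUP BY col2) AS x
  ON x.col2 = n.col2
GROUP BY n.col2, n.datetimestamp, n.cat1, n.cat2
ORDER BY n.col2, n.datetimestamp
"""

join_result = conn.execute(JOIN_QUERY).fetchall()
for row in join_result:
    print(row)
```

The `perc` values come out in the same order as the rows of the desired output table in the question, so the two can be compared directly.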

#2:

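-- Approach 2: a window function supplies the per-col2 total without a join.
select col2, cat1, cat2, datetimestamp
    ,(COUNT(col2)/MAX(total)) as perc  -- fraction of the col2 total; multiply by 100 for a percentage
from (
    -- datetimestamp must be selected here so the outer query can group by it
    select col2, cat1, cat2, datetimestamp
        ,COUNT(cat1) OVER (PARTITION BY col2) as total
    from some_table
    ) x
group by cat1, cat2, col2, datetimestamp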

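I used sample #2. I ran into a problem with 'datetimestamp', so I added it to the inner select statement. I also multiplied 'perc' by 100 so that it looks more like a percentage. Will my edits affect the accuracy? I tested your code on the sample data above - so far, so good. – 2014-09-07 06:37:16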
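
On the accuracy question in the comment above: selecting `datetimestamp` in the inner query does not change the per-col2 totals, because the window is still partitioned by `col2` alone, and multiplying by 100 only rescales the ratio. The sketch below checks this on the question's sample data. It is only an illustration under some assumptions: it uses Python's sqlite3 module in place of Hive (window functions need SQLite 3.25 or newer), the names `some_table`, `cat1`, `cat2` and `total` follow the answer's placeholders, and the `%` formatting at the end is my own addition rather than part of the answer.

```python
import sqlite3

# Sample rows from the question: (col2, datetimestamp, cat1, cat2).
rows = (
    [("1101", "201408110525", "Arts and Entertainment", "Television")] * 3
    + [
        ("1101", "201408110620", "Technology and Computing", "Internet Technology"),
        ("1101", "201408110705", "Technology and Computing", "Antivirus Software"),
        ("1107", "201408110510", "Business", "Advertising"),
    ]
    + [("1107", "201408110520", "Business", "Marketing")] * 7
    + [("1109", "201408110505", "Technology and Computing", "Web Search")]
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE some_table (col2 TEXT, datetimestamp TEXT, cat1 TEXT, cat2 TEXT)"
)
conn.executemany("INSERT INTO some_table VALUES (?, ?, ?, ?)", rows)

# Approach #2 with the commenter's two edits: datetimestamp is selected in the
# inner query, and the ratio is scaled by 100. The 100.0 literal also forces
# floating-point division in SQLite (Hive's "/" already returns a double).
WINDOW_QUERY = """
SELECT col2, datetimestamp, cat1, cat2,
       COUNT(*) * 100.0 / MAX(total) AS perc
FROM (
    SELECT col2, datetimestamp, cat1, cat2,
           COUNT(*) OVER (PARTITION BY col2) AS total
    FROM some_table
) AS x
GROUP BY col2, datetimestamp, cat1, cat2
ORDER BY col2, datetimestamp
"""

window_result = conn.execute(WINDOW_QUERY).fetchall()

# Format each value the way the question's desired output shows it.
labels = [f"{perc:g}%" for (_, _, _, _, perc) in window_result]
print(labels)
```

The labels line up with the PERCENTAGE column of the desired output table, in the same row order.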