Select the categorical values with the highest counts when grouping

I have the following table:

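What I need is to aggregate by customer ID so that, for each customer, I get the most frequent category (cat), the second most frequent, and the third most frequent. For the table above, the output should be: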

custId  most freq  2nd most freq  3rd most freq
1       B          A              C
2       A          C              Null
3       B          C              Null
4       C          A              Null

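When counts are tied, I don't really care which comes first and which comes second. For example, for customer 1 the second and third most frequent categories could be swapped, because each of them occurs only once.

Any SQL would be fine, preferably Hive SQL.

Thanks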

Source

2017-10-16 criticalth

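Try grouping twice: first count the rows per custId and cat, rank the categories within each customer with dense_rank() ordered by that count, then group by custId again to turn the ranks into columns. I'm not 100% sure, but I think it should work in Hive as well.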

select custId, 
    max(case when t.rn = 1 then cat end) as [most freq], 
    max(case when t.rn = 2 then cat end) as [2nd most freq], 
    max(case when t.rn = 3 then cat end) as [3rd most freq] 
from 
(
    select custId, cat, dense_rank() over (partition by custId order by count(*) desc) rn 
    from your_table 
    group by custId, cat 
) t 
group by custId

demo

Following the comments, here is a slightly modified version for Hive SQL:

select custId,
    -- pivot the ranked categories into one column per rank
    max(case when t.rn = 1 then cat else null end) as most_freq,
    max(case when t.rn = 2 then cat else null end) as 2nd_most_freq,
    max(case when t.rn = 3 then cat else null end) as 3rd_most_freq
from
(
    -- rank each category within a customer by its count
    select custId, cat, dense_rank() over (partition by custId order by ct desc) rn
    from (
        -- count rows per customer and category
        select custId, cat, count(*) ct
        from your_table
        group by custId, cat
    ) your_table_with_counts
) t
group by custId

Hive SQL demo
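To see what the nested query returns, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite stands in for Hive here purely as an assumption for local testing, and Hive may differ in details. The sample rows are hypothetical (the question's original table is not shown), and the aliases second_most_freq and third_most_freq are renamed for this sketch because SQLite does not accept unquoted identifiers that start with a digit.

```python
import sqlite3

# Hypothetical sample rows (custId, cat); the question's table is not shown.
ROWS = [
    (1, "B"), (1, "B"), (1, "A"), (1, "C"),
    (2, "A"), (2, "A"), (2, "C"),
    (3, "B"), (3, "B"), (3, "C"),
    (4, "C"), (4, "C"), (4, "A"),
]

# Same shape as the Hive query above, with SQLite-friendly aliases.
QUERY = """
select custId,
    max(case when rn = 1 then cat end) as most_freq,
    max(case when rn = 2 then cat end) as second_most_freq,
    max(case when rn = 3 then cat end) as third_most_freq
from (
    select custId, cat,
        dense_rank() over (partition by custId order by ct desc) as rn
    from (
        select custId, cat, count(*) as ct
        from your_table
        group by custId, cat
    ) your_table_with_counts
) t
group by custId
order by custId
"""


def top_categories(rows):
    """Load rows into an in-memory table and return one tuple per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (custId integer, cat text)")
    conn.executemany("insert into your_table values (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_categories(ROWS):
        print(row)
```

With these rows, customers 2 to 4 come out as (2, 'A', 'C', None), (3, 'B', 'C', None) and (4, 'C', 'A', None). Customer 1 comes out as (1, 'B', 'C', None): A and C are tied, so dense_rank() gives both rank 2, max() keeps one of them, and the third column stays empty.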

Source

2017-10-16 13:30:15

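Use 'dense_rank' instead of 'row_number', so that ties, if they exist, do not show up as the 2nd and 3rd most frequent values.

@VamsiPrabhala Yes, thanks.

Also remove the '[]' around the column aliases, as they are not supported in Hive.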
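The difference between the two ranking functions discussed in the comments can be sketched on a single customer with a tie. As before, this runs the SQL through Python's sqlite3 module as a stand-in for Hive (an assumption for local testing). The rows are hypothetical, and the extra "cat" tie-breaker in the row_number() ordering is an addition made here only so the result is deterministic.

```python
import sqlite3

# Hypothetical rows for one customer: B occurs twice, A and C once each (a tie).
ROWS = [(1, "B"), (1, "B"), (1, "A"), (1, "C")]

# Tied categories share a rank under dense_rank().
DENSE_RANK = "dense_rank() over (partition by custId order by ct desc)"
# row_number() gives every category its own rank; ordering by cat as a
# tie-breaker is an assumption of this sketch, not part of the answer.
ROW_NUMBER = "row_number() over (partition by custId order by ct desc, cat)"

QUERY_TEMPLATE = """
select custId,
    max(case when rn = 1 then cat end) as most_freq,
    max(case when rn = 2 then cat end) as second_most_freq,
    max(case when rn = 3 then cat end) as third_most_freq
from (
    select custId, cat, {rank_expr} as rn
    from (
        select custId, cat, count(*) as ct
        from your_table
        group by custId, cat
    ) your_table_with_counts
) t
group by custId
order by custId
"""


def top3(rows, rank_expr):
    """Run the pivot query with the given ranking expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (custId integer, cat text)")
    conn.executemany("insert into your_table values (?, ?)", rows)
    result = conn.execute(QUERY_TEMPLATE.format(rank_expr=rank_expr)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(top3(ROWS, DENSE_RANK))  # [(1, 'B', 'C', None)]
    print(top3(ROWS, ROW_NUMBER))  # [(1, 'B', 'A', 'C')]
```

With dense_rank() the tied categories collapse into one slot, which is what the comment describes. The row_number() variant fills the 2nd and 3rd columns with the tied categories, which appears closer to the output shown in the question for customer 1. Which one is preferable depends on how ties should be reported.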
