Hive/SQL: counting occurrences across multiple columns and rows

I'm looking for a smart way to count occurrences.

Here is an example:

UserID  CityID CountryID TagID 
100000  1   30  5 
100001  1   30  6 
100000  2   20  7 
100000  2   40  8 
100001  1   40  6 
100002  1   40  5 
100002  1   20  6

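What I want to do:

I want to count the distinct values in each column, per user. In the end, I want a table that tells me, for each column, how many users have more than one distinct value.

The result should look like this, more or less: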

Different_CityIDs  Different_CountryIDs  Different_TagIDs
1                  3                     2

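Explanation:

Different_CityIDs: only UserID 100000 has different CityIDs.
Different_CountryIDs: all users have different CountryIDs.
Different_TagIDs: UserIDs 100000 and 100002 have different TagIDs. User 100001 has only "6" as TagID.

I struggled with COUNTs over the columns and GROUP BYs, but in the end it didn't work. Is there a clever solution?

Many thanks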

2017-04-26 Peter

-- Outer query: one conditional count per exploded column position.
select count(case when pos=0 and count_distinct_ID>1 then 1 end) as different_cityid
      ,count(case when pos=1 and count_distinct_ID>1 then 1 end) as different_countryid
      ,count(case when pos=2 and count_distinct_ID>1 then 1 end) as different_tagid

from   (-- Inner query: number of distinct IDs per user and column position
        -- (pos 0 = CityID, 1 = CountryID, 2 = TagID).
        select   pe.pos
                ,count(distinct pe.ID) as count_distinct_ID
        from     mytable t
                 lateral view posexplode(array(CityID,CountryID,TagID)) pe as pos,ID
        group by t.UserID
                ,pe.pos
       ) t
;

+------------------+---------------------+-----------------+
| different_cityid | different_countryid | different_tagid |
+------------------+---------------------+-----------------+
|                1 |                   3 |               2 |
+------------------+---------------------+-----------------+
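For readers who have not used lateral view posexplode before, the following is a rough Python sketch of what the query above appears to do, run on the sample rows from the question. It is only an illustration of the logic, not Hive code; the names ROWS and users_with_multiple_values are made up for this example.

```python
from collections import defaultdict

# Sample rows from the question: (UserID, CityID, CountryID, TagID)
ROWS = [
    (100000, 1, 30, 5),
    (100001, 1, 30, 6),
    (100000, 2, 20, 7),
    (100000, 2, 40, 8),
    (100001, 1, 40, 6),
    (100002, 1, 40, 5),
    (100002, 1, 20, 6),
]


def users_with_multiple_values(rows):
    """Return [different_cityid, different_countryid, different_tagid]."""
    # Mirrors posexplode + GROUP BY UserID, pos:
    # collect the distinct IDs seen for every (user, column position) pair.
    distinct_ids = defaultdict(set)
    for user_id, *ids in rows:
        for pos, value in enumerate(ids):
            distinct_ids[(user_id, pos)].add(value)

    # Mirrors the outer conditional counts: one counter per column position,
    # incremented for each user with more than one distinct ID there.
    counts = [0, 0, 0]
    for (_user_id, pos), values in distinct_ids.items():
        if len(values) > 1:
            counts[pos] += 1
    return counts


print(users_with_multiple_values(ROWS))  # [1, 3, 2]
```

The printed list matches the 1 / 3 / 2 row in the result table above.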

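Here is another variation that avoids count(distinct ...)

-- Note: despite its name, is_distinct_ID is TRUE when min = max, i.e. when the
-- user has a single value at that position; "not is_distinct_ID" therefore
-- selects the users with more than one value.
select count (case when pos=0 and not is_distinct_ID then 1 end) as different_cityid
      ,count (case when pos=1 and not is_distinct_ID then 1 end) as different_countryid
      ,count (case when pos=2 and not is_distinct_ID then 1 end) as different_tagid

from   (select   pe.pos
                ,min(pe.ID)<=>max(pe.ID) as is_distinct_ID
        from     mytable t
                 lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID
        group by t.UserID
                ,pe.pos
       ) t
;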

...and another variation

select count (case when not is_distinct_CityID then 1 end) as different_cityid 
     ,count (case when not is_distinct_CountryID then 1 end) as different_countryid 
     ,count (case when not is_distinct_TagID  then 1 end) as different_tagid 

from (select  min (CityID) <=> max (CityID)  as is_distinct_CityID 
        ,min (CountryID) <=> max (CountryID) as is_distinct_CountryID 
        ,min (TagID)  <=> max (TagID)  as is_distinct_TagID 

     from  mytable 

     group by UserID  
     ) t   
;
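The min/max comparison in the two variations above relies on a simple fact: for non-NULL values, min equals max exactly when there is only one distinct value. A small Python sketch of that equivalence, using the TagIDs from the question (has_multiple_values and tags_by_user are names invented for this illustration):

```python
def has_multiple_values(values):
    """True when a non-empty list of non-NULL values holds 2+ distinct values."""
    # Same idea as `not (min(col) <=> max(col))` in the queries above.
    return min(values) != max(values)


# TagIDs per user, taken from the sample table in the question.
tags_by_user = {100000: [5, 7, 8], 100001: [6, 6], 100002: [5, 6]}

for user_id, tags in tags_by_user.items():
    # Both tests agree: min != max exactly when there are 2+ distinct values.
    assert has_multiple_values(tags) == (len(set(tags)) > 1)

different_tagid = sum(has_multiple_values(tags) for tags in tags_by_user.values())
print(different_tagid)  # 2
```

Comparing min and max needs only two running aggregates per group, which is presumably why it is offered as an alternative to count(distinct ...).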

2017-04-26 08:45:22

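Jesus Christ! Thanks @Dudu! That looks strange, I'll give it a try. I've never seen "lateral view" before. – Peter

You're welcome :-) Check the updated answer, for educational purposes –

What does the "<=>" operator do? From the explanation in the Hive documentation, it isn't entirely clear to me what it does. – invoketheshell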
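As far as the Hive documentation describes it, <=> is a NULL-safe equality operator: it behaves like = for non-NULL operands, returns TRUE when both operands are NULL, and FALSE when exactly one is NULL. The Python sketch below models that behaviour, with None standing in for NULL; sql_eq and null_safe_eq are illustrative names, not Hive functions.

```python
def sql_eq(a, b):
    """Ordinary SQL `=`: the result is NULL (None here) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Hive `<=>`: like `=`, but NULL <=> NULL is TRUE and NULL <=> value is FALSE."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


# Compare the two operators on a few operand pairs.
for a, b in [(1, 1), (1, 2), (1, None), (None, None)]:
    print(a, b, sql_eq(a, b), null_safe_eq(a, b))
```

In the queries above this means min(x) <=> max(x) always yields TRUE or FALSE, never NULL, even for a group whose values are all NULL.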

Use the code below, I think it will help you:

SELECT COUNT(DISTINCT (CountryID)) AS CountryID, 
COUNT(DISTINCT(CityID)) AS CityID, 
COUNT(DISTINCT(TagID)) AS TagID 
FROM test GROUP BY UserID

The result will look like this:

CountryID CityID TagID
3 2 3
2 1 1
2 1 2

Regards, Vinu

2017-04-26 09:47:27

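Hey Vinu, thanks for your comment. Sadly, this is not what I'm looking for. If you check the example above, I want to extract, for each column, the number of users that have different values. Whether a user has two or three different values for a country doesn't matter. I only want to check how many users have more than one value. Hope you understand what I mean. – Peter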

select uid, cid, count(c), count(g)
from (select cid, uid,
             count(coid)  over (partition by cid, uid)   as c,
             count(tagid) over (partition by cid, tagid) as g
      from citydata) e
group by cid, uid;

Here uid = UserID, cid = CityID, coid = CountryID, tagid = TagID.

Total MapReduce CPU Time Spent: 0 msec
OK
uid     cid  coid  tagid
100000  1    1     1
100001  1    2     2
100002  1    2     2
100000  2    2     2
Time taken: 3.865 seconds, Fetched: 4 row(s)

This is based on userid. I hope it helps.

2017-04-26 17:03:42 overflow

The OP was very clear about the requested result –

Thanks.... :) – overflow
