2017-04-26
2

Hive/SQL: counting occurrences across multiple columns and rows

I am looking for a smart way to count occurrences.

Here is an example:

UserID  CityID CountryID TagID 
100000  1   30  5 
100001  1   30  6 
100000  2   20  7 
100000  2   40  8 
100001  1   40  6 
100002  1   40  5 
100002  1   20  6 
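
To make the example reproducible, the rows above could be loaded roughly as follows. This is only a sketch: the table name mytable (chosen to match the first answer below), the INT column types, and a Hive version that supports INSERT ... VALUES (0.14 or later) are all assumptions.

```sql
-- Hypothetical setup: table name and column types are assumptions.
create table mytable (UserID int, CityID int, CountryID int, TagID int);

-- The seven sample rows from the question.
insert into table mytable values
    (100000, 1, 30, 5),
    (100001, 1, 30, 6),
    (100000, 2, 20, 7),
    (100000, 2, 40, 8),
    (100001, 1, 40, 6),
    (100002, 1, 40, 5),
    (100002, 1, 20, 6);
```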

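What I want to do:

For each user, I want to check column by column whether more than one distinct value occurs. In the end, I want a table that tells me how many users have more than one distinct value in each column.

The result should look more or less like this: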

Different_CityIDs  Different_CountryIDs  Different_TagIDs
1                  3                     2

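Explanation:

  • Different_CityIDs: only UserID 100000 has different CityIDs.
  • Different_CountryIDs: all users have different CountryIDs.
  • Different_TagIDs: UserID 100000 and 100002 have different TagIDs. User 100001 only has "6" as TagID.

I struggled with COUNTs over the columns and GROUP BYs, but in the end I could not get it to work. Is there a clever solution?

Thanks a lot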

Answers

1
select count(case when pos=0 and count_distinct_ID>1 then 1 end) as different_cityid 
     ,count(case when pos=1 and count_distinct_ID>1 then 1 end) as different_countryid 
     ,count(case when pos=2 and count_distinct_ID>1 then 1 end) as different_tagid 

from (select  pe.pos 
        ,count (distinct pe.ID) as count_distinct_ID 
     from  mytable t 
        lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID 

     group by t.UserID 
        ,pe.pos   
     ) t   
; 

+------------------+---------------------+-----------------+ 
| different_cityid | different_countryid | different_tagid | 
+------------------+---------------------+-----------------+ 
|    1 |     3 |    2 | 
+------------------+---------------------+-----------------+ 
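
To see why this works, it may help to run the inner query on its own. The sketch below assumes the same mytable filled with the seven sample rows from the question; posexplode turns every input row into three (pos, ID) rows, one per column, so a single GROUP BY can count distinct values per user and per column.

```sql
-- pos = 0 -> CityID, pos = 1 -> CountryID, pos = 2 -> TagID
select  t.UserID
       ,pe.pos
       ,count(distinct pe.ID) as count_distinct_ID
from    mytable t
        lateral view posexplode(array(CityID,CountryID,TagID)) pe as pos,ID
group by t.UserID
        ,pe.pos
;
-- For the sample data (row order may differ):
-- UserID  pos  count_distinct_ID
-- 100000  0    2
-- 100000  1    3
-- 100000  2    3
-- 100001  0    1
-- 100001  1    2
-- 100001  2    1
-- 100002  0    1
-- 100002  1    2
-- 100002  2    2
```

The outer query then simply counts, for each pos, the rows whose count_distinct_ID is greater than 1, which gives 1, 3 and 2.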

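Here is another variation that avoids count(distinct ...). Note that is_distinct_ID is true when a user's minimum and maximum are equal, i.e. when the user has at most one distinct value in that column: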

select count (case when pos=0 and not is_distinct_ID then 1 end) as different_cityid 
     ,count (case when pos=1 and not is_distinct_ID then 1 end) as different_countryid 
     ,count (case when pos=2 and not is_distinct_ID then 1 end) as different_tagid 

from (select  pe.pos 
        ,min(pe.ID)<=>max(pe.ID) as is_distinct_ID 
     from  mytable t 
        lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID 

     group by t.UserID 
        ,pe.pos   
     ) t   
; 

...and another variation:

select count (case when not is_distinct_CityID then 1 end) as different_cityid 
     ,count (case when not is_distinct_CountryID then 1 end) as different_countryid 
     ,count (case when not is_distinct_TagID  then 1 end) as different_tagid 

from (select  min (CityID) <=> max (CityID)  as is_distinct_CityID 
        ,min (CountryID) <=> max (CountryID) as is_distinct_CountryID 
        ,min (TagID)  <=> max (TagID)  as is_distinct_TagID 

     from  mytable 

     group by UserID  
     ) t   
; 
+0

Jesus Christ! Thanks @Dudu! This looks strange, I will try it. I have never seen "lateral view" before. – Peter

+0

You're welcome :-) Check the updated answer, added for educational purposes –

+0

What does the "<=>" operator do? From the explanation in the Hive documentation, its function is not entirely clear to me. – invoketheshell
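
As a possible illustration for the question above: in Hive, <=> is the NULL-safe equality operator. It behaves like = for non-NULL operands, but returns true when both sides are NULL and false when exactly one side is NULL, where = would return NULL. A minimal sketch, assuming a Hive version that allows SELECT without a FROM clause (the column aliases are made up for illustration):

```sql
-- Expected values, in column order: true, false, true, false.
-- With plain = the last two columns would be NULL instead.
select  1    <=> 1     as both_equal
       ,1    <=> 2     as both_different
       ,null <=> null  as both_null
       ,1    <=> null  as one_null
;
```

In the answer, min(x) <=> max(x) therefore always yields true or false, even for a user whose values in that column are all NULL.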

1

Use the code below; I think it will help you:

SELECT COUNT(DISTINCT CityID) AS CityID,
       COUNT(DISTINCT CountryID) AS CountryID,
       COUNT(DISTINCT TagID) AS TagID
FROM test
GROUP BY UserID

The result will look like this (one row per user):

CityID CountryID TagID
2 3 3
1 2 1
1 2 2

Regards, Vinu

+0

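Hey Vinu, thanks for your comment. Sadly, this is not what I am looking for. If you check the example above, I want to extract, for each column, the number of users that have different characteristics. It does not matter whether a user has two or three different values for a country. I only want to check how many users have more than one. I hope you see what I mean. – Peter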
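
For what it is worth, the per-user counts from this answer can be turned into the requested single row by wrapping them in an outer query. This is only a sketch: it assumes the table test holds the sample data from the question, and the aliases city_cnt, country_cnt and tag_cnt are made up for illustration.

```sql
-- Inner query: distinct values per user and column.
-- Outer query: number of users with more than one distinct value.
select  count(case when city_cnt    > 1 then 1 end) as different_cityid
       ,count(case when country_cnt > 1 then 1 end) as different_countryid
       ,count(case when tag_cnt     > 1 then 1 end) as different_tagid
from   (select  count(distinct CityID)    as city_cnt
               ,count(distinct CountryID) as country_cnt
               ,count(distinct TagID)     as tag_cnt
        from    test
        group by UserID
       ) t
;
-- For the sample data in the question this returns 1, 3, 2.
```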

1

select uid, cid, count(c), count(g)
from  (select cid, uid
             ,count(coid)  over (partition by cid, uid)   as c
             ,count(tagid) over (partition by cid, tagid) as g
       from   citydata
      ) e
group by cid, uid;

Here uid = userid, cid = cityid, coid = countryid, and tagid is the tag ID.

Total MapReduce CPU Time Spent: 0 msec
OK
uid     cid  coid  tagid
100000  1    1     1
100001  1    2     2
100002  1    2     2
100000  2    2     2
Time taken: 3.865 seconds, Fetched: 4 row(s)

This is grouped by userid. I hope it helps.

+0

The OP was very clear about the requested result –

+0

Thanks.... :) – overflow