2017-07-28
0

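Select the n largest counts per category in Redshift

I want to select the X most frequent pairs for each group in a table. Consider the following table: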

+-------------+-----------+ 
| identifier | city | 
+-------------+-----------+ 
| AB   | Seattle | 
| AC   | Seattle | 
| AC   | Seattle | 
| AB   | Seattle | 
| AD   | Seattle | 
| AB   | Chicago | 
| AB   | Chicago | 
| AD   | Chicago | 
| AD   | Chicago | 
| BC   | Chicago | 
+-------------+-----------+ 
  • In Seattle, AB occurs 2x
  • In Seattle, AC occurs 2x
  • In Seattle, AD occurs 1x
  • In Chicago, AB occurs 2x
  • In Chicago, AD occurs 2x
  • In Chicago, BC occurs 1x
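
The counts listed above can be reproduced with a plain aggregate. This is only a minimal sketch: the question does not name the table, so `tbl` (the name used in the first answer) is an assumption, and `pair_count` is an illustrative alias.

```sql
-- Count how often each (city, identifier) pair occurs.
-- `tbl` is an assumed table name; the question does not give one.
select city,
       identifier,
       count(*) as pair_count
from tbl
group by city, identifier
order by city, pair_count desc, identifier;
```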

If I want to select the 2 most frequent identifiers for each city, the result should be:

+-------------+-----------+ 
| identifier | city | 
+-------------+-----------+ 
| AB   | Seattle | 
| AC   | Seattle | 
| AB   | Chicago | 
| AD   | Chicago | 
+-------------+-----------+ 

Any help is appreciated. Thanks, Benni

+0

Possible duplicate of [Get top n records for each group of grouped results](https://stackoverflow.com/questions/12113699/get-top-n-records-for-each-group-of-grouped-results) – mato

Answers

1

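You can use count(*) inside row_number() to order each city/identifier combination by its number of occurrences, and then select the top two per city.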

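select city, identifier
from (
    -- count(*) is computed per (city, identifier) group before the window function runs,
    -- so each city's pairs are numbered from most to least frequent (ties broken by identifier)
    select city, identifier,
           row_number() over (partition by city order by count(*) desc, identifier) as rnum_cnt
    from tbl
    group by city, identifier
) t
where rnum_cnt <= 2  -- keep the two most frequent pairs per city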
+0

You can't use `count(*)` inside the partition, at least in Redshift; the count should be done in a subquery – AlexYes

+0

@AlexYes It looks like you can. The query in the answer gave me the correct result. Also, the [documentation](http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html) says that expressions are allowed in the order list. –

+0

@DmitriiI. Interesting! When I read the documentation I was only thinking of scalar expressions; I didn't know Redshift was that smart :) Thanks! – AlexYes
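
For comparison, the alternative raised in the comments (counting in a subquery first, then ranking in an outer query) might look like the sketch below. It assumes the same `tbl` table as the answer; the names `pair_cnt`, `counted`, and `ranked` are made up for illustration.

```sql
select city, identifier
from (
    -- Step 2: number the pairs within each city, most frequent first.
    select city,
           identifier,
           row_number() over (partition by city
                              order by pair_cnt desc, identifier) as rnum_cnt
    from (
        -- Step 1: one row per (city, identifier) pair with its count.
        select city, identifier, count(*) as pair_cnt
        from tbl
        group by city, identifier
    ) counted
) ranked
where rnum_cnt <= 2;
```

Both forms should give the same rows; the nested version is more verbose but keeps the aggregation and the ranking in separate steps.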

0

Use a WITH clause:

with 
    _counts as (
     select 
      identifier, 
      city, 
      count(*) as city_id_count 
     from 
      t1 
     group by 
      identifier, 
      city 
    ), 

    _counts_and_max as (
     select 
      identifier, 
      city, 
      city_id_count, 
      max(city_id_count) over (partition by city) as city_max_count 
     from 
      _counts 
    ) 

    select 
     identifier, 
     city 
    from 
     _counts_and_max 
    where 
     city_id_count = city_max_count 
    ;
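
Note that the query above keeps only the pairs whose count equals the city's maximum. That matches the expected output for the sample data because the top two pairs in each city are tied at 2, but it is not a general top-n. A hedged sketch of a top-2 variant in the same WITH style follows; it reuses the answer's `t1` and `_counts` names, while `_ranked` and `rn` are illustrative names that do not appear in the original.

```sql
with
    _counts as (
        -- One row per (identifier, city) pair with its count.
        select identifier, city, count(*) as city_id_count
        from t1
        group by identifier, city
    ),
    _ranked as (
        -- Number the pairs within each city, most frequent first;
        -- ties are broken alphabetically by identifier.
        select identifier,
               city,
               row_number() over (partition by city
                                  order by city_id_count desc, identifier) as rn
        from _counts
    )
select identifier, city
from _ranked
where rn <= 2;
```

Using rank() ordered by city_id_count desc alone, instead of row_number(), would keep ties, so a city could return more than two rows.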