2016-09-13 37 views

Given a table, how do I find the nearest neighbor in Hive? Is there a window function for this?

$cat data.csv 

ID,State,City,Price,Flag 
1,CA,A,95,0 
2,CA,A,96,1 
3,CA,A,195,1 
4,NY,B,124,0 
5,NY,B,128,1 
6,NY,C,24,0 
7,NY,C,27,1 
8,NY,C,29,0 
9,NY,C,39,1 

Expected result:

ID0, ID1 
1,2 
4,5 
6,7 
8,7 

For each ID above with Flag=0, we want to find another ID with Flag=1 that has the same "State" and "City" and the nearest Price.

I have two rough ideas:

Method 1.

Use a left outer join with the table itself on 
    (a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1), 
    where a.Flag=0 and b.Flag=1, 

    and then use RANK() over (partitioned by a.State,a.City order by a.Price - b.Price) as rank 
    where rank=1 

Method 2.

Use a left outer join with the table itself, 
on 
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1), 
where a.Flag=0 and b.Flag=1, 

and then Use Distribute by a.State,a.City Sort by Price_Diff ASC limit 1 

What is the best way to find nearest neighbors in Hive? Any helpful hints would be greatly appreciated!

Answer

-- Pair every Flag=0 row with every Flag=1 row in the same state and city.
select a.id, b.id, min(abs(b.price-a.price)) as delta 
from data as a 
    inner join data as b 
      on a.state=b.state and 
       a.flag=0 and b.flag=1 and 
       a.city=b.city 
group by a.id, b.id 
order by delta asc; 

This returns

1 2 1 <--- 
8 7 2 <--- 
6 7 3 <--- 
4 5 4 <--- 
8 9 10 
6 9 15 
1 3 100 

The problem is that the last 3 rows reuse IDs that already appear in the first 4.

select a.id as id0, b.id as id1, abs(b.price-a.price) as delta, 
     rank() over (partition by a.state, a.city order by abs(b.price-a.price)) 
from data as a 
     inner join data as b 
      on a.state=b.state and 
      a.flag=0 and b.flag=1 and 
      a.city=b.city; 

This returns

id0 id1 prc rank 
    1 2 1 1 <--- 
    1 3 100 2 
    4 5 4 1 <--- 
    8 7 2 1 <--- 
    6 7 3 2 
    8 9 10 3 
    6 9 15 4 

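We are missing (6,7), and for this query that is in a sense correct: the rank is computed per city partition rather than per a.id, so among (6,7), (6,9), (8,7) and (8,9) the lowest price difference is (8,7). (The join is ambiguous.)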
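One way to get one match per Flag=0 row is to partition the window by a.id instead of by city, so each row ranks only its own candidates. The following is a minimal sketch, not a tested Hive job: it runs the query through Python's built-in sqlite3 module as a local stand-in for Hive (this assumes an SQLite build with window-function support, 3.25 or later). The query uses only a self-join, ABS and ROW_NUMBER() OVER (PARTITION BY ...), which Hive also supports, so it should carry over, but that is an assumption. The table name data, the lowercase column names and the b.id tie-breaker are also assumptions; the question does not say how ties should be resolved.

```python
import sqlite3

# Sample rows from data.csv: (ID, State, City, Price, Flag)
ROWS = [
    (1, "CA", "A", 95, 0),
    (2, "CA", "A", 96, 1),
    (3, "CA", "A", 195, 1),
    (4, "NY", "B", 124, 0),
    (5, "NY", "B", 128, 1),
    (6, "NY", "C", 24, 0),
    (7, "NY", "C", 27, 1),
    (8, "NY", "C", 29, 0),
    (9, "NY", "C", 39, 1),
]

# Partition by a.id so every Flag=0 row ranks only its own Flag=1 candidates.
# b.id breaks ties on equal price difference (an assumption, see above).
NEAREST_SQL = """
SELECT id0, id1
FROM (
    SELECT a.id AS id0,
           b.id AS id1,
           ROW_NUMBER() OVER (
               PARTITION BY a.id
               ORDER BY ABS(b.price - a.price), b.id
           ) AS rn
    FROM data a
    JOIN data b
      ON a.state = b.state
     AND a.city = b.city
    WHERE a.flag = 0
      AND b.flag = 1
) ranked
WHERE rn = 1
ORDER BY id0
"""


def nearest_neighbors(rows):
    """Load rows into an in-memory table and return (id0, id1) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data "
        "(id INTEGER, state TEXT, city TEXT, price INTEGER, flag INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(NEAREST_SQL).fetchall()
    conn.close()
    return result


print(nearest_neighbors(ROWS))  # [(1, 2), (4, 5), (6, 7), (8, 7)]
```

Because this is an inner join, a Flag=0 row with no Flag=1 row in the same state and city is dropped from the result. Using ROW_NUMBER() rather than RANK() guarantees exactly one row per id0 even when two candidates are equally close.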

I think you will like this video on the topic: Big Data Analytics Using Window Functions