2016-12-22 60 views
2

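SQL Server record linkage after string matching

Background - I have a set of customer data and have used a string matching algorithm to compare all records for similarity. I now need to group together the results that are related to each other, either directly or by association, and apply a unique ID to each group.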

Problem - I can't think of a way to link the records together and apply a unique ID to each group.

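The data currently looks like this, where matches have been found (MatchScore is not relevant to the problem here; it is only included to show where the data comes from).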

+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
|     2021000 |     2707799 |      0.075 |
|     2021000 |     3856308 |      0.082 |
|      774062 |      774063 |      0.041 |
|      998328 |     2278386 |      0.063 |
|      998328 |      998329 |      0.058 |
|      998329 |     2278386 |      0.030 |
+-------------+-------------+------------+

The bottom 3 records are all linked, so I would like them to share the same ID.

[Image: visual of these records all being related]

This is what I would like the data to look like:

+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
|  1 |      998328 |     2278386 |      0.063 |
|  1 |      998328 |      998329 |      0.058 |
|  1 |      998329 |     2278386 |      0.030 |
|  2 |     2021000 |     2707799 |      0.075 |
|  2 |     2021000 |     3856308 |      0.082 |
|  3 |      774062 |      774063 |      0.041 |
+----+-------------+-------------+------------+

or something like:

+----+------------+
| ID | CustomerID |
+----+------------+
|  1 |    2278386 |
|  1 |     998328 |
|  1 |     998329 |
|  2 |    2021000 |
|  2 |    2707799 |
|  2 |    3856308 |
|  3 |     774062 |
|  3 |     774063 |
+----+------------+

Code to generate the sample table:

select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore 
into #tmp 
union 
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore 
union 
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore 
union 
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore 
union 
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore 
union 
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore 

select * from #tmp 

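As I said, I can't work out how to link the records together. I have tried various joins, but the eureka moment never came. Please can you help.

Thanks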

+3

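What do you mean by the bottom 3 records? Are they linked simply because CustomerID1 is listed with multiple CustomerID2 values? Why do 'CustomerID1' 998328 and 998329 end up with the same 'ID' value? – Taryn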

+0

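It's because the 3 separate records mean that customers 998328 and 2278386 match, 998328 and 998329 match, and 998329 and 2278386 match. So all 3 are shown to match each other, and they therefore get the same ID. – DataPro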
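The rule described in this comment treats each row as an undirected link: it does not matter which customer is in CustomerID1 and which is in CustomerID2. A minimal sketch of that idea, assuming the #tmp table from the question exists, lists every customer next to each customer it matches directly. The output column names CustomerID and Neighbour are hypothetical and chosen only for illustration.

```sql
-- Assumes #tmp from the question has already been created.
-- Each match row is emitted in both directions so that a customer's
-- direct matches can be read off regardless of column order.
SELECT CustomerID1 AS CustomerID, CustomerID2 AS Neighbour FROM #tmp
UNION
SELECT CustomerID2, CustomerID1 FROM #tmp
ORDER BY CustomerID, Neighbour;
```

Customers that can reach each other by following these links, directly or through other customers, belong to the same group.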

Answers

1

I'm not sure whether this is the result you expect:

with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore 
union 
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore 
union 
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore 
union 
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore 
union 
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore 
union 
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore 
union 
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore 
union 
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore) 


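-- Number the groups by the representative CustomerID picked in rank_value
select DENSE_RANK() OVER(ORDER BY rank_value) id, t1.CustomerID1, t1.CustomerID2
from(
    select
        t1.*,
        -- Use CustomerID1 of a related row (t2) if one exists,
        -- otherwise CustomerID1 of the reversed pair (t3)
        case
            when t2.CustomerID1 IS NOT NULL
                THEN t2.CustomerID1
            ELSE t3.CustomerID1
        end rank_value
    from tmp t1
    -- t2: another row in which t1.CustomerID1 appears,
    -- excluding the reversed pair and self-pairs
    left join tmp t2
        on (t1.CustomerID1 = t2.CustomerID2
            and t1.CustomerID2 != t2.CustomerID1
            and (t1.CustomerID1 != t1.CustomerID2 and t2.CustomerID1 != t2.CustomerID2))
        or (t1.CustomerID1 = t2.CustomerID1
            and t1.CustomerID2 != t2.CustomerID2
            and (t1.CustomerID1 != t1.CustomerID2))
    -- t3: the same pair listed in reverse order
    left join tmp t3
        on t1.CustomerID1 = t3.CustomerID2
        and t1.CustomerID2 = t3.CustomerID1
)t1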

I get the following result:

[Image: query result]

Note: the DENSE_RANK() function is available from SQL Server 2005 onward.

+0

Nice approach, but I think it is a little buggy: if you add another record to your tmp, "select '774063' as CustomerID1, '774062' as CustomerID2, '0.041' as MatchScore" (or 774063 as both ID1 and ID2), the IDs get messed up... – Tyron78

+0

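What Tyron78 says is true: this approach works for this example, but a small change in the data gives wrong results. I'm not convinced there is a good set-based approach, but if I find one I will post it back here. – DataPro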
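One way to handle arbitrary chains of links is to treat the problem as finding connected components and to repeat a set-based update until nothing changes. The following is only a sketch of that idea (minimum-label propagation), not the answerer's method. It assumes the #tmp table from the question exists; the names #nodes, GroupKey and @changed are hypothetical and introduced purely for illustration.

```sql
-- Assumes #tmp from the question has already been created.
IF OBJECT_ID('tempdb..#nodes') IS NOT NULL DROP TABLE #nodes;

-- One row per distinct customer; each customer starts as its own group.
SELECT c.CustomerID, c.CustomerID AS GroupKey
INTO #nodes
FROM (
    SELECT CustomerID1 AS CustomerID FROM #tmp
    UNION
    SELECT CustomerID2 FROM #tmp
) AS c;

-- Repeatedly give every customer the smallest GroupKey found among
-- its direct matches, until a pass changes no rows.
DECLARE @changed int = 1;
WHILE @changed > 0
BEGIN
    UPDATE n
    SET GroupKey = m.MinKey
    FROM #nodes AS n
    JOIN (
        SELECT e.a AS CustomerID, MIN(nb.GroupKey) AS MinKey
        FROM (
            -- Each match in both directions, so column order is irrelevant
            SELECT CustomerID1 AS a, CustomerID2 AS b FROM #tmp
            UNION ALL
            SELECT CustomerID2, CustomerID1 FROM #tmp
        ) AS e
        JOIN #nodes AS nb ON nb.CustomerID = e.b
        GROUP BY e.a
    ) AS m ON m.CustomerID = n.CustomerID
    WHERE m.MinKey < n.GroupKey;

    SET @changed = @@ROWCOUNT;
END;

-- Second layout from the question: one row per customer
SELECT DENSE_RANK() OVER (ORDER BY GroupKey) AS ID, CustomerID
FROM #nodes
ORDER BY ID, CustomerID;

-- First layout from the question: one row per match
SELECT DENSE_RANK() OVER (ORDER BY n.GroupKey) AS ID,
       t.CustomerID1, t.CustomerID2, t.MatchScore
FROM #tmp AS t
JOIN #nodes AS n ON n.CustomerID = t.CustomerID1
ORDER BY ID;
```

Because a GroupKey only ever decreases, the loop stops once every customer in a linked group carries that group's smallest key. Reversed pairs and rows where both IDs are the same do not affect the grouping. The ID numbers are arbitrary: since the sample IDs are stored as strings, the keys are compared as strings, so the numbering will not necessarily match the order shown in the question.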

+0

@Tyron78 That really is a great find. I have modified my answer accordingly to handle it. – Viki888