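Comparing multiple numeric columns to determine record similarity

I have a table with an ID column and data columns holding integer values from -5 to 5 (including 0). I want to compare the records across all columns to determine which ones are similar.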

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ B ║ 0 ║ 1 ║ -1 ║ 3 ║ 
║ C ║ 1 ║ -2 ║ -3 ║ 1 ║ 
║ D ║ -1 ║ -1 ║ 5 ║ 0 ║ 
║ E ║ 2 ║ -3 ║ 1 ║ -2 ║ 
║ F ║ -3 ║ 1 ║ -2 ║ -1 ║ 
║ G ║ -4 ║ -1 ║ -1 ║ -3 ║ 
╚════╩══════╩══════╩══════╩══════╝

I would like to group the IDs. For example, IDs A and G above are similar because their values in each column are very close.

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ G ║ -4 ║ -1 ║ -1 ║ -3 ║ 
╚════╩══════╩══════╩══════╩══════╝

On the other hand, A and B are different:

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ B ║ 0 ║ 1 ║ -1 ║ 3 ║ 
╚════╩══════╩══════╩══════╩══════╝

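For a given pair of IDs, I am considering computing the difference in each column and then summing those differences to get a similarity score (a larger number means less similar). At the moment this is the best idea I have, but I am open to more accurate or more efficient approaches. (That is, using the absolute value of the difference between the column values.)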
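A minimal sketch of this scoring idea in Python, using three of the sample rows from the table. The helper names `total_diff` and `signed_sum` are made up for illustration; `signed_sum` is included only to show why the absolute value matters.

```python
# Sample rows from the table: ID -> (COL1, COL2, COL3, COL4)
rows = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "G": (-4, -1, -1, -3),
}

def total_diff(x, y):
    """Sum of absolute per-column differences (hypothetical helper)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def signed_sum(x, y):
    """Sum of signed differences, shown only as a counter-example."""
    return sum(a - b for a, b in zip(x, y))

print(total_diff(rows["A"], rows["G"]))  # 4  (similar)
print(total_diff(rows["A"], rows["B"]))  # 14 (different)

# Without abs(), opposite differences cancel out:
print(signed_sum((5, 0), (0, 5)))  # 0
print(total_diff((5, 0), (0, 5)))  # 10
```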

Source

2014-10-03 Brian Badge

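Be careful to use the absolute value of the differences, otherwise some differences may cancel each other out, e.g. ((5-0) + (0-5)). By your definition those two records would be different, but a naive implementation would flag them as identical. – Sirko 2014-10-03 18:16:03

Why not sum the absolute differences: 'score = ABS(5-0) + ABS(0-5) + ...' – Rimas 2014-10-03 18:33:03

The trouble here is what determines "similar". If every column is off by 1, is that similar? What about 2? What if 3 of the 4 are identical and one is off by 2? Then... so you are comparing the entire ROW across all 4 columns... there is too much fuzzy logic here to define "similar". – xQbert 2014-10-03 19:08:25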

One approach is as follows:

with all_compared as (
    -- Self-join: pair every row with every other row and sum the absolute differences
    select a.id as ID,
           b.id as CompID,
           abs(a.col1 - b.col1) + abs(a.col2 - b.col2) +
           abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- Rank each ID's comparisons, smallest total difference first
    select ID,
           CompID,
           TotalDiff,
           rank() over (partition by ID order by TotalDiff) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;

I have made a SQL Fiddle showing how I got to this point step by step: http://sqlfiddle.com/#!4/fef06/14

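You will then need to decide how to handle ties, since rank() returns every row tied for the smallest TotalDiff, so the output can contain more than one match for the same ID.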

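It uses a Cartesian product (all rows of one table joined to all rows of the other) in a self-join, so that every row is compared with every other row, and then sums the absolute differences between col1, col2 and so on. We then rank by the total difference and select the top-ranked rows.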
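The same logic can be sketched outside the database. The Python below is only an illustration, not a replacement for the query: it assumes the sample data from the question, and `total_diff`, `all_compared` and `closest` are illustrative names. It keeps every tied match, like rank() with rnk = 1.

```python
from itertools import permutations

# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4)
rows = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}

def total_diff(x, y):
    """Sum of absolute per-column differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Every ordered pair of distinct IDs, like the self-join with a.id != b.id
all_compared = {
    (i, j): total_diff(rows[i], rows[j]) for i, j in permutations(rows, 2)
}

# For each ID keep all partners that share the smallest total difference
closest = {}
for i in rows:
    best = min(d for (a, _), d in all_compared.items() if a == i)
    closest[i] = sorted(
        j for (a, j), d in all_compared.items() if a == i and d == best
    )

for i, matches in closest.items():
    print(i, matches)
# A ['G']
# B ['C', 'F']
# C ['B']
# D ['E']
# E ['A', 'C']
# F ['G']
# G ['A']
```

With this data, B and E each have two equally close partners, which is the tie situation that has to be handled.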

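Another approach is to use the squared distance instead of the absolute difference. This magnifies larger differences, so you need to consider whether you want that.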

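For example, a pair of rows whose two columns hold 1,1 and 0,5 would get 25, because (1-1)^2 is 0 and (0-5)^2 is 25. That would count as less similar than a pair holding 0,3 and -4,-1, which would get 18 (3^2 + 3^2). With absolute differences the first pair scores 5 and the second 6, so the first would be considered more similar, because all differences are treated with the same weight.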
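The arithmetic of that example can be checked with a short Python sketch. The names `abs_diff`, `sq_dist`, `pair_1` and `pair_2` are made up for illustration; each pair is written as two rows of two columns.

```python
def abs_diff(x, y):
    """Sum of absolute per-column differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sq_dist(x, y):
    """Sum of squared per-column differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# First pair: column 1 holds 1 and 1, column 2 holds 0 and 5
pair_1 = ((1, 0), (1, 5))
# Second pair: column 1 holds 0 and 3, column 2 holds -4 and -1
pair_2 = ((0, -4), (3, -1))

print(abs_diff(*pair_1), sq_dist(*pair_1))  # 5 25
print(abs_diff(*pair_2), sq_dist(*pair_2))  # 6 18
```

The two measures order the pairs differently: the first pair is closer by absolute difference but further apart by squared distance.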

The squared distance version is:

with all_compared as (
    -- Self-join: pair every row with every other row and sum the squared differences
    select a.id as ID,
           b.id as CompID,
           power(a.col1 - b.col1, 2) +
           power(a.col2 - b.col2, 2) +
           power(a.col3 - b.col3, 2) +
           power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- Rank each ID's comparisons, smallest squared distance first
    select ID,
           CompID,
           SqDist,
           rank() over (partition by ID order by SqDist) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;


Alternatively, you can use both, with the squared distance used only to break ties:

with all_compared as (
    -- Self-join: compute both measures for every pair of distinct rows
    select a.id as ID,
           b.id as CompID,
           abs(a.col1 - b.col1) + abs(a.col2 - b.col2) +
           abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff,
           power(a.col1 - b.col1, 2) +
           power(a.col2 - b.col2, 2) +
           power(a.col3 - b.col3, 2) +
           power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- Rank by total difference first; squared distance breaks ties
    select ID,
           CompID,
           TotalDiff,
           SqDist,
           rank() over (partition by ID order by TotalDiff, SqDist) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;
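As a rough cross-check of the tie-breaking idea, here is a Python sketch that orders candidates by the tuple (TotalDiff, SqDist). It assumes the sample data from the question, and the function names are illustrative only.

```python
# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4)
rows = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}

def total_diff(x, y):
    """Sum of absolute per-column differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sq_dist(x, y):
    """Sum of squared per-column differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def closest(i):
    """Partners of i with the smallest (TotalDiff, SqDist) pair."""
    others = [j for j in rows if j != i]

    def key(j):
        # Tuples compare left to right, so SqDist only matters on a TotalDiff tie
        return (total_diff(rows[i], rows[j]), sq_dist(rows[i], rows[j]))

    best = min(key(j) for j in others)
    return [j for j in others if key(j) == best]

for i in rows:
    print(i, closest(i))
# A ['G']
# B ['C']
# C ['B']
# D ['E']
# E ['C']
# F ['G']
# G ['A']
```

For this data the squared distance resolves both ties: B now matches only C (18 against 26 for F) and E matches only C (27 against 51 for A). Other data could still produce rows tied on both measures.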


Source

2014-10-03 19:56:58 ChrisProsser
