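Delete duplicate rows but keep many-to-many relationships

2017-08-31

I store the results of my TensorFlow image classifier in an SQL database. I have three tables: images, terms (categories), and a join table that links the two and carries a weight column. Some images have no relationships; some have many.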

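The problem is that the images table contains duplicate rows that need to be deleted. But if a duplicated image has one or more relationships, I need to keep those many-to-many relationships.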

Here is an example:

Table name: my_images

+----+------------+----------------+
| ID | image_path | image_filename |
+----+------------+----------------+
| 1  | Film 1     | Film 1 001.jpg |
| 2  | Film 1     | Film 1 001.jpg |
| 3  | Film 1     | Film 1 002.jpg |
| 4  | Film 1     | Film 1 002.jpg |
| 5  | Film 1     | Film 1 003.jpg |
| 6  | Film 1     | Film 1 003.jpg |
+----+------------+----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
| 1       | cat        |
| 2       | dog        |
| 3       | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
| 2        | 1       | 0.58516 |
| 2        | 3       | 0.16721 |
| 3        | 2       | 0.21475 |
+----------+---------+---------+

So in this example, the ideal result is to delete rows 1 and 4, and either row 5 or row 6, from my_images.

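It has been a long time since I wrote real SQL queries, so I won't post an answer. I would start by writing a query that deletes the duplicates, like the second most upvoted answer here: https://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql Then I would add a subquery requiring that the selected ID exists in your my_term_relationships. Hope it helps – Logar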

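By the way, is it possible that the same image_filename is referenced under different ids in 'my_term_relationships'? If so, my suggestion above won't work. In that case I would first clean up the 'my_term_relationships' table so that it holds only one image_id per image_filename. After that, my comment above should apply, I think – Logar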

Answers

0

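You need to query two sets of image IDs and use them for filtering. Rows in the first set are deleted unless they also belong to the second set. Assuming the pair (image_path, image_filename) is meant to be unique:

  1. All my_images IDs that are not themselves referenced in my_term_relationships, although the corresponding image_path + image_filename pair may be referenced through another ID.
  2. One unique ID for each image_path + image_filename pair that is not referenced in my_term_relationships at all.

Take a look at this query: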

DELETE FROM my_images
WHERE
    ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
    AND
    ID NOT IN (SELECT id FROM (
        SELECT MIN(ID) AS id
        FROM my_images
        LEFT JOIN my_term_relationships ON ID = image_id
        GROUP BY image_path, image_filename
        HAVING COUNT(image_id) = 0
    ) AS u_ids -- 2
);

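Note that the subquery on my_images inside the DELETE's WHERE clause has to be wrapped in another subquery (a derived table). Read this thread for an explanation: Can't specify target table for update in FROM clause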
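
As a rough way to check the query against the question's example data, the sketch below runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in chosen so that the example is self-contained; it does not raise MySQL's "can't specify target table" error, so the extra derived-table wrapper is left out here. The column types are assumptions, since the question does not show the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the question only shows column names, not types or keys.
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
INSERT INTO my_term_relationships VALUES
    (2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# Same two filters as the answer's query, without the MySQL-only wrapper.
conn.execute("""
DELETE FROM my_images
WHERE ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships)  -- 1
  AND ID NOT IN (                                                      -- 2
      SELECT MIN(ID)
      FROM my_images
      LEFT JOIN my_term_relationships ON ID = image_id
      GROUP BY image_path, image_filename
      HAVING COUNT(image_id) = 0
  )
""")

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]
print(remaining_ids)
```

On this data the script should leave IDs 2, 3 and 5: the two referenced rows plus one representative of the pair that has no relationships.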

Example: removing duplicate rows from my_term_relationships (sqlfiddle)


Example update query:

UPDATE my_term_relationships
SET image_id = (
    -- smallest ID among all rows sharing the referenced image's path and filename
    SELECT MIN(my_images.ID)
    FROM my_images
    JOIN my_images AS ref_image
        ON my_images.image_path = ref_image.image_path
        AND my_images.image_filename = ref_image.image_filename
    WHERE ref_image.ID = my_term_relationships.image_id
);
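
To see what this UPDATE does, here is a small sketch that runs it in an in-memory SQLite database via Python's sqlite3 module. SQLite is used only so the example is runnable on its own. The relationship rows are made up for illustration (they are not the question's data): both copies of the first image are referenced, and the second image is referenced through its higher ID. The column types are assumed as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and hypothetical relationship rows, chosen so that
# duplicate copies of the same image are referenced under different IDs.
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg');
INSERT INTO my_term_relationships VALUES
    (1, 1, 0.58516), (2, 3, 0.16721), (4, 2, 0.21475);
""")

# Repoint every relationship to the smallest ID that shares the same
# image_path and image_filename as the image it currently references.
conn.execute("""
UPDATE my_term_relationships
SET image_id = (
    SELECT MIN(my_images.ID)
    FROM my_images
    JOIN my_images AS ref_image
      ON my_images.image_path = ref_image.image_path
     AND my_images.image_filename = ref_image.image_filename
    WHERE ref_image.ID = my_term_relationships.image_id
)
""")

relationships = conn.execute(
    "SELECT image_id, term_id FROM my_term_relationships ORDER BY image_id, term_id"
).fetchall()
print(relationships)
```

After the update, all relationships of the first image point at ID 1 and the relationship of the second image points at ID 3, so the higher-numbered copies are no longer referenced and can be deleted.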

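After running this query I still have some duplicate image_path + image_filename pairs. Maybe I have rows in my_term_relationships that point to duplicate images. Is there a way to merge these? –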

Then you need to run an UPDATE on my_term_relationships before deleting the rows. –

I updated the fiddle: http://sqlfiddle.com/#!9/5c4e3e/3 –

1

Approach it step by step.

First, find the duplicated entries:

SELECT 
image_path, image_filename 
FROM my_images 
GROUP BY image_path, image_filename 
HAVING COUNT(*) > 1 

Second, get all rows that belong to one of those duplicated entries:

SELECT mi.* 
FROM my_images mi 
JOIN (
    SELECT 
    image_path, image_filename 
    FROM my_images 
    GROUP BY image_path, image_filename 
    HAVING COUNT(*) > 1 
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename 

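Finally, get the IDs to delete: for each duplicated pair, the smallest ID that is not referenced in my_term_relationships. This yields at most one ID per pair, so if a pair occurs more than twice the delete may have to be run more than once.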

SELECT MIN(ID) 
FROM my_images mi 
JOIN (
    SELECT 
    image_path, image_filename 
    FROM my_images 
    GROUP BY image_path, image_filename 
    HAVING COUNT(*) > 1 
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename 
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id 
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename 
HAVING COUNT(*) > 0 
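
As a quick sanity check, the following sketch runs the query above on the question's example data, using an in-memory SQLite database through Python's sqlite3 module. SQLite and the column types are assumptions made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the rows are the example data from the question.
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
INSERT INTO my_term_relationships VALUES
    (2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# For each duplicated (image_path, image_filename) pair, pick the smallest
# ID that has no row in my_term_relationships.
rows = conn.execute("""
SELECT MIN(mi.ID)
FROM my_images mi
JOIN (
    SELECT image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL
GROUP BY mi.image_path, mi.image_filename
""").fetchall()

ids_to_delete = sorted(row[0] for row in rows)
print(ids_to_delete)
```

On this data the selected IDs are 1, 4 and 5, which matches the result the question asks for (rows 1 and 4, and either 5 or 6).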

Check that the result is correct. If it is, convert the query into a DELETE statement.

DELETE my_images.* FROM my_images
JOIN (
    SELECT MIN(ID) AS ID
    FROM my_images mi
    JOIN (
        SELECT
        image_path, image_filename
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
    LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
    WHERE mtr.image_id IS NULL
    GROUP BY mi.image_path, mi.image_filename
    HAVING COUNT(*) > 0
) sq USING(ID);

Edit: To also fix the problem Logar mentioned, run this UPDATE statement before the DELETE statement.

UPDATE my_term_relationships mtr 
JOIN (
    SELECT mi.ID, minID 
    FROM my_images mi 
    JOIN (
     SELECT 
     image_path, image_filename, MIN(ID) AS minID 
     FROM my_images 
     GROUP BY image_path, image_filename 
     HAVING COUNT(*) > 1 
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename 
) sq ON mtr.image_id = sq.ID 
SET mtr.image_id = sq.minID; 
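
The sketch below illustrates why the order matters, using the relationship values Logar suggests in the comments (image_ids 1, 2 and 3). It runs in an in-memory SQLite database through Python's sqlite3 module, so the statements are adapted rather than copied: SQLite has no multi-table DELETE ... JOIN and older versions lack UPDATE with a join, so a subquery form is used instead. The helper name surviving_ids and the column types are hypothetical.

```python
import sqlite3

# Assumed schema; relationship rows follow the values from Logar's comment.
SETUP = """
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
INSERT INTO my_term_relationships VALUES
    (1, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
"""

# Repoint each relationship to the smallest ID with the same path and filename.
REPOINT = """
UPDATE my_term_relationships
SET image_id = (
    SELECT MIN(first_copy.ID)
    FROM my_images AS first_copy
    JOIN my_images AS ref_image
      ON first_copy.image_path = ref_image.image_path
     AND first_copy.image_filename = ref_image.image_filename
    WHERE ref_image.ID = my_term_relationships.image_id
)
"""

# Delete, per duplicated pair, the smallest ID that has no relationship.
DELETE_DUPS = """
DELETE FROM my_images
WHERE ID IN (
    SELECT MIN(mi.ID)
    FROM my_images mi
    JOIN (
        SELECT image_path, image_filename
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
    LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
    WHERE mtr.image_id IS NULL
    GROUP BY mi.image_path, mi.image_filename
)
"""


def surviving_ids(repoint_first):
    """Return the my_images IDs left after the delete, on a fresh database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    if repoint_first:
        conn.execute(REPOINT)
    conn.execute(DELETE_DUPS)
    return [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]


without_update = surviving_ids(repoint_first=False)
with_update = surviving_ids(repoint_first=True)
print(without_update)
print(with_update)
```

Without the UPDATE, IDs 1 and 2 both survive because each is referenced; with the UPDATE run first, the relationship on ID 2 moves to ID 1 and the duplicate can be removed.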

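Again, I believe that if two ids with the same filename are both referenced in 'my_term_relationships', you will keep both of them in 'my_images'. I would add a first query that updates the ids in 'my_term_relationships'. To see what I mean, in your fiddle change the values to VALUES (1,1,0.58516), (2,3,0.16721), (3,2,0.21475). – Logar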

Great answer, thanks. But when I try the delete query I get the error "You can't specify target table 'my_images' for update in FROM clause" –

@DavidApple Fixed the error. – fancyPants