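Fastest way to remove duplicate documents in MongoDB

I have approximately 1.7 million documents in MongoDB (10M+ in the future). Some of them represent duplicate entries that I do not want. The document structure looks like this: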

{
    _id: 14124412,
    nodes: [
        12345,
        54321
    ],
    name: "Some beauty"
}

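A document is a duplicate if it shares at least one node with another document that has the same name. What is the fastest way to remove the duplicates?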
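To make that definition concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the rule "same name and at least one shared node", not something that talks to MongoDB; the function name `find_duplicate_ids` and the sample documents are made up for the example, and it assumes the first document seen is the one to keep.

```python
def find_duplicate_ids(docs):
    """Return the _ids of documents that duplicate an earlier document.

    A document counts as a duplicate when an earlier *kept* document has the
    same name and shares at least one node with it.
    """
    seen = set()        # (name, node) pairs claimed by documents we keep
    duplicates = []
    for doc in docs:
        pairs = {(doc["name"], node) for node in doc["nodes"]}
        if pairs & seen:
            duplicates.append(doc["_id"])   # overlaps a kept document
        else:
            seen |= pairs                   # first of its kind: keep it
    return duplicates


# Hypothetical sample data in the shape shown in the question
docs = [
    {"_id": 1, "name": "Some beauty", "nodes": [12345, 54321]},
    {"_id": 2, "name": "Some beauty", "nodes": [54321, 99999]},  # shares 54321
    {"_id": 3, "name": "Other", "nodes": [12345]},               # different name
    {"_id": 4, "name": "Some beauty", "nodes": [11111]},         # no shared node
]
print(find_duplicate_ids(docs))  # [2]
```

Note that the nodes of a dropped duplicate (99999 here) are not remembered, so a later document sharing only that node would be kept. Whether that is the desired behaviour depends on how strictly "duplicate" should chain.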

Source

2013-01-06 ewooycom

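Assuming you want to permanently delete the documents that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option: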

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

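As the docs say, use this with extreme caution, because it will delete data from your database. Back up your database first in case it does not do exactly what you expect.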

UPDATE

This solution is only valid through MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).

Source

2013-01-06 17:00:44 JohnnyHK

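The name does not need to be unique. Will this delete a document only when the name and at least one node are the same? – ewooycom

@user1188570 It is a compound index, so both fields must be duplicated within the same document – Sammaye

@Sammaye I think merging the nodes would be a better solution. Is there something like action: {$merge: nodes} instead of dropDups? How would you achieve that? – ewooycom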

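The dropDups: true option is not available in 3.0.

I have a solution using the aggregation framework that collects the duplicates and then removes them in one go.

It may be somewhat slower than a system-level "index" change, but it has the advantage that you choose how the duplicate documents are removed.

a. Remove all documents in one go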

var duplicates = [];

db.collectionName.aggregate([
    { $match: {
        name: { "$ne": '' }  // optional selection criteria
    }},
    { $group: {
        _id: { name: "$name" },  // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }  // duplicates are groups with a count greater than one
    }}
],
{ allowDiskUse: true }  // lets large stages spill to disk instead of hitting the memory limit
)  // you can stop here to display the result and check the duplicates
.forEach(function(doc) {
    doc.dups.shift();  // skip the first element so that one document is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);  // collect all duplicate ids
    });
});

// Print every "_id" about to be deleted; omit this if you do not need to check them
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({ _id: { $in: duplicates } })

b. You can delete the documents one group at a time.

db.collectionName.aggregate([
    // optional selection criteria; you can remove this "$match" stage if you want
    { $match: {
        "source_references.key": { "$ne": '' }  // dotted field names must be quoted
    }},
    { $group: {
        _id: { key: "$source_references.key" },  // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }  // duplicates are groups with a count greater than one
    }}
],
{ allowDiskUse: true }  // lets large stages spill to disk instead of hitting the memory limit
)  // you can stop here to display the result and check the duplicates
.forEach(function(doc) {
    doc.dups.shift();  // skip the first element so that one document is kept
    db.collectionName.remove({ _id: { $in: doc.dups } });  // delete the remaining duplicates
})
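For readers driving this from Python, the collect-then-delete idea could look roughly like the sketch below. It assumes `collection` is a pymongo `Collection`; the function names and the default batch size of 10,000 are illustrative choices rather than anything prescribed by this answer. Deleting in batches is an added precaution so that a single `$in` list does not grow too large on big collections.

```python
def collect_duplicate_ids(collection, field="name"):
    """Return the _ids of every document except one per distinct `field` value."""
    pipeline = [
        {"$group": {
            "_id": "$" + field,
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1},
        }},
        {"$match": {"count": {"$gt": 1}}},
    ]
    duplicates = []
    for group in collection.aggregate(pipeline, allowDiskUse=True):
        duplicates.extend(group["dups"][1:])   # keep the first _id of each group
    return duplicates


def delete_in_batches(collection, ids, batch_size=10000):
    """Delete `ids` in chunks and return the total number of deleted documents."""
    deleted = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        deleted += collection.delete_many({"_id": {"$in": batch}}).deleted_count
    return deleted
```

A call would then be something like `delete_in_batches(coll, collect_duplicate_ids(coll))`, where `coll` stands for your own collection object. As with the shell version, inspect the collected ids before deleting anything.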

Source

2015-10-27 09:38:04

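If the result is not used, I get a TypeError – sara

Thanks for your help! I did find that when you have a lot of rows (I have 5M), it is better to create a counter and limit each pass to 10K rather than collecting all the duplicates at once, because the list may get too large :) – Mazki516

This looks great! Do you have any performance advice? I have about 3M rows with only a few duplicates. Is it better to do it in one go (your solution a) or one by one? – Nico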

Create a dump of the collection with mongodump

Clear the collection

Add a unique index

Restore the collection with mongorestore
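Why this removes duplicates can be modelled in a few lines of Python. The sketch below is a toy in-memory model, not the real mongorestore: it assumes the restore reports duplicate-key errors and carries on with the remaining documents instead of aborting, which is worth verifying for your tool version. The function name and sample data are invented for the illustration, and it only handles scalar key fields.

```python
def restore_with_unique_index(dumped_docs, key_fields):
    """Model a restore into an empty collection that has a unique index.

    Returns (restored, rejected): the first document for each key is kept,
    later documents with the same key are rejected as duplicate-key errors.
    """
    index = set()
    restored, rejected = [], []
    for doc in dumped_docs:
        key = tuple(doc.get(f) for f in key_fields)
        if key in index:
            rejected.append(doc)      # would violate the unique index
        else:
            index.add(key)
            restored.append(doc)
    return restored, rejected


# Hypothetical dump contents
dump = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b"},
    {"_id": 3, "name": "a"},   # same name as _id 1
]
restored, rejected = restore_with_unique_index(dump, ["name"])
print([d["_id"] for d in restored], [d["_id"] for d in rejected])  # [1, 2] [3]
```

Which copy survives depends purely on the order in which documents are restored, so this approach suits cases where any one of the duplicates is acceptable.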

Source

2016-07-01 06:42:53 dhythhsba

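This is by far the simplest approach for me: a few minutes of downtime saves the stress of running unfamiliar queries. – misaka

This is a simpler and more intuitive approach. Thanks. – Nerzid

Thanks. Can I clarify that restoring the collection after adding the unique index means no errors are raised when it tries to insert the duplicate entries? – memebrain

I found this solution, which works with MongoDB 3.4. I assume the field with duplicates is called fieldX: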

db.collection.aggregate([ 
{ 
    // only match documents that have this field 
    // you can omit this stage if you don't have missing fieldX 
    $match: {"fieldX": {$nin:[null]}} 
}, 
{ 
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}} 
}, 
{ 
    $replaceRoot: { "newRoot": "$doc"} 
} 
], 
{allowDiskUse:true})

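As a newcomer to MongoDB, I spent a lot of time using other lengthy solutions to find and remove duplicates. However, I think this solution is neat and easy to understand.

It works by first matching the documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups the documents by fieldX and keeps only the $first document of each group, using $$ROOT. Finally, $replaceRoot replaces each aggregated group with the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages. Although the documentation for $first mentions a sort stage before using $first, it worked for me without one. "Couldn't post a link here, my reputation is less than 10 :("

You can save the results to a new collection by adding an $out stage...

Alternatively, if you are only interested in a few fields, e.g. field1 and field2, rather than the whole document, select them in the group stage and omit replaceRoot: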

db.collection.aggregate([ 
{ 
    // only match documents that have this field 
    $match: {"fieldX": {$nin:[null]}} 
}, 
{ 
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }} 
} 
], 
{allowDiskUse:true})
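The $out variant mentioned above is not shown in this answer, so here is a hedged sketch of how the full pipeline could be assembled from pymongo. The helper name `first_per_value_pipeline` and the target collection name `"deduped"` are assumptions for illustration; the function only builds the list of stages, so it can be inspected before being run.

```python
def first_per_value_pipeline(field, out_collection=None):
    """Build the $match / $group / $replaceRoot pipeline for `field`.

    If `out_collection` is given, append an $out stage that writes the
    de-duplicated documents to that collection.
    """
    pipeline = [
        # only keep documents where the field exists and is not null
        {"$match": {field: {"$nin": [None]}}},
        # one group per distinct value, remembering the first whole document
        {"$group": {"_id": "$" + field, "doc": {"$first": "$$ROOT"}}},
        # promote the remembered document back to the top level
        {"$replaceRoot": {"newRoot": "$doc"}},
    ]
    if out_collection is not None:
        pipeline.append({"$out": out_collection})
    return pipeline


# Hypothetical usage with a pymongo collection object `coll`:
# coll.aggregate(first_per_value_pipeline("fieldX", "deduped"), allowDiskUse=True)
```

Writing to a separate collection leaves the original data untouched, which makes it easy to compare counts before swapping the collections.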

Source

2017-06-13 13:13:33

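Here is a slightly more "manual" way of doing it.

Essentially, first get a list of all the unique keys you are interested in.

Then run a search with each of those keys, and delete documents whenever that search returns more than one.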

db.collection.distinct("key").forEach((num) => {
    var i = 0;
    db.collection.find({key: num}).forEach((doc) => {
        // every document after the first one for this key triggers one removal
        if (i) db.collection.remove({key: num}, { justOne: true });
        i++;
    });
});
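The same per-key loop can be sketched in Python. This version assumes `collection` is a pymongo `Collection` and that the duplicated field is literally named `key`, as in the shell snippet; the function name is made up. Unlike the shell loop, it deletes by explicit `_id`, so it is unambiguous which document of each group is kept.

```python
def keep_first_per_key(collection, key_field="key"):
    """For each distinct value of `key_field`, keep one document and delete the rest.

    Returns the number of documents removed.
    """
    removed = 0
    for value in collection.distinct(key_field):
        # fetch only the _ids of the documents sharing this value
        ids = [doc["_id"] for doc in collection.find({key_field: value}, {"_id": 1})]
        if len(ids) > 1:
            result = collection.delete_many({"_id": {"$in": ids[1:]}})
            removed += result.deleted_count
    return removed
```

This issues one query per distinct value, so it is likely to be slow when there are many distinct keys; it is mainly useful for small collections or one-off clean-ups.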

Source

2017-08-23 12:42:28 Fernando

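The general idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/ to retrieve one arbitrary id from the duplicate records in the collection.
Then delete every record in the collection that matches the same criteria, except the one whose id we retrieved with findOne.

If you are trying to do this in pymongo, you can do it like this.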

import logging

_logger = logging.getLogger(__name__)


def _run_query(collection):
    # `collection` is a pymongo Collection; field1/field2 and 'x'/'y' are placeholders
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)

            try:
                # pick one document to keep among the duplicates
                retain = collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])

                # delete_many replaces the deprecated Collection.remove
                collection.delete_many({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})

            except Exception as ex:
                _logger.error("Error when retaining the record: %s Exception: %s", record, str(ex))

    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])

From the shell:

Replace find_one with findOne.
The same remove command should work.

Source

2017-11-30 01:49:25 amateur
