Robomongo: Exceeded memory limit for $group

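I'm using a script to remove duplicates from a mongo collection. It worked on a test collection with 10 items, but when I ran it against the real collection, which holds 6 million documents, I got an error.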

This is the script I ran in Robomongo:

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp(); 
var count = 0; 

db.getCollection('RAW_COLLECTION').aggregate([ 
    // Group on unique value storing _id values to array and count 
    { "$group": { 
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" }, 
    "ids": { "$push": "$_id" }, 
    "count": { "$sum": 1 }  
    }}, 
    // Only return things that matched more than once. i.e a duplicate 
    { "$match": { "count": { "$gt": 1 } } } 
]).forEach(function(doc) { 
    var keep = doc.ids.shift();  // takes the first _id from the array 

    bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches 
    count++; 

    if (count % 500 == 0) { // only actually write per 500 operations 
     bulk.execute(); 
     bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp(); // re-init after execute 
    } 
}); 

// Clear any queued operations 
if (count % 500 != 0) 
    bulk.execute();

This is the error message:

Error: command failed: { 
    "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.", 
    "code" : 16945, 
    "ok" : 0 
} : aggregate failed : 
[email protected]/mongo/shell/utils.js:23:13 
[email protected]/mongo/shell/assert.js:13:14 
[email protected]/mongo/shell/assert.js:266:5 
[email protected]/mongo/shell/collection.js:1215:5 
@(shell):1:1

So I need to set allowDiskUse:true to make this work? How do I do that in my script, and is there any danger in doing so?


2017-05-24 kadzu

{ allowDiskUse: true }

should be passed as the second argument to aggregate(), right after the pipeline array.

In your code it should look like this:

db.getCollection('RAW_COLLECTION').aggregate([ 
    // Group on unique value storing _id values to array and count 
    { "$group": { 
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" }, 
    "ids": { "$push": "$_id" }, 
    "count": { "$sum": 1 }  
    }}, 
    // Only return things that matched more than once. i.e a duplicate 
    { "$match": { "count": { "$gt": 1 } } } 
], { allowDiskUse: true })
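
To show how the option fits into the whole deduplication script from the question, here is a minimal sketch that wraps the aggregation and the bulk removal in one function. The names removeDuplicates, collection, and batchSize are hypothetical helpers introduced for illustration; only aggregate(), initializeOrderedBulkOp(), find(), remove(), and execute() come from the mongo shell API used in the question.

```javascript
// Sketch: remove duplicates on (RegisterNumber, Region), keeping one document per group.
// `collection` is assumed to be a mongo shell collection object,
// e.g. db.getCollection('RAW_COLLECTION').
// removeDuplicates and batchSize are hypothetical names, not part of MongoDB.
function removeDuplicates(collection, batchSize) {
    var bulk = collection.initializeOrderedBulkOp();
    var queued = 0;          // operations queued since the last execute()
    var duplicateGroups = 0; // number of groups that contained duplicates

    collection.aggregate([
        // Group on the unique key, collecting _id values and counting documents
        { "$group": {
            "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
            "ids": { "$push": "$_id" },
            "count": { "$sum": 1 }
        }},
        // Keep only groups that matched more than once, i.e. duplicates
        { "$match": { "count": { "$gt": 1 } } }
    ], { allowDiskUse: true }).forEach(function (doc) {
        doc.ids.shift(); // drop the first _id from the list so that document is kept

        bulk.find({ "_id": { "$in": doc.ids } }).remove(); // queue removal of the rest
        queued++;
        duplicateGroups++;

        if (queued === batchSize) { // flush one batch of queued operations
            bulk.execute();
            bulk = collection.initializeOrderedBulkOp(); // re-init after execute
            queued = 0;
        }
    });

    // Flush whatever is still queued; executing an empty bulk would throw
    if (queued > 0) {
        bulk.execute();
    }
    return duplicateGroups;
}
```

In Robomongo or the mongo shell this would be called as removeDuplicates(db.getCollection('RAW_COLLECTION'), 500), which matches the batch size of 500 used in the question.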


2017-05-24 14:50:47 Astro

But is it safe to set it to true? I don't understand why it is necessary – kadzu

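Aggregation pipeline stages have a maximum memory use limit. To handle large datasets, set the allowDiskUse option to true so that stages can write data to temporary files. Expect it to perform differently than when everything is read from memory. It also depends on the size of the dataset – Astro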
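
As a back-of-the-envelope illustration of why the limit is hit, the sketch below estimates how much data the $group stage has to hold for the 6 million documents in the question. The MongoDB documentation gives the per-stage limit as 100 megabytes of RAM, and an ObjectId value is 12 bytes. Everything else is an assumption made for illustration: the helper name estimateGroupMemoryBytes, the figure of 50 bytes per group key, and the premise that almost every document forms its own group.

```javascript
// Rough, hypothetical estimate of the data a $group stage must keep in memory.
// The per-item sizes are illustrative assumptions, not measured values.
function estimateGroupMemoryBytes(docCount, bytesPerPushedId, groupCount, bytesPerGroupKey) {
    // Each document pushes one _id into some group's "ids" array,
    // and each distinct group also stores its key.
    return docCount * bytesPerPushedId + groupCount * bytesPerGroupKey;
}

// Per-stage limit of 100 MB, taken here as 100 * 1024 * 1024 bytes (assumption)
var STAGE_LIMIT_BYTES = 100 * 1024 * 1024;

// 6 million documents, 12-byte ObjectIds, and (assumed) about one group
// per document with roughly 50 bytes per { RegisterNumber, Region } key
var estimatedBytes = estimateGroupMemoryBytes(6000000, 12, 6000000, 50);
var exceedsLimit = estimatedBytes > STAGE_LIMIT_BYTES;
```

Under these assumptions the estimate comes out several times larger than the limit, which is why the 10-item test collection worked and the full collection did not. With allowDiskUse set to true the stage spills to temporary files in the _tmp subdirectory of the server's dbPath, so the practical costs are slower execution and temporary disk usage, not a change in the results.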