2013-02-27

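SQL to MapReduce: counting unique keys in a many-to-one relationship?

Originally, I had the usual relationship where one order has many lineitems and each lineitem belongs to exactly one order.

In MongoDB, I created this document to represent it: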

{ 
    "_id" : ObjectId("511b7d1b3daee1b1446ecdfe"), 
    "l_order" : { 
     "_id" : ObjectId("511b7d133daee1b1446eb54d"), 
     "o_orderkey" : NumberLong(1), 
     "o_totalprice" : 173665.47, 
     "o_orderdate" : ISODate("1996-01-02T03:00:00Z"), 
     "o_orderpriority" : "5-LOW", 
     "o_shippriority" : 0, 
    }, 
    "l_linenumber" : 1, 
    "l_shipdate" : ISODate("1996-03-13T03:00:00Z"), 
    "l_commitdate" : ISODate("1996-02-12T03:00:00Z"), 
    "l_receiptdate" : ISODate("1996-03-22T03:00:00Z"), 
} 

My intention is to translate this SQL query:

select 
    o_orderpriority, 
    count(*) as order_count 
from 
    orders 
where 
    o_orderdate >= date '1993-07-01' 
    and o_orderdate < date '1993-07-01' + interval '3' month 
    and exists (
     select 
     * 
     from 
     lineitem 
     where 
     l_orderkey = o_orderkey 
     and l_commitdate < l_receiptdate 
    ) 
group by 
    o_orderpriority 
order by 
    o_orderpriority; 

To do this I use two MapReduce functions:

The first one:

db.runCommand({ 
    mapreduce: "lineitem", 
    query: { 
     "l_order.o_orderdate": {'$gte': new Date("July 01, 1993"), '$lt': new Date("Oct 01, 1993")} 
    }, 
    map: function Map() { 
       if(this.l_commitdate < this.l_receiptdate){ 
        emit(this.l_order.o_orderkey, this.l_order.o_orderpriority); 
       } 
      }, 
    out: 'query004a' 
}); 

The second one:

db.runCommand({ 
    mapreduce: "query004a", 
    map: function Map() { 
       /* Remember: the value here is this.l_order.o_orderpriority from the previous mapreduce function */ 
       emit(this.value, 1); 
      }, 
    reduce: function(key, values) { 
       return Array.sum(values); 
      }, 
    out: 'query004b' 
}); 

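In the first one I select the documents that fall within the date range and satisfy the date comparison, and group them by order key to avoid duplicates. In the second one I group by o_orderpriority and sum.

To my surprise, the result is larger than I expected. Why does this happen?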

Answer

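In your first map function you should use 'orderpriority' as the key and 'orderkey' as the value. This reduces the set to the keys you need in the second mapReduce. (You need to specify a reduce function, otherwise mapReduce will return an error.)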

So, this could look like this:

OrderDateMin = new Date("1996-01-01"); 
OrderDateMax = new Date("1996-04-01"); 
// first "where" on orderdate 
query = { 
    "l_order.o_orderdate": {$gte: OrderDateMin, $lt: OrderDateMax} 
} 
map1 = function() { 
    // second "where" on commitdate < receiptdate 
    if (this.l_commitdate < this.l_receiptdate) { 
     // emit orderpriority as key, orderkey as value 
     emit(this.l_order.o_orderpriority, this.l_order.o_orderkey); 
    } 
}; 
reduce1 = function(key, values) { 
    return 1; 
} 
db.runCommand({ 
    mapReduce: "xx", 
    query: query, 
    map: map1, 
    reduce: reduce1, 
    out: 'query004a', 
}) 
map2 = function() { 
    // _id is orderpriority 
    emit(this._id, 1); 
}; 
reduce2 = function(key, values) { 
    // count entries per orderpriority 
    var count = 0; 
    values.forEach(function(value) { count += value; }); 
    return count; 
} 
db.runCommand({ 
    mapReduce: "query004a", 
    map: map2, 
    reduce: reduce2, 
    out: 'query004b', 
}) 

Now, the same can be achieved with a single aggregate command, which is faster (implemented natively in C++ rather than in JavaScript):

db.xx.aggregate([ 
    // first "where", this will use an index, if defined 
    { $match: { 
     "l_order.o_orderdate": { $gte: OrderDateMin, $lt: OrderDateMax } 
    }}, 
    // reduce to needed fields, create a field for decision of second "where" 
    // (field paths need the "$" prefix, otherwise two string literals are compared) 
    { $project: { 
     "key": "$l_order.o_orderkey", 
     "pri": "$l_order.o_orderpriority", 
     okay: { $cond: [ {$lt: ["$l_commitdate", "$l_receiptdate"]}, 1, 0 ] } 
    }}, 
    // select second where condition matched 
    { $match: { "okay": 1 } }, 
    // group by priority and key 
    { $group: { _id: { "pri": "$pri", "key": "$key" } } }, 
    // group by priority - count entries 
    { $group: { _id: "$_id.pri", "count": { $sum: 1 } } }, 
]) 

This returns something like:

{ "result" : [ { "_id" : "5-LOW", "count" : 1 } ], "ok" : 1 } 
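
The grouping logic of the aggregate command (filter twice, group by priority and order key, then count per priority) can be checked without a database. The following is only a minimal sketch in plain JavaScript: the helper name `countOrdersByPriority` and the sample documents are made up for illustration and are not part of MongoDB.

```javascript
// Hypothetical helper (not a MongoDB API): counts distinct orders per priority
// on a plain array of lineitem documents shaped like the one in the question.
function countOrdersByPriority(lineitems, dateMin, dateMax) {
    var seen = {};   // "priority|orderkey" pairs that were already counted
    var counts = {}; // priority -> number of distinct orders
    lineitems.forEach(function (li) {
        var order = li.l_order;
        // first "where": order date inside [dateMin, dateMax)
        if (order.o_orderdate < dateMin || order.o_orderdate >= dateMax) return;
        // second "where": committed before received
        if (!(li.l_commitdate < li.l_receiptdate)) return;
        // first grouping: keep one entry per (priority, order key)
        var pairKey = order.o_orderpriority + "|" + order.o_orderkey;
        if (seen[pairKey]) return;
        seen[pairKey] = true;
        // second grouping: count entries per priority
        counts[order.o_orderpriority] = (counts[order.o_orderpriority] || 0) + 1;
    });
    return counts;
}

// Made-up sample data: order 1 has two qualifying lineitems, order 2 has one,
// order 3 has none, and order 4 lies outside the date range.
function lineitem(key, pri, orderDate, commitDate, receiptDate) {
    return {
        l_order: { o_orderkey: key, o_orderpriority: pri, o_orderdate: new Date(orderDate) },
        l_commitdate: new Date(commitDate),
        l_receiptdate: new Date(receiptDate)
    };
}
var sampleLineitems = [
    lineitem(1, "5-LOW",    "1996-01-02", "1996-02-12", "1996-03-22"),
    lineitem(1, "5-LOW",    "1996-01-02", "1996-02-28", "1996-03-05"),
    lineitem(2, "1-URGENT", "1996-02-10", "1996-03-01", "1996-03-09"),
    lineitem(3, "5-LOW",    "1996-03-15", "1996-04-20", "1996-04-10"),
    lineitem(4, "5-LOW",    "1996-05-01", "1996-05-10", "1996-05-20")
];

var sampleCounts = countOrdersByPriority(
    sampleLineitems, new Date("1996-01-01"), new Date("1996-04-01"));
console.log(sampleCounts);
```

With this sample, "5-LOW" is counted once even though order 1 contributes two matching lineitems. Counting lineitems instead of distinct order keys would give 2 for that priority, which is the kind of inflated result that skipping the de-duplication step produces.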

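Last but not least, a suggestion about the design:

It would be simpler if your structure were the other way round: an "orders" collection with the lineitems embedded as an array of items. This would avoid having duplicated order data all over the collection.

Further information: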
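
To illustrate why the embedded layout is simpler, here is a rough sketch in plain JavaScript. The document shape (an order with a `lineitems` array) and the helper name `countLateOrdersByPriority` are assumptions made for this example only. Because each order appears exactly once, the SQL "exists" becomes a check on the embedded array and no de-duplication by order key is needed.

```javascript
// Assumed embedded design: one document per order, lineitems stored inside it.
var sampleOrders = [
    { o_orderkey: 1, o_orderpriority: "5-LOW", o_orderdate: new Date("1996-01-02"),
      lineitems: [
          { l_linenumber: 1, l_commitdate: new Date("1996-02-12"), l_receiptdate: new Date("1996-03-22") },
          { l_linenumber: 2, l_commitdate: new Date("1996-02-28"), l_receiptdate: new Date("1996-03-05") }
      ] },
    { o_orderkey: 2, o_orderpriority: "1-URGENT", o_orderdate: new Date("1996-02-10"),
      lineitems: [
          { l_linenumber: 1, l_commitdate: new Date("1996-03-01"), l_receiptdate: new Date("1996-03-09") }
      ] },
    { o_orderkey: 3, o_orderpriority: "5-LOW", o_orderdate: new Date("1996-03-15"),
      lineitems: [
          { l_linenumber: 1, l_commitdate: new Date("1996-04-20"), l_receiptdate: new Date("1996-04-10") }
      ] }
];

// Hypothetical helper: counts orders per priority that have at least one
// lineitem committed before it was received.
function countLateOrdersByPriority(orders, dateMin, dateMax) {
    var counts = {};
    orders.forEach(function (order) {
        // "where" on the order date, range [dateMin, dateMax)
        if (order.o_orderdate < dateMin || order.o_orderdate >= dateMax) return;
        // SQL "exists": at least one embedded lineitem matches
        var hasLateItem = order.lineitems.some(function (li) {
            return li.l_commitdate < li.l_receiptdate;
        });
        if (hasLateItem) {
            counts[order.o_orderpriority] = (counts[order.o_orderpriority] || 0) + 1;
        }
    });
    return counts;
}

var embeddedCounts = countLateOrdersByPriority(
    sampleOrders, new Date("1996-01-01"), new Date("1996-04-01"));
console.log(embeddedCounts);
```

Each order contributes at most one to its priority's count, no matter how many of its lineitems match.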

http://docs.mongodb.org/manual/reference/command/mapReduce/#mapReduce

http://docs.mongodb.org/manual/reference/aggregation

Does this help?

Cheers

Ronald