Join operation with MongoDB MapReduce

2020-07-11 06:14发布

I've been using MapReduce before to perform classical MR operation, the equivalent of GROUP BY in SQL.

I was wondering if it would be conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?

Thanks!

2条回答
一纸荒年 Trace。
2楼-- · 2020-07-11 06:37

MongoDB doesn't support relational operations likes joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products collection with this schema:

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
    sales:
    [ 
        { id:1, date: "20100316", howMany: 2 },
        { id:2, date: "20100316", howMany: 5 }
    ]
}

Then whenever you retrieve a product, you also get its sales data so there's no need to join or lookup the info somewhere else.

Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:

SQL: SELECT Sales WHERE ProductId = 123

MongoDB: db.sales.find( { productid: 123 } )

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
}

sales

{
    id: 1,
    productid: 123,
    date: "20100316",
    howMany: 2 
}

{
    id: 2,
    productid: 123,
    date: "20100316",
    howMany: 5
}
查看更多
Deceive 欺骗
3楼-- · 2020-07-11 06:38

My approach is below :

having a look to hadoop I have find CompositeInputFormat approach brefily, it takes two or more collections as an input for map-reduce job

according to my investigation mongodb dont provide this yet. mongodb mapReduce is performed on one colletion at a time.(please correct if I am worng)

so I have decided to put the collections that need to be joined in one collection on wich I will perform the mapreduce for "sql right join"

this is from my log reporter project. the first phase map-reduce is enough to perform right join in case "no clock". the second phase map-reduce has the aim to exclude superfluous right join caused by clock field.

db.test.drop();
db.test.insert({"username" : 1, "day" : 1, "clock" : 0 });
db.test.insert({"username" : 1, "day" : 1, "clock" : 1 });
db.test.insert({"username" : 1,  startDay : 1,endDay:2, "table" : "user" });

//startDay : 1,endDay:2 are used to define the employers working day (join to company - left the company)
//you can use an array instedad of array here. for example day:[1,2,3, ...]

m1 = function(){
   if( typeof this.table!= "undefined" && this.table!=null){
       username = this.username;
       startDay = this.startDay;
       endDay   = this.endDay;
       while(startDay<=endDay){
           emit({username:username,day:startDay},{clocks:["join"]});
          // emit({username:username,day:startDay},1);
           startDay++;
       }
   }else{
       emit({username:this.username,day:this.day},{clocks:[this.clock]});
   }
}
r1 = function(key,values){
    result = {clocks:[]}
    values.forEach(function(x){
        result.clocks = x.clocks.concat(result.clocks);
        result.clocks=result.clocks.filter(function(element, index, array){
            return element!="join";            
        })
    })
    return result;
}

db.test.mapReduce(m1,r1,{out:"result1"})
db.test.find();
db.result1.find();

m2=function(){
   key=this._id;
   this.value.clocks.forEach(function(x){
       key.clock=x;
       emit(key,1);       
   })   
}
r2 = function(key,values){
    value=0;
    values.forEach(function(x){
        value+=1;      
    })
    return result;
}

db.result1.mapReduce(m2,r2,{out:"result2"})
db.test.find();
db.result2.find();
查看更多
登录 后发表回答