I'm trying to run a $graphLookup
like demonstrated in print bellow:
The objective is to, given a specific record (commented $match
there), retrieve it's full "path" throught immediateAncestors
property. As you can see, it's not happening.
I introduced $convert
here to deal with _id
from collection as string
, believing it could be possible to "match" with _id
from immediateAncestors
records list (which is a string
).
So, I did run another test with different data (no ObjectId
s involved):
db.nodos.insert({"id":5,"name":"cinco","children":[{"id":4}]})
db.nodos.insert({"id":4,"name":"quatro","ancestors":[{"id":5}],"children":[{"id":3}]})
db.nodos.insert({"id":6,"name":"seis","children":[{"id":3}]})
db.nodos.insert({"id":1,"name":"um","children":[{"id":2}]})
db.nodos.insert({"id":2,"name":"dois","ancestors":[{"id":1}],"children":[{"id":3}]})
db.nodos.insert({"id":3,"name":"três","ancestors":[{"id":2},{"id":4},{"id":6}]})
db.nodos.insert({"id":7,"name":"sete","children":[{"id":5}]})
And the query:
db.nodos.aggregate( [
{ $match: { "id": 3 } },
{ $graphLookup: {
from: "nodos",
startWith: "$ancestors.id",
connectFromField: "ancestors.id",
connectToField: "id",
as: "ANCESTORS_FROM_BEGINNING"
}
},
{ $project: {
"name": 1,
"id": 1,
"ANCESTORS_FROM_BEGINNING": "$ANCESTORS_FROM_BEGINNING.id"
}
}
] )
...which outputs what I was expecting (the five records directly and indirectly connected to the one with id
3):
{
"_id" : ObjectId("5afe270fb4719112b613f1b4"),
"id" : 3.0,
"name" : "três",
"ANCESTORS_FROM_BEGINNING" : [
1.0,
4.0,
6.0,
5.0,
2.0
]
}
The question is: there is a way to achieve the objetive I mentioned in the beginning?
I'm running Mongo 3.7.9 (from official Docker)
Thanks in advance!
You are currently using a development version of MongoDB which has some features enabled expected to be released with MongoDB 4.0 as an official release. Note that some features may be subject to change before the final release, so production code should be aware of this before you commit to it.
Why $convert fails here
Probably the best way to explain this is to look at your altered sample but replacing with ObjectId
values for _id
and "strings" for those under the the arrays:
{
"_id" : ObjectId("5afe5763419503c46544e272"),
"name" : "cinco",
"children" : [ { "_id" : "5afe5763419503c46544e273" } ]
},
{
"_id" : ObjectId("5afe5763419503c46544e273"),
"name" : "quatro",
"ancestors" : [ { "_id" : "5afe5763419503c46544e272" } ],
"children" : [ { "_id" : "5afe5763419503c46544e277" } ]
},
{
"_id" : ObjectId("5afe5763419503c46544e274"),
"name" : "seis",
"children" : [ { "_id" : "5afe5763419503c46544e277" } ]
},
{
"_id" : ObjectId("5afe5763419503c46544e275"),
"name" : "um",
"children" : [ { "_id" : "5afe5763419503c46544e276" } ]
}
{
"_id" : ObjectId("5afe5763419503c46544e276"),
"name" : "dois",
"ancestors" : [ { "_id" : "5afe5763419503c46544e275" } ],
"children" : [ { "_id" : "5afe5763419503c46544e277" } ]
},
{
"_id" : ObjectId("5afe5763419503c46544e277"),
"name" : "três",
"ancestors" : [
{ "_id" : "5afe5763419503c46544e273" },
{ "_id" : "5afe5763419503c46544e274" },
{ "_id" : "5afe5763419503c46544e276" }
]
},
{
"_id" : ObjectId("5afe5764419503c46544e278"),
"name" : "sete",
"children" : [ { "_id" : "5afe5763419503c46544e272" } ]
}
That should give a general simulation of what you were trying to work with.
What you attempted was to convert the _id
value into a "string" via $project
before entering the $graphLookup
stage. The reason this fails is whilst you did an initial $project
"within" this pipeline, the problem is that the source for $graphLookup
in the "from"
option is still the unaltered collection and therefore you don't get the correct details on the subsequent "lookup" iterations.
db.strcoll.aggregate([
{ "$match": { "name": "três" } },
{ "$addFields": {
"_id": { "$toString": "$_id" }
}},
{ "$graphLookup": {
"from": "strcoll",
"startWith": "$ancestors._id",
"connectFromField": "ancestors._id",
"connectToField": "_id",
"as": "ANCESTORS_FROM_BEGINNING"
}},
{ "$project": {
"name": 1,
"ANCESTORS_FROM_BEGINNING": "$ANCESTORS_FROM_BEGINNING._id"
}}
])
Does not match on the "lookup" therefore:
{
"_id" : "5afe5763419503c46544e277",
"name" : "três",
"ANCESTORS_FROM_BEGINNING" : [ ]
}
"Patching" the problem
However that is the core problem and not a failing of $convert
or it's aliases itself. In order to make this actually work we can instead create a "view" which presents itself as a collection for the sake of input.
I'll do this the other way around and convert the "strings" to ObjectId
via $toObjectId
:
db.createView("idview","strcoll",[
{ "$addFields": {
"ancestors": {
"$ifNull": [
{ "$map": {
"input": "$ancestors",
"in": { "_id": { "$toObjectId": "$$this._id" } }
}},
"$$REMOVE"
]
},
"children": {
"$ifNull": [
{ "$map": {
"input": "$children",
"in": { "_id": { "$toObjectId": "$$this._id" } }
}},
"$$REMOVE"
]
}
}}
])
Using the "view" however means that the data is consistently seen with the values converted. So the following aggregation using the view:
db.idview.aggregate([
{ "$match": { "name": "três" } },
{ "$graphLookup": {
"from": "idview",
"startWith": "$ancestors._id",
"connectFromField": "ancestors._id",
"connectToField": "_id",
"as": "ANCESTORS_FROM_BEGINNING"
}},
{ "$project": {
"name": 1,
"ANCESTORS_FROM_BEGINNING": "$ANCESTORS_FROM_BEGINNING._id"
}}
])
Returns the expected output:
{
"_id" : ObjectId("5afe5763419503c46544e277"),
"name" : "três",
"ANCESTORS_FROM_BEGINNING" : [
ObjectId("5afe5763419503c46544e275"),
ObjectId("5afe5763419503c46544e273"),
ObjectId("5afe5763419503c46544e274"),
ObjectId("5afe5763419503c46544e276"),
ObjectId("5afe5763419503c46544e272")
]
}
Fixing the problem
With all of that said, the real issue here is that you have some data which "looks like" an ObjectId
value and is in fact valid as an ObjectId
, however it has been recorded as a "string". The basic issue to everything working as it should is that the two "types" are not the same and this results in an equality mismatch as the "joins" are attempted.
So the real fix is still the same as it always has been, which is to instead go through the data and fix it so that the "strings" are actually also ObjectId
values. These will then match the _id
keys which they are meant to refer to, and you are saving a considerable amount of storage space since an ObjectId
takes up a lot less space to store than it's string representation in hexadecimal characters.
Using MongoDB 4.0 methods, you "could" actually use the "$toObjectId"
in order to write a new collection, just in much the same matter that we created the "view" earlier:
db.strcoll.aggregate([
{ "$addFields": {
"ancestors": {
"$ifNull": [
{ "$map": {
"input": "$ancestors",
"in": { "_id": { "$toObjectId": "$$this._id" } }
}},
"$$REMOVE"
]
},
"children": {
"$ifNull": [
{ "$map": {
"input": "$children",
"in": { "_id": { "$toObjectId": "$$this._id" } }
}},
"$$REMOVE"
]
}
}}
{ "$out": "fixedcol" }
])
Or of course where you "need" to keep the same collection, then the traditional "loop and update" remains the same as what has always been required:
var updates = [];
db.strcoll.find().forEach(doc => {
var update = { '$set': {} };
if ( doc.hasOwnProperty('children') )
update.$set.children = doc.children.map(e => ({ _id: new ObjectId(e._id) }));
if ( doc.hasOwnProperty('ancestors') )
update.$set.ancestors = doc.ancestors.map(e => ({ _id: new ObjectId(e._id) }));
updates.push({
"updateOne": {
"filter": { "_id": doc._id },
update
}
});
if ( updates.length > 1000 ) {
db.strcoll.bulkWrite(updates);
updates = [];
}
})
if ( updates.length > 0 ) {
db.strcoll.bulkWrite(updates);
updates = [];
}
Which is actually a bit of a "sledgehammer" due to actually overwriting the entire array in a single go. Not a great idea for a production environment, but enough as a demonstration for the purposes of this exercise.
Conclusion
So whilst MongoDB 4.0 will add these "casting" features which can indeed be very useful, their actual intent is not really for cases such as this. They are in fact much more useful as demonstrated in the "conversion" to a new collection using an aggregation pipeline than most other possible uses.
Whilst we "can" create a "view" which transforms the data types to enable things like $lookup
and $graphLookup
to work where the actual collection data differs, this really is only a "band-aid" on the real problem as the data types really should not differ, and should in fact be permanently converted.
Using a "view" actually means that the aggregation pipeline for construction needs to effectively run every time the "collection" ( actually a "view" ) is accessed, which creates a real overhead.
Avoiding overhead is usually a design goal, therefore correcting such data storage mistakes is imperative to getting real performance out of your application, rather than just working with "brute force" that will only slow things down.
A much safer "conversion" script which applied "matched" updates to each array element. The code here requires NodeJS v10.x and a latest release MongoDB node driver 3.1.x:
const { MongoClient, ObjectID: ObjectId } = require('mongodb');
const EJSON = require('mongodb-extended-json');
const uri = 'mongodb://localhost/';
const log = data => console.log(EJSON.stringify(data, undefined, 2));
(async function() {
try {
const client = await MongoClient.connect(uri);
let db = client.db('test');
let coll = db.collection('strcoll');
let fields = ["ancestors", "children"];
let cursor = coll.find({
$or: fields.map(f => ({ [`${f}._id`]: { "$type": "string" } }))
}).project(fields.reduce((o,f) => ({ ...o, [f]: 1 }),{}));
let batch = [];
for await ( let { _id, ...doc } of cursor ) {
let $set = {};
let arrayFilters = [];
for ( const f of fields ) {
if ( doc.hasOwnProperty(f) ) {
$set = { ...$set,
...doc[f].reduce((o,{ _id },i) =>
({ ...o, [`${f}.$[${f.substr(0,1)}${i}]._id`]: ObjectId(_id) }),
{})
};
arrayFilters = [ ...arrayFilters,
...doc[f].map(({ _id },i) =>
({ [`${f.substr(0,1)}${i}._id`]: _id }))
];
}
}
if (arrayFilters.length > 0)
batch = [ ...batch,
{ updateOne: { filter: { _id }, update: { $set }, arrayFilters } }
];
if ( batch.length > 1000 ) {
let result = await coll.bulkWrite(batch);
batch = [];
}
}
if ( batch.length > 0 ) {
log({ batch });
let result = await coll.bulkWrite(batch);
log({ result });
}
await client.close();
} catch(e) {
console.error(e)
} finally {
process.exit()
}
})()
Produces and executes bulk operations like these for the seven documents:
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e272"
}
},
"update": {
"$set": {
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e273"
}
}
},
"arrayFilters": [
{
"c0._id": "5afe5763419503c46544e273"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e273"
}
},
"update": {
"$set": {
"ancestors.$[a0]._id": {
"$oid": "5afe5763419503c46544e272"
},
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e277"
}
}
},
"arrayFilters": [
{
"a0._id": "5afe5763419503c46544e272"
},
{
"c0._id": "5afe5763419503c46544e277"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e274"
}
},
"update": {
"$set": {
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e277"
}
}
},
"arrayFilters": [
{
"c0._id": "5afe5763419503c46544e277"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e275"
}
},
"update": {
"$set": {
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e276"
}
}
},
"arrayFilters": [
{
"c0._id": "5afe5763419503c46544e276"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e276"
}
},
"update": {
"$set": {
"ancestors.$[a0]._id": {
"$oid": "5afe5763419503c46544e275"
},
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e277"
}
}
},
"arrayFilters": [
{
"a0._id": "5afe5763419503c46544e275"
},
{
"c0._id": "5afe5763419503c46544e277"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5763419503c46544e277"
}
},
"update": {
"$set": {
"ancestors.$[a0]._id": {
"$oid": "5afe5763419503c46544e273"
},
"ancestors.$[a1]._id": {
"$oid": "5afe5763419503c46544e274"
},
"ancestors.$[a2]._id": {
"$oid": "5afe5763419503c46544e276"
}
}
},
"arrayFilters": [
{
"a0._id": "5afe5763419503c46544e273"
},
{
"a1._id": "5afe5763419503c46544e274"
},
{
"a2._id": "5afe5763419503c46544e276"
}
]
}
},
{
"updateOne": {
"filter": {
"_id": {
"$oid": "5afe5764419503c46544e278"
}
},
"update": {
"$set": {
"children.$[c0]._id": {
"$oid": "5afe5763419503c46544e272"
}
}
},
"arrayFilters": [
{
"c0._id": "5afe5763419503c46544e272"
}
]
}
}