I have a collection referencing GridFS files, generally 1-2 files per record. The collections are reasonably large: about 705k records in the parent collection and 790k GridFS files. Over time, a number of orphaned GridFS files have accumulated - the parent records were deleted, but the referenced files weren't. I'm now attempting to clean the orphaned files out of the GridFS collection.
The problem with an approach like the one suggested here is that combining the 700k records into a single large list of ids produces a Python list that's about 4MB in memory - passing that into a $nin query against the fs.files collection takes literally forever. Doing the reverse (getting a list of all ids in fs.files and querying the parent collection to see if they exist) also takes forever.
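For reference, the approach that does not scale looks roughly like this in the shell (collection and field names here are placeholders):

var keep = db.parent_collection.distinct("file_id");          // ~700k ids, several MB in memory
var orphans = db.fs.files.find({ "_id": { "$nin": keep } });  // $nin against a huge array crawls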
Has anybody come up against this and developed a faster solution?
Firstly, let's take the time to consider what GridFS actually is. As a starter, let's read from the manual page that is referenced:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
So with that out of the way: that may well be your use case. But the lesson to learn here is that GridFS is not automatically the "go-to" method for storing files.
What has happened here in your case (and others) is that, because this is a "driver level" specification (MongoDB itself does no magic here), your "files" have been "split" across two collections: one collection for the main reference to the content, and the other for the "chunks" of data.
Your problem (and that of others) is that you have managed to leave behind the "chunks" now that the "main" reference has been removed. So, with a large number of them, how do you get rid of the orphans?
Your current reading says "loop and compare", and since MongoDB does not do joins, there really is no other answer. But there are some things that can help.
So rather than run a huge $nin, try doing a few different things to break this up. Consider working in the reverse order, for example:
db.fs.chunks.aggregate([
    { "$group": { "_id": "$files_id" } },   // the distinct files_id values
    { "$sort": { "_id": 1 } },              // sorted, so the batch can be resumed below
    { "$limit": 5000 }
])
So what you are doing there is getting the distinct "files_id" values (being the references to fs.files) from all of the entries, starting with 5000 of your entries. Then of course you're back to looping, checking fs.files for a matching _id. If one is not found, then remove the documents matching that "files_id" from your "chunks".
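A minimal sketch of that check-and-delete pass over one batch, keeping track of the last id seen (which the next step relies on):

var last_id = null;
db.fs.chunks.aggregate([
    { "$group": { "_id": "$files_id" } },
    { "$sort": { "_id": 1 } },
    { "$limit": 5000 }
]).forEach(function(doc) {
    last_id = doc._id;                                   // remember for the next batch
    if (db.fs.files.count({ "_id": doc._id }) === 0) {   // no main reference left
        db.fs.chunks.remove({ "files_id": doc._id });    // so these chunks are orphans
    }
});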
But that was only 5000, so keep the last id found in that set, because now you are going to run the same aggregate statement again, but differently:
db.fs.chunks.aggregate([
    { "$match": { "files_id": { "$gt": last_id } } },   // resume after the last id already checked
    { "$group": { "_id": "$files_id" } },
    { "$sort": { "_id": 1 } },
    { "$limit": 5000 }
])
So this works because the ObjectId values are monotonic, or "ever increasing": all new entries are always greater than the last. Then you can go and loop those values again and do the same deletes where nothing is found.
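Putting the two together, a sketch of the full paged cleanup loop looks like this:

var last_id = null;
var batch;
do {
    batch = db.fs.chunks.aggregate([
        { "$match": (last_id === null) ? {} : { "files_id": { "$gt": last_id } } },
        { "$group": { "_id": "$files_id" } },
        { "$sort": { "_id": 1 } },
        { "$limit": 5000 }
    ]).toArray();
    batch.forEach(function(doc) {
        last_id = doc._id;
        if (db.fs.files.count({ "_id": doc._id }) === 0) {
            db.fs.chunks.remove({ "files_id": doc._id });
        }
    });
} while (batch.length > 0);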
Will this "take forever". Well yes. You might employ db.eval()
for this, but read the documentation. But overall, this is the price you pay for using two collections.
Back to the start. The GridFS spec is designed this way because it specifically wants to work around the 16MB limitation. But if that is not your limitation, then question why you are using GridFS in the first place.
MongoDB has no problem storing "binary" data within any element of a given BSON document. So you do not need to use GridFS just to store files. And if you had done so, then all of your updates would be completely "atomic", as they only act on one document in one collection at a time.
Since GridFS deliberately splits documents across collections, if you use it, then you live with the pain. So use it if you need it; but if you do not, then just store the BinData as a normal field, and these problems go away.
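For example, a small file can live directly in the parent document. A minimal sketch with hypothetical collection and field names; this only works while the whole document stays under the 16MB BSON limit:

// The file bytes are stored inline as BinData (base64 in the shell) instead of in GridFS.
db.records.insert({
    name: "hello.txt",
    contentType: "text/plain",
    data: BinData(0, "SGVsbG8gd29ybGQ=")   // "Hello world" as base64
})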
But at least you have a better approach to take than loading everything into memory.
I would like to add my bit to this discussion. Depending on the size of the difference, you may find it reasonable to first find the identities of the files you have to keep, and then remove the chunks that should not be kept. This can happen when you are managing huge amounts of temporary files.
In my case we have quite a number of temporary files being saved to GridFS on a daily basis. We currently have something like 180k temporary files and a few non-temporary ones. When the expiration index kicks in, we end up with approx. 400k orphans.
A useful thing to know when trying to find those files is that ObjectID is based on a timestamp. As such, you can narrow your searches between dates by enclosing the range on _id or files_id.
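You can see this quickly in the mongo shell: the first 4 bytes of an ObjectID encode its creation time in epoch seconds, so an _id or files_id range is effectively a date range.

var id = ObjectId();         // a freshly generated id
print(id.getTimestamp());    // prints an ISODate very close to "now"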
To start looking for files, I start with a loop on dates like this:
var nowDate = new Date();
nowDate.setDate(nowDate.getDate()-1);
var startDate = new Date(nowDate);
startDate.setMonth(startDate.getMonth()-1) // -1 month from now
var endDate = new Date(startDate);
endDate.setDate(startDate.getDate()+1); // -1 month +1 day from now
while(endDate.getTime() <= nowDate.getTime()) {
    // interior further in this answer
}
Inside it, I create variables to search in a range of IDs:
// ObjectId() in the mongo shell expects a 24-char hex string, so pad the epoch seconds with zeroes to build the boundary ids
var idGTE = ObjectId(Math.floor(startDate.getTime()/1000).toString(16) + "0000000000000000");
var idLT = ObjectId(Math.floor(endDate.getTime()/1000).toString(16) + "0000000000000000");
and collecting into a variable the IDs of files that do exist in the .files collection:
var found = db.getCollection("collection.files").find({
    _id: {
        $gte: idGTE,
        $lt: idLT
    }
}).map(function(o) { return o._id; });
For now I have approx. 50 IDs in the found variable. Now, to remove the high number of orphans in the .chunks collection, I loop-search for 100 IDs to remove until I find nothing:
var removed = 0;
while (true) {
    // note that you have to search in an ID range, so as not to delete all your files ;)
    var idToRemove = db.getCollection("collection.chunks").find({
        files_id: {
            $gte: idGTE, // important!
            $lt: idLT, // important!
            $nin: found, // `NOT IN` var found
        },
        n: 0 // first chunk of each file, so ids are unique. Chose this over aggregate for speed
    }).limit(100).map(function(o) { return o.files_id; });

    if (idToRemove.length > 0) {
        var result = db.getCollection("collection.chunks").remove({
            files_id: {
                $gte: idGTE, // could be commented out
                $lt: idLT, // could be commented out
                $in: idToRemove // `IN` var idToRemove
            }
        });
        removed += result.nRemoved;
    } else {
        break;
    }
}
and afterwards increment the dates to get closer to the current date:
startDate.setDate(startDate.getDate()+1);
endDate.setDate(endDate.getDate()+1);
One thing I can't solve for now is that the removing operation takes quite some time. Finding and removing chunks based on files_id takes 3-5 seconds per ~200 chunks (100 unique IDs). Probably I have to create some smart index to make the finds quicker.
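If the standard GridFS chunk index is missing on this bucket, creating it should make both the find and the remove above much cheaper. A sketch; drivers normally create this index themselves when they first write to the bucket:

// The compound index GridFS relies on; lets the files_id queries above use an index scan.
db.getCollection("collection.chunks").createIndex({ files_id: 1, n: 1 }, { unique: true })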
Improvement
I packed it into a "small" task that creates the deletion process on the mongo server and disconnects. It is plain JavaScript that you can send to the mongo shell on, e.g., a daily basis:
var startDate = new Date();
startDate.setDate(startDate.getDate()-3) // from -3 days
var endDate = new Date();
endDate.setDate(endDate.getDate()-1); // until yesterday
// same padded-hex trick as above to turn the dates into ObjectId boundaries
var idGTE = ObjectId(Math.floor(startDate.getTime()/1000).toString(16) + "0000000000000000");
var idLT = ObjectId(Math.floor(endDate.getTime()/1000).toString(16) + "0000000000000000");
var found = db.getCollection("collection.files").find({
    _id: {
        $gte: idGTE,
        $lt: idLT
    }
}).map(function(o) { return o._id; });
db.getCollection("collection.chunks").deleteMany({
files_id: {
$gte: idGTE,
$lt: idLT,
$nin: found,
}
}, {
writeConcern: {
w: 0 // "fire and forget", allows you to close console.
}
});
EDIT: Using distinct has a 16MB limit on its result, so this may not work if you have a lot of distinct files_id values. In that case you can limit the distinct operation to a subset of UUIDs.
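One way to do that (a sketch, assuming your files_id values are ObjectIds so they can be bounded; swap in appropriate bounds if yours are UUIDs) is to pass a filter as the second argument of distinct and walk the ranges one at a time:

// Restrict the distinct scan to one slice of files_id values so the result
// stays well under the 16MB reply limit (boundary ids here are hypothetical).
var sliceStart = ObjectId("65a000000000000000000000");
var sliceEnd   = ObjectId("65b000000000000000000000");
db.documents.chunks.distinct("files_id", {
    files_id: { $gte: sliceStart, $lt: sliceEnd }
});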
/*
 * This function will count orphaned files, i.e. distinct files_id values
 * in documents.chunks with no matching document in documents.files.
 * This is faster but uses more memory.
 */
function countOrphanedFilesWithDistinct(){
    var start = new Date().getTime();
    var orphanedFiles = [];
    db.documents.chunks.distinct("files_id").forEach(function(id){
        var count = db.documents.files.count({ "_id" : id });
        if(count===0){
            orphanedFiles.push(id);
        }
    });
    var stop = new Date().getTime();
    var time = stop-start;
    print("Found [ "+orphanedFiles.length+" ] orphaned files in: [ "+time+"ms ]");
}
/*
 * This function will delete any orphaned document chunks.
 * This is faster but uses more memory.
 */
function deleteOrphanedFilesWithDistinctOneBulkOp(){
    print("Building bulk delete operation");
    var bulkChunksOp = db.documents.chunks.initializeUnorderedBulkOp();
    var queued = 0;
    db.documents.chunks.distinct("files_id").forEach(function(id){
        var count = db.documents.files.count({ "_id" : id });
        if(count===0){
            bulkChunksOp.find({ "files_id" : id }).remove();
            queued++;
        }
    });
    if(queued === 0){
        print("No orphaned chunks found");   // execute() would throw on an empty bulk op
        return;
    }
    print("Executing bulk delete...");
    var result = bulkChunksOp.execute();
    print("Num Removed: [ "+result.nRemoved+" ]");
}
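Typical usage from the mongo shell, with the same documents.files / documents.chunks collection names used above:

countOrphanedFilesWithDistinct();            // dry run: only reports how many orphans exist
deleteOrphanedFilesWithDistinctOneBulkOp();  // actually removes the orphaned chunks in one bulk op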