I use MongoDB to store compressed HTML files. Basically, a complete document in mongo looks like:

{'_id': 1, 'p1': data1, 'p2': data2, 'p3': data3}

where data1, data2 and data3 are bson.binary.Binary(zlib_compressed_html).

I have 12 million ids, and each dataX is on average 90KB, so each document has a size of at least 180KB + sizeof(_id) + some overhead. The total data size would be at least 2TB.

I would like to note that '_id' is indexed.
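For reference, this is roughly how such a document is put together (only a sketch with made-up page content and a placeholder helper, not my actual pipeline):

import zlib
from bson.binary import Binary

def make_page_value(html_bytes):
    # Each pN value is zlib-compressed HTML wrapped in a BSON Binary.
    return Binary(zlib.compress(html_bytes))

doc = {
    '_id': 1,
    'p1': make_page_value(b'<html>...page 1...</html>'),
    'p2': make_page_value(b'<html>...page 2...</html>'),
}
# With ~90KB per compressed page and 12 million _ids, two pages per
# document already gives roughly 12e6 * 180KB ~= 2TB of data.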
I insert into mongo in the following way:
def _save(self, mongo_col, my_id, page, html):
    doc = mongo_col.find_one({'_id': my_id})
    key = 'p%d' % page
    success = False
    if doc is None:
        doc = {'_id': my_id, key: html}
        try:
            mongo_col.save(doc, safe=True)
            success = True
        except:
            log.exception('Exception saving to mongodb')
    else:
        try:
            mongo_col.update({'_id': my_id}, {'$set': {key: html}})
            success = True
        except:
            log.exception('Exception updating mongodb')
    return success
As you can see, I first look up the collection to see whether a document with my_id exists. If it does not exist, I create it and save it to mongo; otherwise I update it.

The problem with the above is that, although it was super fast at first, at some point it became really slow. To give you some numbers: when it was fast I was doing 1,500,000 saves per 4 hours (roughly 100 per second), and afterwards only 300,000 per 4 hours (roughly 20 per second).
I suspect that the following affects the speed:
Note
When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.
As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation:
{ getLastError: 1 }
Refer to the documentation on write concern in the Write Operations document for more information.
The above is from: http://docs.mongodb.org/manual/applications/update/
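One way to check whether documents are actually being relocated might be to look at the collection statistics (only a sketch; I am assuming the collStats command exposes avgObjSize and paddingFactor, and the database/collection names below are placeholders):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)

# Placeholder database/collection names.
stats = client.my_db.command('collstats', 'my_collection')

# A paddingFactor that keeps growing above 1.0 would suggest mongod is
# repeatedly relocating documents because updates make them larger.
print('avgObjSize    : %s' % stats.get('avgObjSize'))
print('paddingFactor : %s' % stats.get('paddingFactor'))
print('storageSize   : %s' % stats.get('storageSize'))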
I am saying that because we could have the following:
{'_id': 1, 'p1': some_data}, ..., {'_id': 10000000, 'p2': some_data2}, ..., {'_id': N, 'p1': sd3}
and imagine that I am calling the above _save method as:
_save(my_collection, 1, 2, bin_compressed_html)
This should update the doc with _id 1. But if what the mongo docs describe is the case, then because I am adding a key to the document, the document no longer fits in its allocated space and mongo has to relocate it.
It may move the document to the end of the collection, which could be very far away on disk. Could this slow things down?
Or does the slowdown have to do with the size of the collection?
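To make the relocation concern concrete, here is a rough sketch of how the serialized document size jumps when a page key is added (the data is made up, and I am using bson.BSON.encode only to measure the BSON size):

import zlib
import bson
from bson.binary import Binary

# Made-up data, just to illustrate the size jump when a page is added.
html = b'<html>' + b'x' * 200000 + b'</html>'
compressed = Binary(zlib.compress(html))

doc = {'_id': 1, 'p1': compressed}
print('size with one page : %d bytes' % len(bson.BSON.encode(doc)))

doc['p2'] = compressed  # what my _save() effectively does via $set
print('size with two pages: %d bytes' % len(bson.BSON.encode(doc)))
# If the new size exceeds the space allocated for the record, mongod
# has to move the document to a new location on disk.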
In any case, do you think it would be more efficient to modify my structure to be like:
{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}
where mid=my_id, p=page, d=compressed html
and modify the _save method to do only inserts?
def _save(self, mongo_col, my_id, page, html):
    doc = {'mid': my_id, 'p': page, 'd': html}
    success = False
    try:
        mongo_col.save(doc, safe=True)
        success = True
    except:
        log.exception('Exception saving to mongodb')
    return success
This way I avoid the update (and therefore the rearrangement on disk) and one lookup (find_one), but there would be 3x more documents and I would need 2 indexes (_id and mid).
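If I went that way, I guess the second index would be created with something like this (a sketch, assuming PyMongo's ensure_index; whether a compound index on mid and p is better than mid alone is exactly the kind of thing I am unsure about):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
mongo_col = client.my_db.pages   # placeholder database/collection names

# One compound index covers both "all pages of an id" (prefix on mid)
# and "a specific page of an id" (mid + p) lookups.
mongo_col.ensure_index([('mid', pymongo.ASCENDING),
                        ('p', pymongo.ASCENDING)])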
What do you suggest?