I use MongoDB to store compressed HTML files. Basically, a complete document in mongo looks like:

{'_id': 1, 'p1': data1, 'p2': data2, 'p3': data3}

where data1, data2 and data3 are bson.binary.Binary(zlib_compressed_html).

I have 12 million ids, and each dataX is on average 90KB, so each document has a size of at least 180KB + sizeof(_id) + some overhead. The total data size would be at least 2TB.

I would like to note that '_id' is indexed.
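For reference, this is roughly how such a document is put together (only a sketch with made-up page content and a placeholder helper, not my actual pipeline):

import zlib
from bson.binary import Binary

def make_page_value(html_bytes):
    # Each pN value is zlib-compressed HTML wrapped in a BSON Binary.
    return Binary(zlib.compress(html_bytes))

doc = {
    '_id': 1,
    'p1': make_page_value(b'<html>...page 1...</html>'),
    'p2': make_page_value(b'<html>...page 2...</html>'),
}
# With ~90KB per compressed page and 12 million _ids, two pages per
# document already gives roughly 12e6 * 180KB ~= 2TB of data.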
I insert into mongo in the following way:
def _save(self, mongo_col, my_id, page, html):
    doc = mongo_col.find_one({'_id': my_id})
    key = 'p%d' % page
    success = False
    if doc is None:
        doc = {'_id': my_id, key: html}
        try:
            mongo_col.save(doc, safe=True)
            success = True
        except:
            log.exception('Exception saving to mongodb')
    else:
        try:
            mongo_col.update({'_id': my_id}, {'$set': {key: html}})
            success = True
        except:
            log.exception('Exception updating mongodb')
    return success
As you can see, I first look up the collection to see whether a document with my_id exists. If it does not exist, I create it and save it to mongo; otherwise I update it.

The problem with the above is that, although it was super fast at first, at some point it became really slow. To give you some numbers: when it was fast I was doing 1,500,000 saves per 4 hours (roughly 100 per second), and afterwards only 300,000 per 4 hours (roughly 20 per second).
I suspect that the following affects the speed:
Note
When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.
As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation:
{ getLastError: 1 }
Refer to the documentation on write concern in the Write Operations document for more information.
The above is from: http://docs.mongodb.org/manual/applications/update/
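One way to check whether documents are actually being relocated might be to look at the collection statistics (only a sketch; I am assuming the collStats command exposes avgObjSize and paddingFactor, and the database/collection names below are placeholders):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)

# Placeholder database/collection names.
stats = client.my_db.command('collstats', 'my_collection')

# A paddingFactor that keeps growing above 1.0 would suggest mongod is
# repeatedly relocating documents because updates make them larger.
print('avgObjSize    : %s' % stats.get('avgObjSize'))
print('paddingFactor : %s' % stats.get('paddingFactor'))
print('storageSize   : %s' % stats.get('storageSize'))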
I am saying that because we could have the following:
{'_id': 1, 'p1': some_data}, ..., {'_id': 10000000, 'p2': some_data2}, ..., {'_id': N, 'p1': sd3}
and imagine that I am calling the above _save method as:
_save(my_collection, 1, 2, bin_compressed_html)
This should update the doc with _id 1. But if what the mongo docs describe is the case, then because I am adding a key to the document, the document no longer fits in its allocated space and mongo has to relocate it.
It may move the document to the end of the collection, which could be very far away on disk. Could this slow things down?
Or does the slowdown have to do with the size of the collection?
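To make the relocation concern concrete, here is a rough sketch of how the serialized document size jumps when a page key is added (the data is made up, and I am using bson.BSON.encode only to measure the BSON size):

import zlib
import bson
from bson.binary import Binary

# Made-up data, just to illustrate the size jump when a page is added.
html = b'<html>' + b'x' * 200000 + b'</html>'
compressed = Binary(zlib.compress(html))

doc = {'_id': 1, 'p1': compressed}
print('size with one page : %d bytes' % len(bson.BSON.encode(doc)))

doc['p2'] = compressed  # what my _save() effectively does via $set
print('size with two pages: %d bytes' % len(bson.BSON.encode(doc)))
# If the new size exceeds the space allocated for the record, mongod
# has to move the document to a new location on disk.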
In any case, do you think it would be more efficient to modify my structure to be like:
{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}
where mid=my_id, p=page, d=compressed html
and modify the _save method to do only inserts?
def _save(self, mongo_col, my_id, page, html):
    doc = {'mid': my_id, 'p': page, 'd': html}
    success = False
    try:
        mongo_col.save(doc, safe=True)
        success = True
    except:
        log.exception('Exception saving to mongodb')
    return success
This way I avoid the update (and therefore the rearrangement on disk) and one lookup (find_one), but there would be 3x more documents and I would need 2 indexes (_id and mid).
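If I went that way, I guess the second index would be created with something like this (a sketch, assuming PyMongo's ensure_index; whether a compound index on mid and p is better than mid alone is exactly the kind of thing I am unsure about):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
mongo_col = client.my_db.pages   # placeholder database/collection names

# One compound index covers both "all pages of an id" (prefix on mid)
# and "a specific page of an id" (mid + p) lookups.
mongo_col.ensure_index([('mid', pymongo.ASCENDING),
                        ('p', pymongo.ASCENDING)])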
What do you suggest?