I have the following mongoengine model:
```python
class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)
```
In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).
I have encountered two (possibly related) issues:
- When I run a native pymongo `find_one()` query, it returns within a second. When I run `MyModel.objects.first()` it takes 5-10 seconds.
- When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following:

```python
m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)
```
The data in the object does not contain any references to other objects, so it is not an issue of object dereferencing.
I suspect it is related to some inefficiency in the internal data representation of mongoengine, which affects document object construction as well as field access. Is there anything I can do to improve this?
TL;DR: mongoengine is spending ages converting all the returned arrays to dicts
To test this out I built a collection with a document containing a `DictField` with a large nested dict, putting the doc roughly in your 5-10MB range. We can then use `timeit.timeit` to confirm the difference in reads between pymongo and mongoengine, and use pycallgraph and GraphViz to see what is taking mongoengine so damn long.
Here is the code in full:
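The code block itself didn't survive the formatting here; a minimal sketch of such a benchmark, assuming a local MongoDB on the default port (the `perf_test` database name and the generated document shape are hypothetical stand-ins), could look like this:

```python
import timeit

def make_big_doc(n_keys=2000, n_items=50):
    # Build a nested dict of lists of dicts; with the defaults this comes
    # to several MB once stored as BSON.
    return {
        "key_%d" % i: [{"a": j, "b": "x" * 20} for j in range(n_items)]
        for i in range(n_keys)
    }

if __name__ == "__main__":
    import pymongo
    from mongoengine import DictField, Document, connect

    connect("perf_test")  # hypothetical database name

    class MyModel(Document):
        data_dict_1 = DictField()

    MyModel.drop_collection()
    MyModel(data_dict_1=make_big_doc()).save()

    # Raw pymongo handle to the same collection (mongoengine's default
    # collection name for MyModel is "my_model").
    raw = pymongo.MongoClient()["perf_test"]["my_model"]

    print("pymongo:    ", timeit.timeit(lambda: raw.find_one(), number=10))
    print("mongoengine:", timeit.timeit(lambda: MyModel.objects.first(), number=10))
```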
And the output proves that mongoengine is being very slow compared to pymongo:
The resulting call graph illustrates pretty clearly where the bottleneck is:
Essentially mongoengine will call the `to_python` method on every `DictField` that it gets back from the db. `to_python` is pretty slow, and in our example it's being called an insane number of times.

Mongoengine is used to elegantly map your document structure to python objects. If you have very large unstructured documents (which mongodb is great for) then mongoengine isn't really the right tool and you should just use pymongo.
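As a rough illustration of why this hurts (this is not mongoengine's actual implementation, just a mock of a recursive per-value conversion pass), the number of calls grows with the total number of nested values, not with the number of fields on the model:

```python
def convert(value, counter):
    # Mimic a to_python-style pass: every dict value and every list
    # element is visited and "converted" individually.
    counter[0] += 1
    if isinstance(value, dict):
        return {k: convert(v, counter) for k, v in value.items()}
    if isinstance(value, list):
        return [convert(v, counter) for v in value]
    return value

calls = [0]
doc = {"k%d" % i: [{"a": j} for j in range(100)] for i in range(1000)}
convert(doc, calls)
print(calls[0])  # 201001 calls for a model with a single DictField
```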
However, if you know the structure you can use `EmbeddedDocument` fields to get slightly better performance from mongoengine. I've run similar but not equivalent test code in this gist, and the output shows the same pattern: you can make mongoengine faster, but pymongo is much faster still.
UPDATE
A good shortcut to the pymongo interface here is to use the aggregation framework:
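For instance, a pipeline can project out just the nested key you need and hand back plain dicts, skipping mongoengine's conversion entirely (the collection and key names below are hypothetical; recent mongoengine versions also expose this via `MyModel.objects.aggregate(...)`):

```python
# Fetch only one nested key instead of materialising the whole document.
pipeline = [
    {"$match": {}},  # narrow this down server-side where possible
    {"$project": {"data_dict_1.some_key": 1, "_id": 0}},
]

if __name__ == "__main__":
    import pymongo

    coll = pymongo.MongoClient()["perf_test"]["my_model"]  # hypothetical names
    for doc in coll.aggregate(pipeline):
        print(doc)  # plain dicts straight from pymongo
```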