What's the fastest way in Python (Django) to create a JSON based upon a Django queryset. Note that parsing it in the template as proposed here is not an option.
The background is that I created a method which loops over all nodes in a tree, but is already terribly slow when converting about 300 nodes. The first (and probably the worst) idea that came up to my mind is to create the json somehow "manually". See the code below.
#! Solution 1 !!#
def quoteStr(input):
return "\"" + smart_str(smart_unicode(input)) + "\""
def createJSONTreeDump(user, node, root=False, lastChild=False):
q = "\""
#open tag for object
json = str("\n" + indent + "{" +
quoteStr("name") + ": " + quoteStr(node.name) + ",\n" +
quoteStr("id") + ": " + quoteStr(node.pk) + ",\n" +
)
childrenTag = "children"
children = node.get_children()
if children.count() > 0 :
#create children array opening tag
json += str(indent + quoteStr(childrenTag) + ": [")
#for child in children:
for idx, child in enumerate(children):
if (idx + 1) == children.count():
//recursive call
json += createJSONTreeDump(user, child, False, True, layout)
else:
//recursive call
json += createJSONTreeDump(user, child, False, False, layout)
#add children closing tag
json += "]\n"
#closing tag for object
if lastChild == False:
#more children following, add ","
json += indent + "},\n"
else:
#last child, do not add ","
json += indent + "}\n"
return json
The tree structure to be rendered is a tree build up with mptt, where the call .get_children() returns all children of a node.
The model looks as simply as this, mptt taking care of everything else.
class Node(MPTTModel, ExtraManager):
"""
Representation of a single node
"""
name = models.CharField(max_length=200)
parent = TreeForeignKey('self', null=True, blank=True, related_name='%(app_label)s_%(class)s_children')
The expected JSON result created like this in the template var root = {{ jsonTree|safe }}
Edit: Based upon this answer I created the following code (definitely the better code) but feels only slightly faster.
Solution 2:
def serializable_object(node):
"Recurse into tree to build a serializable object"
obj = {'name': node.name, 'id': node.pk, 'children': []}
for child in node.get_children():
obj['children'].append(serializable_object(child))
return obj
import json
jsonTree = json.dumps(serializable_object(nodeInstance))
Solution 3:
def serializable_object_List_Comprehension(node):
"Recurse into tree to build a serializable object"
obj = {
'name': node.name,
'id': node.pk,
'children': [serializable_object(ch) for ch in node.get_children()]
}
return obj
Solution 4:
def recursive_node_to_dict(node):
result = {
'name': node.name, 'id': node.pk
}
children = [recursive_node_to_dict(c) for c in node.get_children()],
if children is not None:
result['children'] = children
return result
from mptt.templatetags.mptt_tags import cache_tree_children
root_nodes = cache_tree_children(root.get_descendants())
dicts = []
for n in root_nodes:
dicts.append(recursive_node_to_dict(root_nodes[0]))
jsonTree = json.dumps(dicts, indent=4)
Solution 5 (make use of select_related to pre_fetch, whereas not sure if correctly used)
def serializable_object_select_related(node):
"Recurse into tree to build a serializable object, make use of select_related"
obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(), 'id': node.pk, 'level': node.level, 'position': node.position, 'children': []}
for child in node.get_children().select_related():
obj['children'].append(serializable_object(child))
return obj
Solution 6 (improved solution 4, using caching of child nodes):
def recursive_node_to_dict(node):
result = {
'name': node.name, ''id': node.pk,
#notice the use of node._cached_children instead of node.get_children()
'children' : [recursive_node_to_dict(c) for c in node._cached_children]
}
Called via:
from mptt.templatetags.mptt_tags import cache_tree_children
subTrees = cache_tree_children(root.get_descendants(include_self=True))
subTreeDicts = []
for subTree in subTrees:
subTree = recursive_node_to_dict(subTree)
subTreeDicts.append(subTree)
jsonTree = json.dumps(subTreeDicts, indent=4)
#optional clean up, remove the [ ] at the beginning and the end, its needed for D3.js
jsonTree = jsonTree[1:len(jsonTree)]
jsonTree = jsonTree[:len(jsonTree)-1]
Below you can see the profiling results, created using cProfile as suggested by MuMind, setting up a Django view to start the stand-alone method profileJSON(), which in turn calls the different solutions to create the JSON output.
def startProfileJSON(request):
print "startProfileJSON"
import cProfile
cProfile.runctx('profileJSON()', globals=globals(), locals=locals())
print "endProfileJSON"
Results:
Solution 1: 3350347 function calls (3130372 primitive calls) in 4.969 seconds (details)
Solution 2: 2533705 function calls (2354516 primitive calls) in 3.630 seconds (details)
Solution 3: 2533621 function calls (2354441 primitive calls) in 3.684 seconds (details)
Solution 4: 2812725 function calls (2466028 primitive calls) in 3.840 seconds (details)
Solution 5: 2536504 function calls (2357256 primitive calls) in 3.779 seconds (details)
Solution 6 (Improved solution 4): 2593122 function calls (2299165 primitive calls) in 3.663 seconds (details)
Discussion:
Solution 1: own encoding implementation. bad idea
Solution 2 + 3: currently the fastest, but still painfully slow
Solution 4: looks promising with caching childs, but does perform similar and currently produces not valid json as childrens are put into double []:
"children": [[]] instead of "children": []
Solution 5: use of select_related does not make a difference, whereas probably used in the wrong way, as a node always have a ForeignKey to its parent, and we are parsing from root to child.
Update: Solution 6: It looks like the cleanest solution to me, using caching of child nodes. But does only perform similar to solution 2 + 3. Which for me is strange.
Anybody more ideas for performance improvements?
Your updated version looks like there would be very little overhead. I think it would be slightly more efficient (and more readable, too!) to use a list comprehension:
Besides that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes and then run it with
(or
-m cProfile
if that works better).The can see 3 different potential bottlenecks:
.get_children()
and.name
) -- I'm not sure exactly what's going on under the hood, but I've had code like this that does a DB query for each node adding a tremendous overhead. If that's your problem, you can probably configure this to do an "eager load" using select_related or something similar.serializable_object
itself) -- Just make sure ncalls forserializable_object
looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300.json.dumps(nodeInstance)
) -- Not a likely culprit since you said it's only 300 nodes, but if you do see this taking up a lot of execution time, make sure you have the compiled speedups for JSON working properly.If you can't tell much from profiling it, make a stripped-down version that, say, recursively calls
node.name
andnode.get_children()
but doesn't store the results in a data structure, and see how that compares.Update: There are 2192 calls to
execute_sql
in solution 3 and 2192 in solution 5, so I think that excessive DB queries is a problem and thatselect_related
did nothing the way it's used above. Looking at django-mptt issue #88: Allow select_related in model methods suggests that you're using it more-or-less right, but I have my doubt, andget_children
vs.get_descendants
might make a huge difference.There's also a ton of time being taken up by
copy.deepcopy
, which is puzzling because you're not directly calling it, and I don't see it called from the MPTT code. What's tree.py?If you're doing a lot of work with profiling, I'd highly recommend the really slick tool RunSnakeRun, which lets you see your profile data in a really convenient grid form and make sense of the data more quickly.
Anyway, here's one more attempt at streamlining the DB side of things:
Note this is no longer recursive. It proceeds iteratively through nodes, theoretically after their parents have been visited, and it's all using one big call to
MPTTModel.get_descendants()
, so hopefully that's well-optimized and caches.parent
, etc. (or maybe there's a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values to their parents afterwards.After playing around with this a bit, I found that the solutions were all too slow, because mptt itself is scanning the cache multiple times to
get_children
.Capitalizing on the fact that mptt returns rows in the correct order to easily build trees, I made this:
For my data set, this runs 10x faster than other solutions because it's O(n), only iterating the data once.
I use it like this:
Organise you data into nested dictionaries or lists then call the json dump method:
I suspect by far the biggest slowdown is that this will do 1 database query per node. The json rendering is trivial in comparison to the hundreds of round-trips to your database.
You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.
Custom json encoding, while it might provide a slight speedup in some scenarios, is something I'd highly discourage, as it will be a lot of code, and it's something that's easy to get very wrong.