fastest way to create JSON to reflect a tree struc

2019-03-08 08:18发布

What's the fastest way in Python (Django) to create a JSON based upon a Django queryset. Note that parsing it in the template as proposed here is not an option.

The background is that I created a method which loops over all nodes in a tree, but is already terribly slow when converting about 300 nodes. The first (and probably the worst) idea that came up to my mind is to create the json somehow "manually". See the code below.

#! Solution 1 !!#
def quoteStr(input):
    return "\"" + smart_str(smart_unicode(input)) + "\""

def createJSONTreeDump(user, node, root=False, lastChild=False):
    q = "\""

    #open tag for object
    json = str("\n" + indent + "{" +
                  quoteStr("name") + ": " + quoteStr(node.name) + ",\n" +
                  quoteStr("id") + ": " + quoteStr(node.pk) + ",\n" +
                )

    childrenTag = "children"
    children = node.get_children()
    if children.count() > 0 :
        #create children array opening tag
        json += str(indent + quoteStr(childrenTag) + ": [")
        #for child in children:
        for idx, child in enumerate(children):
            if (idx + 1) == children.count():
                //recursive call
                json += createJSONTreeDump(user, child, False, True, layout)
            else:
                //recursive call
                json += createJSONTreeDump(user, child, False, False, layout)
        #add children closing tag
        json += "]\n"

    #closing tag for object
    if lastChild == False:
        #more children following, add ","
        json += indent + "},\n"
    else:
        #last child, do not add ","
        json += indent + "}\n"
    return json

The tree structure to be rendered is a tree build up with mptt, where the call .get_children() returns all children of a node.

The model looks as simply as this, mptt taking care of everything else.

class Node(MPTTModel, ExtraManager):
    """
    Representation of a single node
    """ 
    name = models.CharField(max_length=200)
    parent = TreeForeignKey('self', null=True, blank=True, related_name='%(app_label)s_%(class)s_children')

The expected JSON result created like this in the template var root = {{ jsonTree|safe }}

Edit: Based upon this answer I created the following code (definitely the better code) but feels only slightly faster.

Solution 2:

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {'name': node.name, 'id': node.pk, 'children': []}
    for child in node.get_children():
        obj['children'].append(serializable_object(child))
    return obj

import json
jsonTree = json.dumps(serializable_object(nodeInstance))

Solution 3:

def serializable_object_List_Comprehension(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'id': node.pk,
        'children': [serializable_object(ch) for ch in node.get_children()]
    }
    return obj

Solution 4:

def recursive_node_to_dict(node):
    result = {
        'name': node.name, 'id': node.pk
    }
    children = [recursive_node_to_dict(c) for c in node.get_children()],
    if children is not None:
        result['children'] = children
    return result

from mptt.templatetags.mptt_tags import cache_tree_children
root_nodes = cache_tree_children(root.get_descendants())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(root_nodes[0]))
    jsonTree = json.dumps(dicts, indent=4)

Solution 5 (make use of select_related to pre_fetch, whereas not sure if correctly used)

def serializable_object_select_related(node):
    "Recurse into tree to build a serializable object, make use of select_related"
    obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(), 'id': node.pk, 'level': node.level, 'position': node.position, 'children': []}
    for child in node.get_children().select_related():
        obj['children'].append(serializable_object(child))
    return obj

Solution 6 (improved solution 4, using caching of child nodes):

def recursive_node_to_dict(node):
    result = {
        'name': node.name, ''id': node.pk,
         #notice the use of node._cached_children instead of node.get_children()
        'children' : [recursive_node_to_dict(c) for c in node._cached_children]
    }

Called via:

from mptt.templatetags.mptt_tags import cache_tree_children
subTrees = cache_tree_children(root.get_descendants(include_self=True))
subTreeDicts = []
for subTree in subTrees:
    subTree = recursive_node_to_dict(subTree)
    subTreeDicts.append(subTree)
jsonTree = json.dumps(subTreeDicts, indent=4)
#optional clean up, remove the [ ] at the beginning and the end, its needed for D3.js
jsonTree = jsonTree[1:len(jsonTree)]
jsonTree = jsonTree[:len(jsonTree)-1]

Below you can see the profiling results, created using cProfile as suggested by MuMind, setting up a Django view to start the stand-alone method profileJSON(), which in turn calls the different solutions to create the JSON output.

def startProfileJSON(request):
    print "startProfileJSON"
    import cProfile
    cProfile.runctx('profileJSON()', globals=globals(), locals=locals())
    print "endProfileJSON"

Results:

Solution 1: 3350347 function calls (3130372 primitive calls) in 4.969 seconds (details)

Solution 2: 2533705 function calls (2354516 primitive calls) in 3.630 seconds (details)

Solution 3: 2533621 function calls (2354441 primitive calls) in 3.684 seconds (details)

Solution 4: 2812725 function calls (2466028 primitive calls) in 3.840 seconds (details)

Solution 5: 2536504 function calls (2357256 primitive calls) in 3.779 seconds (details)

Solution 6 (Improved solution 4): 2593122 function calls (2299165 primitive calls) in 3.663 seconds (details)

Discussion:

Solution 1: own encoding implementation. bad idea

Solution 2 + 3: currently the fastest, but still painfully slow

Solution 4: looks promising with caching childs, but does perform similar and currently produces not valid json as childrens are put into double []:

"children": [[]] instead of "children": []

Solution 5: use of select_related does not make a difference, whereas probably used in the wrong way, as a node always have a ForeignKey to its parent, and we are parsing from root to child.

Update: Solution 6: It looks like the cleanest solution to me, using caching of child nodes. But does only perform similar to solution 2 + 3. Which for me is strange.

Anybody more ideas for performance improvements?

4条回答
放荡不羁爱自由
2楼-- · 2019-03-08 09:00

Your updated version looks like there would be very little overhead. I think it would be slightly more efficient (and more readable, too!) to use a list comprehension:

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'children': [serializable_object(ch) for ch in node.get_children()]
    }
    return obj

Besides that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes and then run it with

python -m profile serialize_benchmark.py

(or -m cProfile if that works better).

The can see 3 different potential bottlenecks:

  • DB access (.get_children() and .name) -- I'm not sure exactly what's going on under the hood, but I've had code like this that does a DB query for each node adding a tremendous overhead. If that's your problem, you can probably configure this to do an "eager load" using select_related or something similar.
  • function call overhead (e.g. serializable_object itself) -- Just make sure ncalls for serializable_object looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300.
  • serializing at the end (json.dumps(nodeInstance)) -- Not a likely culprit since you said it's only 300 nodes, but if you do see this taking up a lot of execution time, make sure you have the compiled speedups for JSON working properly.

If you can't tell much from profiling it, make a stripped-down version that, say, recursively calls node.name and node.get_children() but doesn't store the results in a data structure, and see how that compares.


Update: There are 2192 calls to execute_sql in solution 3 and 2192 in solution 5, so I think that excessive DB queries is a problem and that select_related did nothing the way it's used above. Looking at django-mptt issue #88: Allow select_related in model methods suggests that you're using it more-or-less right, but I have my doubt, and get_children vs. get_descendants might make a huge difference.

There's also a ton of time being taken up by copy.deepcopy, which is puzzling because you're not directly calling it, and I don't see it called from the MPTT code. What's tree.py?

If you're doing a lot of work with profiling, I'd highly recommend the really slick tool RunSnakeRun, which lets you see your profile data in a really convenient grid form and make sense of the data more quickly.

Anyway, here's one more attempt at streamlining the DB side of things:

import weakref
obj_cache = weakref.WeakValueDictionary()

def serializable_object(node):
    root_obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
            'id': node.pk, 'level': node.level, 'position': node.position,
            'children': []}
    obj_cache[node.pk] = root_obj
    # don't know if the following .select_related() does anything...
    for descendant in node.get_descendants().select_related():
        # get_descendants supposedly traverses in "tree order", which I think
        # means the parent obj will always be created already
        parent_obj = obj_cache[descendant.parent.pk]    # hope parent is cached
        descendant_obj = {'name': descendant.get_wbs_code(),
            'wbsCode': descendant.get_wbs_code(), 'id': descendant.pk,
            'level': descendant.level, 'position': descendant.position,
            'children': []}
        parent_obj['children'].append(descendant_obj)
        obj_cache[descendant.pk] = descendant_obj
    return root_obj

Note this is no longer recursive. It proceeds iteratively through nodes, theoretically after their parents have been visited, and it's all using one big call to MPTTModel.get_descendants(), so hopefully that's well-optimized and caches .parent, etc. (or maybe there's a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values to their parents afterwards.

查看更多
够拽才男人
3楼-- · 2019-03-08 09:07

After playing around with this a bit, I found that the solutions were all too slow, because mptt itself is scanning the cache multiple times to get_children.

Capitalizing on the fact that mptt returns rows in the correct order to easily build trees, I made this:

def flat_tree_to_dict(nodes, max_depth):
    tree = []
    last_levels = [None] * max_depth
    for n in nodes:
        d = {'name': n.name}
        if n.level == 0:
            tree.append(d)
        else:
            parent_dict = last_levels[n.level - 1]
            if 'children' not in parent_dict:
                parent_dict['children'] = []
            parent_dict['children'].append(d)
        last_levels[n.level] = d
    return tree

For my data set, this runs 10x faster than other solutions because it's O(n), only iterating the data once.

I use it like this:

json.dumps(flat_tree_to_dict(Model.objects.all(), 4), indent=4)
查看更多
Emotional °昔
4楼-- · 2019-03-08 09:20

Organise you data into nested dictionaries or lists then call the json dump method:

import json   
data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
json.dump(data)
查看更多
姐就是有狂的资本
5楼-- · 2019-03-08 09:22

I suspect by far the biggest slowdown is that this will do 1 database query per node. The json rendering is trivial in comparison to the hundreds of round-trips to your database.

You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.

import json
from mptt.templatetags.mptt_tags import cache_tree_children

def recursive_node_to_dict(node):
    result = {
        'id': node.pk,
        'name': node.name,
    }
    children = [recursive_node_to_dict(c) for c in node.get_children()]
    if children:
        result['children'] = children
    return result

root_nodes = cache_tree_children(Node.objects.all())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(n))

print json.dumps(dicts, indent=4)

Custom json encoding, while it might provide a slight speedup in some scenarios, is something I'd highly discourage, as it will be a lot of code, and it's something that's easy to get very wrong.

查看更多
登录 后发表回答