How do I calculate tree edit distance?

2019-01-16 07:59发布

问题:

I need to calculate the edit distance between trees for a personal project of mine. This paper describes an algorithm, but I can't make heads or tails out of it. Are you aware of any resources that describe an applicable algorithm in a more approachable way? Pseudocode or code would be helpful, too.

回答1:

Here's some java source code (gzipped tarball at the bottom) for a Tree Edit Distance algorithm that might be useful to you.

The page includes references and some slides that go through the "Zhang and Shasha" algorithm step-by-step and other useful links to get you up to speed.

Edit: While this answer was accepted because it pointed to the Zhang-Shasha algorithm, the code in the link has bugs. Steve Johnson and tim.tadh have provided working Python code. See Steve Johnson's comment for more details.



回答2:

(Edited this answer to link to current version of the implementation given by tim.tadh)

This Python library implements the Zhang-Shasha algorithm correctly: https://github.com/timtadh/zhang-shasha

It began as a direct port of the Java source listed in the currently accepted answer (the one with the tarball link), but that implementation is not correct and is nearly impossible to run at all.



回答3:

I wrote an implementation (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram) for those of you who wish to use tree edit distance approximation using PQ-Gram algorithm in the browser and/or in Node.js.

The jqgram tree edit distance approximation module implements the PQ-Gram algorithm for both server-side and browser-side applications; O(n log n) time and O(n) space performant where n is the number of nodes. See the academic paper from which the algorithm comes: (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf)

The PQ-Gram approximation is much faster than obtaining the true edit distance via Zhang & Shasha, Klein, or Guha et. al, whom provide true edit distance algorithms that all perform minimum O(n^2) time and are therefore often unsuitable.

Often in real-world applications it is not necessary to know the true edit distance if a relative approximation of multiple trees to a known standard can be obtained. Javascript, in the browser and now on the server with the advent of Node.js deal frequently with tree structures and end-user performance is usually critical in algorithm implementation and design; thus jqgram.

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3, depth:10 },
function(result) {
    console.log(result.distance);
});

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, its Javascript, so use it client or server-side!

Now one approach you can use is to use jqgram or PyGram to get a few trees that are close and then go on to use a true edit distance algorithm against a smaller set of trees, why spend all the computation on trees you can already easily determine are very distant, or vice versa. So you can use jqgram to narrow down choices too.

Hope the code helps some folks out.



回答4:

Here you find Java implementations of tree edit distance algorithms:

http://www.inf.unibz.it/dis/projects/tree-edit-distance

In addition to Zhang&Shasha's algorithm of 1989, there are also tree edit distance implementations of more recent algorithms including Klein 1998, Demaine et al. 2009, and the Robust Tree Edit Distance (RTED) algorithm by Pawlik&Augsten, 2011.



回答5:

I made a simple python wrapper (apted.py) for the APTED algorithm using jpype if anyone is interested:

# To use, create a folder named lib next to apted.py, then put APTED.jar into it

import os, os.path, jpype 

global distancePackage
distancePackage = None

global utilPackage
utilPackage = None

def StartJVM():
  # from http://www.gossamer-threads.com/lists/python/python/379020
  root = os.path.abspath(os.path.dirname(__file__)) 
  jpype.startJVM(jpype.getDefaultJVMPath(), 
  "-Djava.ext.dirs=%s%slib" % (root, os.sep))
  global distancePackage
  distancePackage = jpype.JPackage("distance")
  global utilPackage
  utilPackage = jpype.JPackage("util")


def StopJVM():
  jpype.shutdownJVM()


class APTED:
  def __init__(self, delCost, insCost, matchCost):
    global distancePackage
    if distancePackage is None:
      raise Exception("Need to call apted.StartJVM() first")
    self.myApted = distancePackage.APTED(float(delCost), float(insCost), float(matchCost))

  def nonNormalizedTreeDist(self, lblTreeA, lblTreeB):
    return self.myApted.nonNormalizedTreeDist(lblTreeA.myLblTree, lblTreeB.myLblTree)


class LblTree:
  def __init__(self, treeString):
    global utilPackage
    if utilPackage is None:
      raise Exception("Need to call apted.StartJVM() first")

    self.myLblTree = utilPackage.LblTree.fromString(treeString)

'''
# Example usage:

import apted
apted.StartJVM()
aptedDist = apted.APTED(delCost=1, insCost=1, matchCost=1)
treeA = apted.LblTree('{a}')
treeB = apted.LblTree('{b{c}}')
dist = aptedDist.nonNormalizedTreeDist(treeA, treeB)
print dist


# When you are done using apted
apted.StopJVM()
# For some reason it doesn't usually let me start it again and crashes python upon exit when I do this so call only as needed
'''


回答6:

There is a journal version of the ICALP2007 paper you refer to at http://www.wisdom.weizmann.ac.il/~oweimann/Publications/TEDinTALG.pdf This version also has a pseudocode.



回答7:

There are many variations of tree edit distance. If you can go with top-down tree edit distance, which limits insertions and deletes to the leaves, I suggest trying the following paper: http://portal.acm.org/citation.cfm?id=671669&dl=GUIDE&coll=GUIDE&CFID=68408627&CFTOKEN=54202965. The implementation is a straightforward dynamic programming matrix with O(n2) cost.



回答8:

I think you're just looking for the shortest path between two verticies. If so, you can use Dijkstra's algorithm. (The wikipedia page has pseudocode on it for you to refer to.)