UPDATE 3
Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve Hanov's algorithm. Thanks to all that helped!
/**
* Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
* words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
* distance using a Trie" and Murilo Vasconcelo's revised version in C++.
*
* http://stevehanov.ca/blog/index.php?id=114
* http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
*
* @param ArrayList<Character> word - the characters of an input word as an array representation
* @return int - the minimum Levenshtein Distance
*/
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {
theTrie.minLevDist = Integer.MAX_VALUE;
int iWordLength = word.size();
int[] currentRow = new int[iWordLength + 1];
for (int i = 0; i <= iWordLength; i++) {
currentRow[i] = i;
}
for (int i = 0; i < iWordLength; i++) {
traverseTrie(theTrie.root, word.get(i), word, currentRow);
}
return theTrie.minLevDist;
}
/**
* Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
*
* @param TrieNode node - the current TrieNode
* @param char letter - the current character of the current word we're working with
* @param ArrayList<Character> word - an array representation of the current word
* @param int[] previousRow - a row in the Levenshtein Distance matrix
*/
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
int size = previousRow.length;
int[] currentRow = new int[size];
currentRow[0] = previousRow[0] + 1;
int minimumElement = currentRow[0];
int insertCost, deleteCost, replaceCost;
for (int i = 1; i < size; i++) {
insertCost = currentRow[i - 1] + 1;
deleteCost = previousRow[i] + 1;
if (word.get(i - 1) == letter) {
replaceCost = previousRow[i - 1];
} else {
replaceCost = previousRow[i - 1] + 1;
}
currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
if (currentRow[i] < minimumElement) {
minimumElement = currentRow[i];
}
}
if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
theTrie.minLevDist = currentRow[size - 1];
}
if (minimumElement < theTrie.minLevDist) {
for (Character c : node.children.keySet()) {
traverseTrie(node.children.get(c), c, word, currentRow);
}
}
}
UPDATE 2
Finally, I've managed to get this to work for most of my test cases. My implementation is practically a direct translation from Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor this algorithm and/or make optimizations? Below is the code...
public int search(String word) {
theTrie.minLevDist = Integer.MAX_VALUE;
int size = word.length();
int[] currentRow = new int[size + 1];
for (int i = 0; i <= size; i++) {
currentRow[i] = i;
}
for (int i = 0; i < size; i++) {
char c = word.charAt(i);
if (theTrie.root.children.containsKey(c)) {
searchRec(theTrie.root.children.get(c), c, word, currentRow);
}
}
return theTrie.minLevDist;
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
int size = previousRow.length;
int[] currentRow = new int[size];
currentRow[0] = previousRow[0] + 1;
int insertCost, deleteCost, replaceCost;
for (int i = 1; i < size; i++) {
insertCost = currentRow[i - 1] + 1;
deleteCost = previousRow[i] + 1;
if (word.charAt(i - 1) == letter) {
replaceCost = previousRow[i - 1];
} else {
replaceCost = previousRow[i - 1] + 1;
}
currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
}
if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
theTrie.minLevDist = currentRow[size - 1];
}
if (minElement(currentRow) < theTrie.minLevDist) {
for (Character c : node.children.keySet()) {
searchRec(node.children.get(c), c, word, currentRow);
}
}
}
Thank you everyone who contributed to this question. I tried getting the Levenshtein Automata to work, but I couldn't make it happen.
So I'm looking for suggestions on refactoring and/or optimizations regarding the above code. Please let me know if there's any confusion. As always, I can provide the rest of the source code as needed.
UPDATE 1
So I've implemented a simple Trie data structure and I've been trying to follow Steve Hanov's python tutorial to compute the Levenshtein Distance. Actually, I'm interested in computing the minimum Levenshtein Distance between a given word and the words in the Trie, thus I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It's not working very well, but here's my Trie class:
public class Trie {
public TrieNode root;
public int minLevDist;
public Trie() {
this.root = new TrieNode(' ');
}
public void insert(String word) {
int length = word.length();
TrieNode current = this.root;
if (length == 0) {
current.isWord = true;
}
for (int index = 0; index < length; index++) {
char letter = word.charAt(index);
TrieNode child = current.getChild(letter);
if (child != null) {
current = child;
} else {
current.children.put(letter, new TrieNode(letter));
current = current.getChild(letter);
}
if (index == length - 1) {
current.isWord = true;
}
}
}
}
... and the TrieNode class:
public class TrieNode {
public final int ALPHABET = 26;
public char letter;
public boolean isWord;
public Map<Character, TrieNode> children;
public TrieNode(char letter) {
this.isWord = false;
this.letter = letter;
children = new HashMap<Character, TrieNode>(ALPHABET);
}
public TrieNode getChild(char letter) {
if (children != null) {
if (children.containsKey(letter)) {
return children.get(letter);
}
}
return null;
}
}
Now, I've tried to implement the search as Murilo Vasconcelos has it, but something is off and I need some help debugging this. Please give suggestions on how to refactor this and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of things. Anyway, here's the code...
public void search(String word) {
int size = word.length();
int[] currentRow = new int[size + 1];
for (int i = 0; i <= size; i++) {
currentRow[i] = i;
}
for (int i = 0; i < size; i++) {
char c = word.charAt(i);
if (theTrie.root.children.containsKey(c)) {
searchRec(theTrie.root.children.get(c), c, word, currentRow);
}
}
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
int size = previousRow.length;
int[] currentRow = new int[size];
currentRow[0] = previousRow[0] + 1;
int replace, insertCost, deleteCost;
for (int i = 1; i < size; i++) {
char c = word.charAt(i - 1);
insertCost = currentRow[i - 1] + 1;
deleteCost = previousRow[i] + 1;
replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);
currentRow[i] = minimum(insertCost, deleteCost, replace);
}
if (currentRow[size - 1] < minCost && !node.isWord) {
minCost = currentRow[size - 1];
}
Integer minElement = minElement(currentRow);
if (minElement < minCost) {
for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
searchRec(node, entry.getKey(), word, currentRow);
}
}
}
I apologize for the lack of comments. So what am I doing wrong?
INITIAL POST
I've been reading an article, Fast and Easy Levenshtein distance using a Trie, in hopes of figuring out an efficient way to compute the Levenshtein Distance between two Strings. My main goal with this is, given a large set of words, to be able to find the minimal Levenshtein Distance between an input word(s) and this set of words.
In my trivial implementation, I compute the Levenshtein Distance between an input word and the set of words, for each input word, and return the minimum. It works, but it is not efficient...
I've been looking for implementations of a Trie, in Java, and I've come across two seemingly good sources:
However, these implementations seem too complicated for what I'm trying to do. As I've been reading through them to understand how they work and how Trie data structures work in general, I've only become more confused.
So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?
Once that is implemented, the next task is to compute the Levenshtein Distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of Heap memory once I hit the recursive searching. So how would I compute the Levenshtein Distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.
It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.
Thanks.
p.s. I can provide any source code if need be. Also, I've already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but its not as efficient as I think it can be... or maybe my implementation is wrong.
Correct me if I am wrong but I believe your update3 has an extra loop which is unnecesary and makes the program much slower:
You ought to call traverseTrie only once because within traverseTrie you are already looping over the whole word. The code should be only as follows:
The function walk takes a testitem (for example a indexable string, or an array of characters) and a trie. A trie can be an object with two slots. One specifying the node of the trie, the other the children of that node. The children are tries as well. In python it would be something like:
Or in Lisp...
Now a trie looks something like this:
Now the internal function (which you also can write separately) takes the testitem, the children of the root node of the tree (of which the node value is None or whatever), and an initial distance set to 0.
Then we just recursively traverse both branches of the tree, starting left and then right.
Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state-machine restricted to the form of a tree. You can enhance it by not making that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture stuff like "nation", "national", "nationalize", "nationalization", ...
Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.
Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much levenshtein distance you can tolerate. For distance zero, it is simply O(n) where n is the word length. For arbitrary distance, it is O(N) where N is the number of words in the dictionary.
Here is an example of Levenshtein Automata in Java.These will probably also be helpful:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/ http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/
It looks like the experimental Lucene code is based off of the dk.brics.automaton package.
Usage appears to be something similar to below:
In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a Trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does:
Steve Hanov's algorithm and its aforementioned derivatives obviously use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. Pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever. So, it should be run even faster than the aforementioned algorithms.
If you (or anybody else) is interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :) .