I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) write a Parser
that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser
, producing a relation of the desired data type (probably you can just use String
. If you want to be a bit more general TokenSequence
may be a more appropriate concept for these distances. Strings are just the simplest case.
B) implement a DistanceFunction
based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>
. Again, subclassing AbstractPrimitiveDistanceFunction
may be the easiest thing to do.
For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.
A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.