Clustering string data with ELKI

2019-07-18 02:33发布

问题:

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I

(a) load string data in ELKI from a file (only "Labels")?

(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)

Some code snippets or example input files would be helpful.

回答1:

It's actually pretty straightforward:

A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type (probably you can just use String. If you want to be a bit more general TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.

B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.

For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.

A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.