ELKI: Running DBSCAN on custom Objects in Java

Posted 2019-06-22 02:07

Question:

I'm trying to use ELKI from within Java to run DBSCAN. For testing I used a FileBasedDatabaseConnection. Now I would like to run DBSCAN with my custom objects as input.

My objects have the following structure:

public class MyObject {
  private Long id;
  private Float param1;
  private Float param2;
  // ... and more parameters as well as getters and setters
}

I'd like to run DBSCAN within ELKI using a List<MyObject> as database, but only some of the parameters should be taken into account (e.g. running DBSCAN on the objects using the parameters param1, param2 and param4). Ideally the resulting clusters contain the whole objects.

Is there any way to achieve this behaviour?

If not, how can I convert the objects into a format that ELKI understands and allows me to match the resulting cluster-objects with my custom objects (i.e. is there an easy way to programmatically set a label)?

The following question mentions feature vectors: Using ELKI on custom objects and making sense of results
Could this be a possible solution to my problem? And how would a feature vector be created from my List<MyObject>?

Answer 1:

ELKI has a modular architecture.

If you want your own data source, look at the datasource package, and implement the DatabaseConnection (JavaDoc) interface.

If you want to process MyObject objects directly, that is not particularly hard, although the class you shared above will likely carry a substantial performance cost. You need a SimpleTypeInformation<MyObject> (JavaDoc) to identify your data type, and you need to implement a PrimitiveDistanceFunction (JavaDoc) for your data type.

If your actual data are floats, I suggest using DoubleVector or FloatVector instead, and using e.g. SubspaceEuclideanDistanceFunction to handle only those attributes you want to use.
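
As a rough illustration of the vector approach, the sketch below uses only plain Java (no ELKI classes) to turn a List<MyObject> into a double[][] whose row index equals the list position, and to compute a Euclidean distance restricted to selected dimensions, which is the idea behind a subspace distance. The reduced MyObject class, its param3 field, and the FeatureVectors helper are hypothetical names introduced for this example. A double[][] of this shape should be usable with ELKI's ArrayAdapterDatabaseConnection, but check the JavaDoc of your ELKI version for the exact constructor.

```java
import java.util.List;

// Hypothetical, reduced stand-in for the MyObject class from the question.
class MyObject {
  final Long id;
  final Float param1;
  final Float param2;
  final Float param3;

  MyObject(Long id, Float param1, Float param2, Float param3) {
    this.id = id;
    this.param1 = param1;
    this.param2 = param2;
    this.param3 = param3;
  }
}

// Hypothetical helper class, not part of ELKI.
class FeatureVectors {
  // Row i holds the numeric attributes of objects.get(i),
  // so a row index can later be used as a list position.
  static double[][] toMatrix(List<MyObject> objects) {
    double[][] data = new double[objects.size()][];
    for (int i = 0; i < objects.size(); i++) {
      MyObject o = objects.get(i);
      data[i] = new double[] {
        o.param1.doubleValue(), o.param2.doubleValue(), o.param3.doubleValue()
      };
    }
    return data;
  }

  // Euclidean distance that only looks at the dimensions listed in dims.
  static double subspaceEuclidean(double[] a, double[] b, int[] dims) {
    double sum = 0.0;
    for (int d : dims) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

Keeping all attributes in the vector and selecting dimensions in the distance, rather than dropping columns up front, means the same data can be clustered again later on a different attribute subset.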

For these data types and many distance functions, R*-tree indexes can be used to substantially speed up DBSCAN.

A Cluster (JavaDoc) in ELKI never stores the point data. It only stores point DBIDs (Wiki). You can get the point data from the Database relation, or use e.g. offsets (Wiki) to map them back to a list position for static databases.
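
The mapping back to list positions can be sketched in plain Java as follows. The sketch assumes that each cluster's DBIDs have already been converted into integer offsets (in ELKI this is typically done through the DBID range of the relation, which I believe offers a getOffset method, but verify this against your version) and that the vectors were loaded in the same order as the source list. ClusterLookup and resolve are hypothetical names, not ELKI API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of ELKI.
class ClusterLookup {
  // clusterOffsets holds one int[] per cluster; each entry is the position
  // of a member in the source list that was used to build the database.
  static <T> List<List<T>> resolve(List<T> source, List<int[]> clusterOffsets) {
    List<List<T>> clusters = new ArrayList<>();
    for (int[] offsets : clusterOffsets) {
      List<T> members = new ArrayList<>(offsets.length);
      for (int offset : offsets) {
        members.add(source.get(offset));
      }
      clusters.add(members);
    }
    return clusters;
  }
}
```

Called with the original List<MyObject>, this yields clusters that contain the whole objects, which is what the question asks for. It is only valid for a static database whose contents and order have not changed since loading.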