ELKI - input distance matrix

Posted 2019-08-20 03:35

Question:

I'm trying to use ELKI for outlier detection. I have a custom distance matrix and want to feed it to ELKI to run LOF (as a first experiment).

I tried to follow http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances but it is not very clear to me. Here is what I do:

  • I don't want to load data from a database, so I use:

    -dbc DBIDRangeDatabaseConnection -idgen.count 100
    

    (where 100 is the number of objects I'll be analyzing)

  • I use the LOF algorithm and point it at the external distance file:

    -algorithm outlier.LOF
    -algorithm.distancefunction external.FileBasedDoubleDistanceFunction
    -distance.matrix testData.ascii -lof.k 3
    

My distance file looks like this (kept very simple for testing purposes):

0 0 0  
0 1 1  
0 2 0.2  
0 3 0.1  
1 1 0  
1 2 0.9  
1 3 0.9  
2 2 0  
2 3 0.2  
3 3 0  
4 0 0.23  
4 1 0.97  
4 2 0.15  
4 3 0.07  
4 4 0  
5 0 0.1  
5 1 0.85  
5 2 0.02  
5 3 0.15  
5 4 0.1  
5 5 0  
6 0 1  
6 1 1   
6 2 1  
6 3 1  

etc
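
For illustration, a file in this layout could be produced from an in-memory matrix with a short script. This is only a sketch: it assumes the format is exactly the whitespace-separated "id1 id2 distance" lines shown above, it writes just the upper triangle (i <= j), and the function names are hypothetical. Whether ELKI needs one triangle or every pair is not established here.

```python
def distance_triples(matrix):
    """Return 'i j distance' lines for the upper triangle (i <= j) of a square matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("distance matrix must be square")
    lines = []
    for i in range(n):
        for j in range(i, n):
            # One pair per line, object IDs starting at 0 (assumed from the sample file).
            lines.append(f"{i} {j} {matrix[i][j]}")
    return lines


def write_distance_file(matrix, path):
    """Write the triples to a text file, one pair per line."""
    with open(path, "w") as out:
        out.write("\n".join(distance_triples(matrix)) + "\n")
```

Calling write_distance_file(matrix, "testData.ascii") with a 100 x 100 matrix would give a file matching -idgen.count 100.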

The results say "all in one trivial clustering", although this is not a clustering task and there definitely are outliers in my data.

Am I doing this right? If not, what am I missing?

Answer 1:

When using DBIDRangeDatabaseConnection, and not giving ELKI any actual data, the visualization cannot produce a particularly useful result (because it doesn't have the actual data, after all). Nor can the data be evaluated automatically.

The "all in one trivial clustering" is an artifact from the automatic attempts to visualize the data, but for the reasons discussed above this cannot work. This clustering is automatically added for unlabeled data, to allow some visualizations to work.

There are two things you need to do:

  1. Set an output handler, for example -resulthandler ResultWriter, which will produce output similar to this:

    ID=0 lof-outlier=1.0
    

    Where ID= is the object number, and lof-outlier= is the LOF outlier score.

    Alternatively, you can implement your own output handler. An example is found here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/tutorial/outlier/SimpleScoreDumper.java

  2. Fix DBIDRangeDatabaseConnection. You have been bitten by a bug in ELKI 0.6.0~beta1: DBIDRangeDatabaseConnection does not initialize its parameters correctly in its constructor. The trivial fix is here:

    http://elki.dbs.ifi.lmu.de/changeset/11027/elki

    Alternatively, you can create a dummy input file and use the regular text input. A file containing

    0
    1
    2
    ...
    

    should do the trick. Use -dbc.in numbers100.txt -dbc.filter FixedDBIDsFilter -dbc.startid 0. The last two arguments make your IDs start at 0 instead of the default 1.

    This workaround will produce a slightly different output format:

    ID=0 0.0 lof-outlier=1.0
    

    where the additional column comes from the dummy file. The dummy values do not affect the LOF result when an external distance function is used, but this approach uses some additional memory.
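
Both pieces above can be scripted. The following is a minimal sketch, not part of ELKI: the helper names are hypothetical, and it assumes the result lines look exactly like the two samples shown (ID=..., optionally extra columns, then lof-outlier=...). Lines that do not match are skipped.

```python
import re

# Matches "ID=0 lof-outlier=1.0" as well as "ID=0 0.0 lof-outlier=1.0".
_LINE = re.compile(r"^ID=(\d+)\s.*?lof-outlier=(\S+)")


def dummy_id_lines(count):
    """Lines for the dummy input file: 0, 1, 2, ... one value per line."""
    return [str(i) for i in range(count)]


def parse_lof_scores(lines):
    """Map object ID -> LOF score for every line that matches the assumed format."""
    scores = {}
    for line in lines:
        match = _LINE.match(line.strip())
        if match:
            scores[int(match.group(1))] = float(match.group(2))
    return scores


def rank_outliers(scores):
    """Sort (id, score) pairs by descending score; higher LOF means more outlying."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Writing "\n".join(dummy_id_lines(100)) to numbers100.txt gives the dummy input, and rank_outliers(parse_lof_scores(open(path))) lists the most outlying objects first, where path is whatever file the result handler wrote.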



Tags: outliers elki