A discrepancy in computing nearest neighbours betw

2019-07-11 05:00发布

问题:

I am in the process of debugging a library and another implementation which involves computing k-nearest neighbours. I am framing the question with an example which I am having difficulty to understand.

First I will explain demonstrate the thing with a toy example, then show the output which will lead to the question.

Task

The demo here reads a csv file having 10 number of 2-dimensional datapoints. The task is to find the distance of all the datapoints from the first datapoint, and list all the points and the distances from the first datapoint in non-decreasing order.

Basically, this is a component of a kNN based algorithm, and I find a discrepancy when I execute a Java version (component of a library) and when I write it in R. To demonstrate the discrepancy, consider the following codes.

Code 1: Java + WEKA

The following code uses Java and WEKA. I have used LinearNNSearch to compute the nearest neighbours. The cause of using this is because the LinearNNSearch is used in the specific library which I am debugging and/or comparing with the R code.

import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.core.DistanceFunction;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;
import java.io.File;

class testnn
{
  public static void main (String args[]) throws Exception
  {
    // Load csv
    CSVLoader loader = new CSVLoader ();
    loader.setSource (new File (args[0]));

    Instances df = loader.getDataSet ();

    // Set the LinearNNSearch object
    EuclideanDistance dist_obj = new EuclideanDistance ();

    LinearNNSearch lnn = new LinearNNSearch ();
    lnn.setDistanceFunction(dist_obj);
    lnn.setInstances(df);
    lnn.setMeasurePerformance(false);

    // Compute the K-nearest neighbours of the first datapoint (index 0).
    Instances knn_pts = lnn.kNearestNeighbours (df.instance (0), df.numInstances ());

    // Get the distances.
    double [] dist_arr = lnn.getDistances ();

    // Print
    System.out.println ("Points sorted in increasing order from ");
    System.out.println (df.instance (0));
    System.out.println ("V1,\t" + "V2,\t" + "dist");
    for (int j = 0; j < knn_pts.numInstances (); j++)
    {
      System.out.println (knn_pts.instance (j) + "," + dist_arr[j]);
    }
  }
}

Code 2: R

To compute the distances I have used dist. Using daisy also gets the identical answer.

// Read file
df <- read.csv ("dat.csv", header = TRUE);

// All to all distances, and select distances of points from  first datapoint (index 1)
dist_mat <- as.matrix (dist (df, diag=TRUE, upper=TRUE, method="euclidean"));
first_pt_to_all <- dist_mat[,1];

// Sort the datapoints and also record the ordering
sorted_order <- sort (first_pt_to_all, index.return = TRUE, decreasing = FALSE);

// Prepare dataset with the datapoints ordered in the non-decreasing order of the distance from the first datapoint
df_sorted <- cbind (df[sorted_order$ix[-1],], dist = sorted_order$x[-1]);

// Print
print ("Points sorted in increasing order from ");
print (df[1,]);

print (df_sorted);

Outputs

For easier comparison I am placing the two outputs side by side. Both of the tables display the points in a non-decreasing order.

  • The left hand side table is generated by R, with the leftmost column in the R output indicates the original datapoint index.
  • The right hand side table is generated by Java + WEKA.
     R                                              Java + WEKA
[1] "Points sorted in increasing order from "   Points sorted in increasing order from 
        V1       V2
1 0.560954 0.313231                      0.560954,0.313231
         V1        V2      dist              V1,        V2,     dist
5  0.866816  0.476897 0.3468979          0.866816,0.476897,0.3280721928065624
10 0.262637  0.554558 0.3837079          0.262637,0.554558,0.37871658916675316
4  1.038752  0.396173 0.4849436          1.038752,0.396173,0.43517244797543775
2  0.330345 -0.137681 0.5064604          1.053889,0.486349,0.4795184359817083
7  1.053889  0.486349 0.5224507          1.113799,0.42203,0.506782009966262
6  1.113799  0.422030 0.5634490          0.330345,-0.137681,0.5448256434359463
8  0.416051 -0.338858 0.6679947          0.416051,-0.338858,0.7411841020052856
3  0.870481 -0.302856 0.6894709          0.870481,-0.302856,0.7425541767563134
9  1.386459  0.425101 0.8330507          1.386459,0.425101,0.7451474897289354

Problem

the distances are clearly different, and some of the datapoint ordering are also different.

Visualise

I have plotted the 10 points and numbered them according to their sorted order, indicated by the numerals in the plot.

  • The black text indicates the points plotted from the sorted dataset generated by R
  • The red text indicates the points plotted from the sorted dataset generated by Java + WEKA

Therefore the 4, 5 and 6 differ. If two datapoints were equidistant, then this would have explained the different ordering, but there is no two points which are equidistant from the first datapoint.

Dataset

"V1", "V2"
0.560954,0.313231
0.330345,-0.137681
0.870481,-0.302856
1.038752,0.396173
0.866816,0.476897
1.113799,0.42203
1.053889,0.486349
0.416051,-0.338858
1.386459,0.425101
0.262637,0.554558

Question

  • Why the distances in the dist columns are different, which leads to a different ordering of the nearest neighbour points?
  • Is there any mistake you can find out in the code, or the way I am using the libraries? Am I using these (especially WEKA) correctly?

Comment if something is unclear or for more information.

回答1:

As noted in the comments, the R distances are correct. The problem is WEKA defaults. You used:

EuclideanDistance dist_obj = new EuclideanDistance ();

Euclidean distance in WEKA has parameters with defaults. One of them is DontNormalize=FALSE, i.e. by default, WEKA normalizes the data before computing the distance. I am not much help in java so I will do this in R. If you scale the data so that for each variable the minimum is zero and the maximum is one, you will get the distance measures provided by WEKA.

NData = Data
NData[,1] = (NData[,1]-min(NData[,1]))/(max(NData[,1])-min(NData[,1]))
NData[,2] = (NData[,2]-min(NData[,2]))/(max(NData[,2])-min(NData[,2]))
dist(NData)

These distances match what you show for WEKA. To get the same result as R, look into the parameters for EuclideanDistance in WEKA.



标签: java r weka knn