I am in the process of debugging a library and another implementation which involves computing k-nearest neighbours. I am framing the question with an example which I am having difficulty to understand.
First I will explain demonstrate the thing with a toy example, then show the output which will lead to the question.
Task
The demo here reads a csv file having 10 number of 2-dimensional datapoints. The task is to find the distance of all the datapoints from the first datapoint, and list all the points and the distances from the first datapoint in non-decreasing order.
Basically, this is a component of a kNN based algorithm, and I find a discrepancy when I execute a Java version (component of a library) and when I write it in R. To demonstrate the discrepancy, consider the following codes.
Code 1: Java + WEKA
The following code uses Java and WEKA. I have used LinearNNSearch to compute the nearest neighbours. The cause of using this is because the LinearNNSearch is used in the specific library which I am debugging and/or comparing with the R code.
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.core.DistanceFunction;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;
import java.io.File;
class testnn
{
public static void main (String args[]) throws Exception
{
// Load csv
CSVLoader loader = new CSVLoader ();
loader.setSource (new File (args[0]));
Instances df = loader.getDataSet ();
// Set the LinearNNSearch object
EuclideanDistance dist_obj = new EuclideanDistance ();
LinearNNSearch lnn = new LinearNNSearch ();
lnn.setDistanceFunction(dist_obj);
lnn.setInstances(df);
lnn.setMeasurePerformance(false);
// Compute the K-nearest neighbours of the first datapoint (index 0).
Instances knn_pts = lnn.kNearestNeighbours (df.instance (0), df.numInstances ());
// Get the distances.
double [] dist_arr = lnn.getDistances ();
// Print
System.out.println ("Points sorted in increasing order from ");
System.out.println (df.instance (0));
System.out.println ("V1,\t" + "V2,\t" + "dist");
for (int j = 0; j < knn_pts.numInstances (); j++)
{
System.out.println (knn_pts.instance (j) + "," + dist_arr[j]);
}
}
}
Code 2: R
To compute the distances I have used dist. Using daisy also gets the identical answer.
// Read file
df <- read.csv ("dat.csv", header = TRUE);
// All to all distances, and select distances of points from first datapoint (index 1)
dist_mat <- as.matrix (dist (df, diag=TRUE, upper=TRUE, method="euclidean"));
first_pt_to_all <- dist_mat[,1];
// Sort the datapoints and also record the ordering
sorted_order <- sort (first_pt_to_all, index.return = TRUE, decreasing = FALSE);
// Prepare dataset with the datapoints ordered in the non-decreasing order of the distance from the first datapoint
df_sorted <- cbind (df[sorted_order$ix[-1],], dist = sorted_order$x[-1]);
// Print
print ("Points sorted in increasing order from ");
print (df[1,]);
print (df_sorted);
Outputs
For easier comparison I am placing the two outputs side by side. Both of the tables display the points in a non-decreasing order.
- The left hand side table is generated by R, with the leftmost column in the R output indicates the original datapoint index.
- The right hand side table is generated by Java + WEKA.
R Java + WEKA [1] "Points sorted in increasing order from " Points sorted in increasing order from V1 V2 1 0.560954 0.313231 0.560954,0.313231 V1 V2 dist V1, V2, dist 5 0.866816 0.476897 0.3468979 0.866816,0.476897,0.3280721928065624 10 0.262637 0.554558 0.3837079 0.262637,0.554558,0.37871658916675316 4 1.038752 0.396173 0.4849436 1.038752,0.396173,0.43517244797543775 2 0.330345 -0.137681 0.5064604 1.053889,0.486349,0.4795184359817083 7 1.053889 0.486349 0.5224507 1.113799,0.42203,0.506782009966262 6 1.113799 0.422030 0.5634490 0.330345,-0.137681,0.5448256434359463 8 0.416051 -0.338858 0.6679947 0.416051,-0.338858,0.7411841020052856 3 0.870481 -0.302856 0.6894709 0.870481,-0.302856,0.7425541767563134 9 1.386459 0.425101 0.8330507 1.386459,0.425101,0.7451474897289354
Problem
the distances are clearly different, and some of the datapoint ordering are also different.
Visualise
I have plotted the 10 points and numbered them according to their sorted order, indicated by the numerals in the plot.
- The black text indicates the points plotted from the sorted dataset generated by R
- The red text indicates the points plotted from the sorted dataset generated by Java + WEKA
Therefore the 4, 5 and 6 differ. If two datapoints were equidistant, then this would have explained the different ordering, but there is no two points which are equidistant from the first datapoint.
Dataset
"V1", "V2" 0.560954,0.313231 0.330345,-0.137681 0.870481,-0.302856 1.038752,0.396173 0.866816,0.476897 1.113799,0.42203 1.053889,0.486349 0.416051,-0.338858 1.386459,0.425101 0.262637,0.554558
Question
- Why the distances in the dist columns are different, which leads to a different ordering of the nearest neighbour points?
- Is there any mistake you can find out in the code, or the way I am using the libraries? Am I using these (especially WEKA) correctly?
Comment if something is unclear or for more information.