This is what I am trying to do in AWK. My main problem is with step 2. I have shown a sample dataset below, but the original dataset has 100 fields and 2000 records.
Algorithm
1) Initialize accuracy = 0.
2) For each record r, find the closest other record, o, in the dataset using the distance formula.
For example, to find the nearest neighbour of r0, I need to compare r0 with each of r1 to r9, compute square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ... + square(abs(r0.c5 - r1.c5)), and store those distances.
3) Take the record with the minimum distance and compare its c6 value with that of r. If the c6 values are equal, increment accuracy by 1.
Repeat this process for all the records.
4) Finally, get the 1-NN accuracy percentage as (accuracy / total_records) * 100.
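To make the distance formula in step 2 concrete, here is a minimal sketch of it as a shell function around awk. The name sqdist is hypothetical, and it assumes that each record is a whitespace-separated list of numbers whose last field is the class (c6 in the sample), so that field is left out of the sum.

```sh
# sqdist A B: squared distance between two whitespace-separated records.
# Hypothetical helper; assumes the last field is the class label and
# therefore excludes it from the sum.
sqdist() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " ")          # x[1..n] = fields of the first record
        split(b, y, " ")              # y[1..n] = fields of the second record
        d = 0
        for (j = 1; j < n; j++)       # fields 1 .. n-1 are the features
            d += (x[j] - y[j])^2      # squaring already removes the sign
        printf "%.4f\n", d
    }'
}

# r0 versus r1 from the sample dataset
sqdist "0.19 0.33 0.02 0.90 0.12 0.17" "0.34 0.47 0.29 0.32 0.20 1.00"
# prints 0.4578
```

Since squaring discards the sign, the abs() in the formula is optional. A square root is not needed either, because only the ordering of the distances matters when picking the minimum.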
Sample Dataset
(Illustration of the idea: if, say, row1 and row7 are nearest neighbours and have the same value in c6, e.g. 0.3, then ++accuracy.)
   c1   c2   c3   c4   c5   c6   --> Columns
r0 0.19 0.33 0.02 0.90 0.12 0.17
r1 0.34 0.47 0.29 0.32 0.20 1.00
r2 0.37 0.72 0.34 0.60 0.29 0.15
r3 0.43 0.39 0.40 0.39 0.32 0.27
r4 0.27 0.41 0.08 0.19 0.10 0.18
r5 0.48 0.27 0.68 0.23 0.41 0.25
r6 0.52 0.68 0.40 0.75 0.75 0.35
r7 0.55 0.59 0.61 0.56 0.74 0.76
r8 0.04 0.14 0.03 0.24 0.27 0.37
r9 0.39 0.07 0.07 0.08 0.08 0.89
Code
# Usage (file names are placeholders): awk -f 1nn.awk data.txt data.txt
# The data file is named twice because it is read in two passes.
# Assumes the file holds only the values (no header row, no r0..r9 labels)
# and that the last field is the class (c6 in the sample).
BEGIN {
    # initialize accuracy and total_records
    accuracy = 0
    total_records = 0
}
NR == FNR {   # first pass: store every field of every record
    for (i = 1; i <= NF; i++)
        records[FNR, i] = $i
    total_records = FNR
    next
}
{   # second pass: compare the current record with every stored record
    best = 0
    for (i = 1; i <= total_records; i++) {
        if (i == FNR)
            continue                  # do not compare a record with itself
        dist = 0
        for (j = 1; j < NF; j++)      # sum of squared differences, class field excluded
            dist += (records[i, j] - $j)^2
        if (best == 0 || dist < min) {
            min = dist                # smallest distance so far
            best = i                  # record that produced it
        }
    }
    # compare the class of the nearest record with the class of this record
    if (best && records[best, NF] == $NF)
        ++accuracy
}
END {
    if (total_records > 0) {
        percentage = 100 * (accuracy / total_records)
        print percentage
    }
}
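The same algorithm can also be written so that the file is read only once. The sketch below stores every record in an array and does all the comparisons in the END block. The function name one_nn_accuracy is hypothetical, and the sketch assumes whitespace-separated numeric fields, the class in the last field, and no header row or row labels in the file. Ties are broken in favour of the first record with the minimum distance.

```sh
# one_nn_accuracy FILE: leave-one-out 1-NN accuracy (in percent) for FILE.
# Hypothetical helper; assumes the class label is the last field.
one_nn_accuracy() {
    awk '
    NF {                                  # skip blank lines
        n++
        nf = NF
        for (j = 1; j <= NF; j++)
            rec[n, j] = $j                # rec[record number, field number]
    }
    END {
        for (r = 1; r <= n; r++) {
            best = 0
            for (o = 1; o <= n; o++) {
                if (o == r)
                    continue              # a record is not its own neighbour
                d = 0
                for (j = 1; j < nf; j++)  # last field is the class, not a feature
                    d += (rec[r, j] - rec[o, j])^2
                if (best == 0 || d < min) {
                    min = d
                    best = o
                }
            }
            if (best && rec[best, nf] == rec[r, nf])
                accuracy++
        }
        if (n)
            printf "%g\n", 100 * accuracy / n
    }' "$1"
}
```

Called as one_nn_accuracy data.txt (a placeholder file name), this prints the percentage. On the ten sample rows it prints 0, because no two rows share a c6 value. The work grows with the square of the number of records, so for 2000 records of 100 fields the inner loop runs roughly 4 x 10^8 times.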