One nearest neighbour using awk

Posted 2020-04-11 11:01

This is what I am trying to do in AWK. My main problem is step 2. I have shown a sample dataset, but the original dataset has 100 fields and 2000 records.

Algorithm

1) initialize accuracy = 0

2) for each record r

     Find the closest other record, o, in the dataset using the distance formula.

To find the nearest neighbour of r0, I need to compare r0 with each of r1 to r9, compute square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ... + square(abs(r0.c5 - r1.c5)) for each pair, and store those distances.

3) Take the record with the minimum distance and compare its c6 value with that of r. If the c6 values are equal, increment accuracy by 1.

Repeat the process for all the records.

4) Finally, get the 1-NN accuracy percentage as (accuracy / total_records) * 100.

Sample Dataset

        c1   c2   c3   c4   c5   c6  --> Columns
  r0  0.19 0.33 0.02 0.90 0.12 0.17  --> row1 & row7 nearest neighbour in c1
  r1  0.34 0.47 0.29 0.32 0.20 1.00      and same values in c6(0.3) so ++accuracy
  r2  0.37 0.72 0.34 0.60 0.29 0.15 
  r3  0.43 0.39 0.40 0.39 0.32 0.27 
  r4  0.27 0.41 0.08 0.19 0.10 0.18 
  r5  0.48 0.27 0.68 0.23 0.41 0.25 
  r6  0.52 0.68 0.40 0.75 0.75 0.35 
  r7  0.55 0.59 0.61 0.56 0.74 0.76 
  r8  0.04 0.14 0.03 0.24 0.27 0.37 
  r9  0.39 0.07 0.07 0.08 0.08 0.89

Code

BEGIN   {
            #initialize accuracy and total_records
            accuracy = 0;
            total_records = 10;
        }


NR==FNR {    # Loop through each record and store it in an array
                for (i=1; i<=NF; i++) 
                {
                     records[i]=$i;
                }
            next             
        }

        {   # Re-Loop through the file and compare each record from the array with each record in a file    
              for(i=1; i <= length(records); i++)
              {
                   for (j=1; j<=NF; j++) 
                   {      # here I need to get the difference of each field of the record[i] with each all the records, square them and sum it up. 
                          distance[j] = (records[i] - $j)^2;
                   }
               #Once I have all the distance, I can simply compare the values of field_6 for the record with least distance.
              if(min(distance[j]))
              {
                  if(records[$6] == $6)
                  {
                        ++accuracy;
                  } 
              }
       }
END{
     percentage = 100 * (accuracy/total_records); 
     print percentage;
}

1 answer
Luminary・发光体
#2 · 2020-04-11 11:56

Here is one approach

$ cat -n file > nfile
$ join nfile{,} -j99 | 
  awk 'function abs(x) {return x>0?x:-x}  
           $1<$8 {minc=999;for(i=2;i<7;i++) 
                 {d=abs($i-$(i+7)); 
                  if(d<minc)minc=d} 
                  print $1,minc,$7==$14}' | 
  sort -u -k1,2 -k3r | 
  awk '!a[$1]++{sum+=$3} END{print sum}'

7

Due to symmetry, you only need to compare n*(n-1)/2 pairs of records. It is easier to use join to prepare all the pairings and then filter out the redundant ones with $1<$8. The awk script finds the minimum column distance for each pair and records whether the last fields match ($7==$14). To find the minimum distance for each record, sort by first record number and distance; finally, sum the matched entries.

With your formulation I guess the result would be 100*2*7/10 = 140%, since you are double counting (R1~R7 and R7~R1); otherwise it is 70%.

UPDATE
With the new distance function, the script can be re-written as

$ join nfile{,} -j999 | 
  awk '$1<$8 {d=0; 
              for(i=2;i<7;i++) d+=($i-$(i+7))^2; 
              print $1,d,$7==$14}' | 
  sort -k1,1n -k2,2n -k3,3r | 
  awk '!a[$1]++{sum+=$3;count++} 
            END{print 100*sum/(count+1)"%"}'

70%

Explanation

cat -n file > nfile: creates a new file with record numbers. join can't take both files from stdin, so you have to create a temporary file.

join nfile{,} -j999: builds the cross product of the records; each record is joined with every record (the same effect as two nested loops).
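The cross product can be checked on a tiny input. The following is a minimal sketch, assuming GNU join; the two-line input and the temporary directory are made up for illustration and are not part of the original data.

```shell
# Hypothetical two-record input written to a temporary directory.
tmp=$(mktemp -d)
printf '0.1 0.2\n0.3 0.4\n' > "$tmp/file"
cat -n "$tmp/file" > "$tmp/nfile"

# Field 999 does not exist, so every line has the same (empty) join key
# and each line of the first file is paired with each line of the second.
# After the join, $1 and $4 hold the two record numbers.
pairs=$(join -j 999 "$tmp/nfile" "$tmp/nfile" | awk '{print $1, $4}')
echo "$pairs"
rm -r "$tmp"
```

With two records this prints the four pairings 1 1, 1 2, 2 1 and 2 2, one per line.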

$1<$8: keeps only the upper triangular section of the cross product (if you imagine it as a 2D matrix).
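As a quick sanity check of the n*(n-1)/2 figure, the sketch below counts the pairs that survive an i<j filter for n = 10. It uses plain nested loops in awk rather than join, purely as an illustration. Note that $8 is the second record number only because each numbered record has 7 fields; with k data fields it would be $(k+2).

```shell
# Count the (i, j) pairs with i < j among 10 records.
kept=$(awk 'BEGIN {
  n = 10
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
      if (i < j) c++          # same condition as $1<$8 on the joined lines
  print c
}')
echo "$kept"   # 45, i.e. 10*9/2
```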

for(i=2;i<7;i++) d+=($i-$(i+7))^2; : computes the squared distance between the two records of each pair.
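To see the loop at work, here is a small sketch that applies it to r0 and r1 from the sample dataset, written out by hand as one joined line (record number, c1..c6, then the same for the second record). The variable names are illustrative only.

```shell
# r0 and r1 from the sample, in the layout produced by the join step.
line='1 0.19 0.33 0.02 0.90 0.12 0.17 2 0.34 0.47 0.29 0.32 0.20 1.00'

# Fields 2..6 are c1..c5 of the first record; the matching fields of the
# second record sit 7 positions further right.
d=$(echo "$line" | awk '{
  d = 0
  for (i = 2; i < 7; i++) d += ($i - $(i + 7))^2
  printf "%.4f\n", d
}')
echo "$d"   # 0.4578 = 0.0225 + 0.0196 + 0.0729 + 0.3364 + 0.0064
```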

print $1,d,$7==$14: prints the from record, the squared distance, and an indicator of whether the last fields match.

The sort step orders the rows by from-record number and then by distance, so that the minimum for each record comes first; the 3rd field is sorted in reverse so that a 1 comes first if there is one.
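The sort-then-take-first idea can be tried on a few hand-made rows. This is only a sketch: the rows are invented, and it uses explicit per-field keys (-k1,1n -k2,2n -k3,3r) so that the distance is compared numerically on its own, which is an assumption of this example.

```shell
# Invented rows: from-record, squared distance, match indicator.
rows='1 0.50 0
1 0.20 1
2 0.30 0
2 0.10 0
1 0.20 0'

# Sort by record number, then distance, then indicator (1 before 0),
# and keep only the first row seen for each record.
nearest=$(printf '%s\n' "$rows" |
  LC_ALL=C sort -k1,1n -k2,2n -k3,3r |
  awk '!seen[$1]++')
echo "$nearest"
```

For these rows the result is "1 0.20 1" and "2 0.10 0": the smallest distance per record, with the matching row preferred when distances tie.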

!a[$1]++{sum+=$3;count++}: takes the first row for each from record, counts those rows, and sums their indicators.

END{print 100*sum/(count+1)"%"}: the number of records is one more than the number of from records (the last record never appears as a from record), so divide by count+1 and format the result as a percentage.

To understand what is going on, I suggest running each piped section in stages and verifying the intermediate results.

For your real data you have to change the hard-coded field references. The join field number should be greater than your field count.
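If hard-coded field numbers are a concern, one alternative is to do everything in a single awk program that reads the field count from the data. The sketch below is not the join pipeline above: it assumes the input has no header line or row labels, that every record has the same number of fields, and that the last field is the value to compare. Unlike the upper-triangle filter, it compares each record with all other records. The function name one_nn is made up for this example.

```shell
one_nn() {
  awk '
    {                                   # store every record in memory
      n = NR; nf = NF
      for (j = 1; j <= NF; j++) v[NR, j] = $j
    }
    END {
      for (r = 1; r <= n; r++) {
        best = 0
        for (o = 1; o <= n; o++) {
          if (o == r) continue          # skip the record itself
          d = 0
          for (j = 1; j < nf; j++) d += (v[r, j] - v[o, j])^2
          if (best == 0 || d < min) { min = d; best = o }
        }
        # count a hit when the nearest record has the same last field
        if (best && v[r, nf] == v[best, nf]) hits++
      }
      if (n) printf "%.1f%%\n", 100 * hits / n
    }' "$@"
}

# Usage with four invented records (two features plus a label):
printf '0 0 a\n0 1 a\n5 5 b\n5 6 a\n' | one_nn   # 50.0%
```

Holding 2000 records of 100 fields in an awk array should be manageable, but the double loop is quadratic in the number of records, so expect about four million distance computations.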
