I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of hamming distance.
I used the e1071 package to compute the hamming distance. The hamming distance between two vectors x and y can be computed, for example, as:
library(e1071)  # provides hamming.distance()
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
# one column of 32 bits per element
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
# sum the per-element bit differences
distance <- sum(sapply(seq_len(ncol(xm)), function(i) hamming.distance(xm[,i], ym[,i])))
The resulting distance is 25. However, I need to do this for all rows of df1 and df2. The trivial method is a doubly nested loop, which looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
- a column with the row id from df1 that gives the lowest distance;
- a column with the lowest distance;
- a column with the row id from df1 that gives the 2nd lowest distance;
- a column with the second lowest distance.
Thanks.
Please don't be surprised that I am posting a separate answer. This part covers something related: it is not what the OP asks for, but it may help other readers.
General hamming distance computation
In the previous answer, I started from a function hmd0 that computes the hamming distance between two integer vectors of the same length. This means that, given two integer vectors, we end up with a scalar.
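As a rough illustration of the scalar case, here is a minimal base R sketch. The name `hmd0_sketch` is hypothetical and the body is an assumption about how such a function could be written, not the code from the answer; the vectors are the ones from the question.

```r
# Hypothetical stand-in for a scalar hamming distance function (an assumption,
# not the answer's original code): count differing bits over all elements.
hmd0_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  # intToBits() yields 32 raw "bits" per integer; compare them in one vectorized step
  sum(intToBits(as.integer(x)) != intToBits(as.integer(y)))
}

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)

hmd0_sketch(x, y)  # 25, the value reported in the question
```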
What if we want to compute the pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do.
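One way such a modification might look is sketched below. `hmd_pairwise_sketch` is a hypothetical name, and grouping the bits with a 32-row matrix is an assumption of this sketch rather than the answer's own implementation.

```r
# Hypothetical sketch: element-wise hamming distance between x[i] and y[i].
hmd_pairwise_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  diff_bits <- intToBits(as.integer(x)) != intToBits(as.integer(y))
  # 32 rows per column: column i holds the bit differences of the pair (x[i], y[i])
  as.integer(colSums(matrix(diff_bits, nrow = 32L)))
}

hmd_pairwise_sketch(c(7L, 8L, 5L), c(0L, 15L, 5L))  # 3 3 0
```

Summing the result gives back the scalar distance between the two vectors.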
Hamming distance matrix
Suppose we want to compute the hamming distance matrix, that is, the distance between every element of x and every element of y. We can compute the distance matrix between x and y, or between x and itself.
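A small sketch of both cases with `outer` follows. The helper `pairwise_hamming` is a hypothetical, vectorized element-wise distance written for this illustration (`outer` requires its function argument to be vectorized); the toy vectors are made up.

```r
# Hypothetical vectorized helper: distance between x[i] and y[i] for each i.
pairwise_hamming <- function(x, y) {
  diff_bits <- intToBits(as.integer(x)) != intToBits(as.integer(y))
  as.integer(colSums(matrix(diff_bits, nrow = 32L)))
}

x <- c(7L, 8L, 5L)
y <- c(0L, 15L)

# Distance matrix between x (rows) and y (columns)
outer(x, y, pairwise_hamming)
#      [,1] [,2]
# [1,]    3    1
# [2,]    1    3
# [3,]    2    2

# Distance matrix of x with itself
outer(x, x, pairwise_hamming)
#      [,1] [,2] [,3]
# [1,]    0    4    1
# [2,]    4    0    3
# [3,]    1    3    0
```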
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R and should be very fast, especially when your df1 and df2 have many rows. The main reason is that it does not use any R-level looping (for loops, while loops, or *apply functions) to calculate the Hamming distances. Instead, it uses matrix multiplication, which in R is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) it uses partial sorting to find the best two matches for each row in df2, and (2) it stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in a single step, without any R-level loops.
The function that does all the work:
To call the function on some random data:
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
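To make the matrix-multiplication idea concrete, here is a minimal sketch under stated assumptions. The names `to_bit_matrix` and `hamming_matrix` are hypothetical, the random data is made up, and, unlike the approach described above, this sketch picks the two best matches with a full `order()` per row rather than partial sorting.

```r
# Hypothetical helper: expand an integer data frame/matrix into a 0/1 matrix
# with one row per original row and 32 columns per original column.
to_bit_matrix <- function(df) {
  m <- as.matrix(df)
  # t(m) is flattened column by column, i.e. row by row of m
  bits <- as.integer(intToBits(as.integer(t(m))))
  matrix(bits, nrow = nrow(m), byrow = TRUE)
}

# Hypothetical helper: D[i, j] = Hamming distance between row i of df2 and row j of df1.
hamming_matrix <- function(df1, df2) {
  B1 <- to_bit_matrix(df1)
  B2 <- to_bit_matrix(df2)
  # A bit differs when it is 1 in one row and 0 in the other.
  tcrossprod(B2, 1 - B1) + tcrossprod(1 - B2, B1)
}

set.seed(1)
X <- as.data.frame(matrix(sample.int(1e6, 200 * 26, replace = TRUE), ncol = 26))
Y <- as.data.frame(matrix(sample.int(1e6, 150 * 26, replace = TRUE), ncol = 26))
D <- hamming_matrix(X, Y)  # 150 x 200 matrix of distances

# Best and second-best reference row for each row of Y (full sort, for simplicity).
best_two <- t(apply(D, 1, function(d) {
  o <- order(d)[1:2]
  c(id1 = o[1], d1 = d[o[1]], id2 = o[2], d2 = d[o[2]])
}))
head(best_two)
```

The two `tcrossprod` terms count, for every pair of rows, the positions where one bit is 1 and the other is 0, which is exactly the number of differing bits.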
Fast computation of hamming distance between two integer vectors of equal length
As I said in my comment, we can compute the hamming distance between two integer vectors x and y of equal length using only base R. This is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer shows what we should do if we want pairwise hamming distance.)
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
Note that:
- hmd performs computation column-wise. It is designed to be CPU cache friendly. Therefore, if we want to do some row-wise computation, we should transpose the matrix first;
- it uses tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between df1 and each row of df2. The argument p is an integer specifying how many results you want to retain. p = 3 will keep the smallest 3 distances together with their row ids in df1.
Note that:
- a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (row id id and distance d).
Experiment
This part uses a small dataset to test/demonstrate our functions.
Some toy data:
Test hmd first (needs transposition):
Test foo:
If you want to append some columns to df2, you know what to do, right?
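As a rough, self-contained sketch of the whole workflow (not the code from the answer itself), the following uses two hypothetical helpers: `hmd_cols`, a column-wise matrix/vector distance, and `best_matches`, which loops over the rows of df2 and keeps the p closest rows of df1. The toy data and all names are assumptions made for illustration, and `colSums` is used here where the answer mentions `tapply()`.

```r
# Hypothetical helper: distance between each column of integer matrix X and vector y.
hmd_cols <- function(X, y) {
  stopifnot(nrow(X) == length(y))
  # intToBits() flattens X column by column; the bits of y are recycled across columns.
  diff_bits <- intToBits(as.integer(X)) != intToBits(as.integer(y))
  as.integer(colSums(matrix(diff_bits, nrow = 32L * nrow(X))))
}

# Hypothetical helper: for each row of df2, the p closest rows of df1.
best_matches <- function(df1, df2, p = 2L) {
  X <- t(as.matrix(df1))  # transpose so that each reference row is a column
  Y <- as.matrix(df2)
  id <- d <- matrix(0L, nrow(Y), p)
  for (i in seq_len(nrow(Y))) {
    dist_i <- hmd_cols(X, Y[i, ])
    o <- order(dist_i)[seq_len(p)]  # indices of the p smallest distances
    id[i, ] <- o
    d[i, ] <- dist_i[o]
  }
  list(id = id, d = d)
}

# Toy data (made up for this sketch)
df1 <- data.frame(a = c(0L, 1L, 3L, 7L), b = c(0L, 0L, 0L, 7L))
df2 <- data.frame(a = c(0L, 3L), b = c(0L, 1L))

hmd_cols(t(as.matrix(df1)), c(3L, 1L))  # 3 2 1 3

res <- best_matches(df1, df2, p = 2L)
df2_out <- cbind(df2,
                 best_id = res$id[, 1], best_dist = res$d[, 1],
                 second_id = res$id[, 2], second_dist = res$d[, 2])
df2_out
#   a b best_id best_dist second_id second_dist
# 1 0 0       1         0         2           1
# 2 3 1       3         1         2           2
```

Ties are resolved in favour of the lower row id, because `order()` keeps equal distances in their original order.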