I'm using rdist
function from fields
package, but now I want to handle NAs in my matrix, like the dist
function does.
There exist such a function?
One solution would be to use dist
directly, but my matrix has over 150K rows, so this is not an option.
Edit:
Note than removing rows or columns with complete.cases
or na.omit
is not the solution I'm looking for. The intended behaviour is described in the help dist
function:
Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.
I add a sample code to ilustrate this. Given this vectors:
vx <- matrix(c(1,2,3), nrow=1)
vy <- matrix(c(2,7,10), nrow=1)
vy.na <- matrix(c(2,NA,10), nrow=1)
dist
calculates the distance ignoring the 2nd column and scaling up to a 3 columns, so
dist(rbind(vx,vy))
dist(rbind(vx,vy.na))
rdist(vx,vy)
all return the same => 8.660254
but
rdist(vx,na.omit(vy.na))
Does not return any distance value because na.omit
omits the whole row.
On the other hand calcuating the distances by pairs of vectors individually is a way slower than rdist
.
My alternate solution is to fill NA with a 'neutral' value (like the median of that column) but I would prefer the dist
behaviour.
Edit
After looking at this post dist handling na's I think it looks like there really isn't anyway to get rdist
to handle NA values. Also, that post looks at how dist
might compensate for removing missing values.
Using this information I wrote this following script.
rdist.alt <- function (x1, x2, na.rm=TRUE)
{
lx <- length(x1)
if (missing(x2))
x2 <- x1
if (!as.matrix(x1))
x1 <- as.matrix(x1)
if (class(x2) == "matrix")
x2 <- as.vector(x2)
if (na.rm)
na.id <- is.na(x1) | is.na(x2)
x1 <- x1[!na.id]
x2 <- x2[!na.id]
lxa <- length(x1)
eucd <- sqrt(sum((x1 - x2)^2) * lx/lxa)
return(eucd)
}
dist(rbind(vx,vy))
dist(rbind(vx,vy.na))
rdist(vx,vy)
rdist.alt(vx,vy)
rdist.alt(vx,vy.na)
all return 8.660254 where as:
rdist(vx,vy.na)
Error in rdist(vx, vy.na) : NA/NaN/Inf in foreign function call (arg 4)
as rdist
does not handle missing values.
However, if you are looking to input a matrix and expect output like this from dist
dist(xy)
1 2 3 4 5 6 7 8 9
2 1.9305914
3 2.2242914 2.8088390
4 3.1357792 2.1320489 2.1348279
5 1.1663478 1.1691107 2.5429175 0.1244745
6 5.0549708 4.1017549 3.4565211 2.0071521 0.1149399
7 6.2926407 5.0060108 4.9231242 3.1572273 1.5159374 1.5263946
8 7.3670783 6.0345762 5.9742805 4.2325789 2.0813721 2.5321716 1.0769671
9 8.0027390 7.1469945 6.1154624 5.0492331 0.8702670 3.0471724 2.6166746 2.3143221
10 9.0061376 8.1080028 7.1207560 6.0279981 0.6962094 4.0210617 3.3833115 2.8031196 1.0075455
You will have to modify the script above. Hope this helps.
---------Original Post-----------------
You can always call na.omit
on the matrix before passing it to the rdist
function.
E.g.
xy <- structure(list(x = c(1L, 2L, 3L, 4L, NA, 6L, 7L, 8L, 9L, 10L), y = c(-1.07436356530045, 0.577054958924021, -2.0477453543004, -0.161614353806037, -0.249631114549562, -0.33090588210086, 0.822298505061525, 1.22212120980467, -0.865002838232734, -0.741925512264102)), .Names = c("x", "y"), row.names = c(NA, -10L), class = "data.frame")
xy2 <- na.omit(xy)
rdist(xy2)
Or if you don't care about retaining the NA values
xy <- na.omit(xy)
rdist(xy)
After reading the answer of @deHaas and his comments I could write an efficient version of rdist
that handles NAs as dist
library(pdist)
rdist.w.na <- function(X,Y)
{
if (!is.matrix(X))
X = as.matrix(X)
if (!is.matrix(Y))
Y = as.matrix(Y)
distances <- matrix(pdist(X,Y)@dist, ncol=nrow(X), byrow = TRUE)
#count NAs
na.count <- sapply(1:nrow(X),function(i){rowSums(is.na(Y) | is.na(X[i,]))})
#scaling to number of cols
distances * sqrt(ncol(X)/(ncol(X) - na.count))
}
In particular rdist.w.na(X,X)
is equivalent to dist(X)
, but it returns a full symmetric matrix instead a lower triangular one.