Does a function like dist/rdist exist which handle

2019-03-04 01:35发布

问题:

I'm using rdist function from fields package, but now I want to handle NAs in my matrix, like the dist function does.

There exist such a function?

One solution would be to use dist directly, but my matrix has over 150K rows, so this is not an option.

Edit: Note than removing rows or columns with complete.cases or na.omit is not the solution I'm looking for. The intended behaviour is described in the help dist function:

Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.

I add a sample code to ilustrate this. Given this vectors:

vx <- matrix(c(1,2,3), nrow=1)
vy <- matrix(c(2,7,10), nrow=1)
vy.na <- matrix(c(2,NA,10), nrow=1)

dist calculates the distance ignoring the 2nd column and scaling up to a 3 columns, so

dist(rbind(vx,vy))
dist(rbind(vx,vy.na))
rdist(vx,vy)

all return the same => 8.660254

but

rdist(vx,na.omit(vy.na))

Does not return any distance value because na.omit omits the whole row.

On the other hand calcuating the distances by pairs of vectors individually is a way slower than rdist.

My alternate solution is to fill NA with a 'neutral' value (like the median of that column) but I would prefer the dist behaviour.

回答1:

Edit

After looking at this post dist handling na's I think it looks like there really isn't anyway to get rdist to handle NA values. Also, that post looks at how dist might compensate for removing missing values.

Using this information I wrote this following script.

rdist.alt <- function (x1, x2, na.rm=TRUE) 
    {
    lx <- length(x1)
      if (missing(x2)) 
            x2 <- x1
    if (!as.matrix(x1))
        x1 <- as.matrix(x1)
    if (class(x2) == "matrix")
        x2 <- as.vector(x2)
    if (na.rm)
        na.id <- is.na(x1) | is.na(x2)
        x1 <- x1[!na.id]
        x2 <- x2[!na.id]
        lxa <- length(x1)
    eucd <- sqrt(sum((x1 - x2)^2) * lx/lxa)
    return(eucd)
    }

dist(rbind(vx,vy))
dist(rbind(vx,vy.na))
rdist(vx,vy)
rdist.alt(vx,vy)
rdist.alt(vx,vy.na)

all return 8.660254 where as:

rdist(vx,vy.na)
Error in rdist(vx, vy.na) : NA/NaN/Inf in foreign function call (arg 4)

as rdist does not handle missing values.

However, if you are looking to input a matrix and expect output like this from dist

 dist(xy)
           1         2         3         4         5         6         7         8         9
2  1.9305914                                                            
3  2.2242914 2.8088390                                                  
4  3.1357792 2.1320489 2.1348279                                        
5  1.1663478 1.1691107 2.5429175 0.1244745                              
6  5.0549708 4.1017549 3.4565211 2.0071521 0.1149399                    
7  6.2926407 5.0060108 4.9231242 3.1572273 1.5159374 1.5263946          
8  7.3670783 6.0345762 5.9742805 4.2325789 2.0813721 2.5321716 1.0769671
9  8.0027390 7.1469945 6.1154624 5.0492331 0.8702670 3.0471724 2.6166746 2.3143221 
10 9.0061376 8.1080028 7.1207560 6.0279981 0.6962094 4.0210617 3.3833115 2.8031196 1.0075455

You will have to modify the script above. Hope this helps.

---------Original Post-----------------

You can always call na.omit on the matrix before passing it to the rdist function.

E.g.

xy <- structure(list(x = c(1L, 2L, 3L, 4L, NA, 6L, 7L, 8L, 9L, 10L), y = c(-1.07436356530045, 0.577054958924021, -2.0477453543004, -0.161614353806037, -0.249631114549562, -0.33090588210086, 0.822298505061525, 1.22212120980467, -0.865002838232734, -0.741925512264102)), .Names = c("x", "y"), row.names = c(NA, -10L), class = "data.frame")

xy2 <- na.omit(xy)
rdist(xy2)

Or if you don't care about retaining the NA values

xy <- na.omit(xy)
rdist(xy)


回答2:

After reading the answer of @deHaas and his comments I could write an efficient version of rdist that handles NAs as dist

library(pdist)

rdist.w.na <- function(X,Y)
{
  if (!is.matrix(X)) 
    X = as.matrix(X)
  if (!is.matrix(Y)) 
    Y = as.matrix(Y)
  distances <- matrix(pdist(X,Y)@dist, ncol=nrow(X), byrow = TRUE)
  #count NAs
  na.count <- sapply(1:nrow(X),function(i){rowSums(is.na(Y) | is.na(X[i,]))})
  #scaling to number of cols
  distances * sqrt(ncol(X)/(ncol(X) - na.count))
}

In particular rdist.w.na(X,X) is equivalent to dist(X), but it returns a full symmetric matrix instead a lower triangular one.