R function to calculate nearest neighbor distance

2020-07-30 01:47发布

问题:

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.

I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.

Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?

[An even more complex set of constraints would be even more helpful...]

  • For my example: specifying that a tree must also be in the same plot as the tree of interest or is the same species as the tree of interest

回答1:

I'd do it with non-equijoins and data.table

EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)

EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own threshholds. I'll fix that in future, but this works for now.

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1), 
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]

The final dataset contains each pair, according to the given threshholds

EDIT:

With Additional variables:

If you want to join on additional parameters, this allows you to do it, (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller)

Here's an example joining on two additional categorical variables, species and plot:

 library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1,
                            species == species,
                            plot == plot),
              nomatch = NA,
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final

> final
      id         x         y    height species plot  height.1 i.id       i.x       i.y  i.height        dist
  1:   3 0.4837348 0.4325731  91.53387       C    2 111.53387  486 0.5549221 0.4395687 101.53387 0.005116568
  2:  13 0.8267298 0.3137061  94.58949       C    2 114.58949  754 0.8408547 0.2305702 104.58949 0.007111079
  3:  29 0.2905729 0.4952757  89.52128       C    2 109.52128  333 0.2536760 0.5707272  99.52128 0.007054301
  4:  37 0.4534841 0.5249862  89.95493       C    2 109.95493   72 0.4807242 0.6056771  99.95493 0.007253044
  5:  63 0.1678515 0.8814829  84.77450       C    2 104.77450  289 0.1151764 0.9728488  94.77450 0.011122404
 ---                                                                                                        
994: 137 0.8696393 0.2226888  66.57792       C    2  86.57792  473 0.4467795 0.6881008  76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466       A    2 130.14466  338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821       A    2 133.61821  348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652  71.39953       C    3  91.39953  904 0.6954973 0.3690599  81.39953 0.112536771
998: 172 0.6834473 0.6221259  65.52187       A    2  85.52187  783 0.4400028 0.9526355  75.52187 0.168501816
> 

NOTE: in the final answer, there are columns height and height.1, the latter appears to result from data.table's equi join and represent the upper and lower boundary respectively.

A Mem-efficient solution

One of the issues here for @theforestecologist was that this requires a lot of memory to do,

(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused mem issues),

However, we can do this in a more memory efficient way by using .EACHI (I believe). Since we will not load the full table into memory. That solution follows:

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]


# for clarity inside the brackets, 
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2 

# Join on a range, must be a cartesian join, since there are many candidates 
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2, 
                                 height <= thresh1,
                                 species,
                                 plot),
              .(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
              by = .EACHI,
              nomatch = NA,
              allow.cartesian = TRUE]

# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]

> test
        id          x          y    height species plot i.id        i.x        i.y  i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
   1:    1 0.17622235 0.66547312  84.68450       B    2  965 0.17410840 0.63219350  93.60226         B      2   74.68450   94.68450           B        2
   2:    2 0.04523011 0.33813054  89.46288       B    2  457 0.07267547 0.35725229  88.42827         B      2   79.46288   99.46288           B        2
   3:    3 0.24096368 0.32649256 103.85870       C    3  202 0.20782303 0.38422814  94.35898         C      3   93.85870  113.85870           C        3
   4:    4 0.53160655 0.06636979 101.50614       B    1  248 0.47382417 0.01535036 103.74101         B      1   91.50614  111.50614           B        1
   5:    5 0.83426727 0.55380451 101.93408       C    3  861 0.78210747 0.52812487  96.71422         C      3   91.93408  111.93408           C        3

This way we should keep total memory usage low.