I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions about nearest-neighbor calculations (e.g., see here, here, here, here, etc.), but none place constraints on which neighbors qualify for the search.
Is there a function (or other workaround) that would let me find the distance to a point's nearest neighbor, given that the neighbor meets some criterion (e.g., it must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
- For my example: specifying that a tree must also be in the same plot as the tree of interest, or be the same species as the tree of interest
I'd do it with a non-equi join in data.table.
EDIT: (FYI, this requires data.table 1.9.7, which you can get from GitHub; non-equi joins later shipped on CRAN in 1.9.8)
EDIT2: I did it with a copy of the data.table, since it seemed to be joining on its own thresholds. I'll fix that in future, but this works for now.
The final dataset contains each pair that satisfies the given thresholds.
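To illustrate the pair-building idea, here is a minimal sketch rather than the exact code from this answer. The toy data, the table names `trees` and `cand`, and the `nb_` column prefix are all assumptions made up for the example. It joins the table to a renamed copy of itself on "candidate height >= focal height", computes the distance for every qualifying pair, and then keeps the closest pair per tree.

```r
library(data.table)

# Toy data (hypothetical): five trees along a line
trees <- data.table(
  id     = 1:5,
  x      = c(0, 1, 3, 6, 10),
  y      = c(0, 0, 0, 0, 0),
  height = c(10, 12, 8, 12, 15)
)

# Renamed copy, so candidate columns cannot be confused with focal columns
cand <- copy(trees)
setnames(cand, paste0("nb_", names(cand)))

# Non-equi join: pair each focal tree (i) with every candidate (x)
# that is at least as tall, and compute the distance in j
pairs <- cand[trees,
  on = .(nb_height >= height),
  .(id = i.id, nb_id = nb_id,
    dist = sqrt((nb_x - i.x)^2 + (nb_y - i.y)^2)),
  allow.cartesian = TRUE
]

# A tree always matches itself (its height >= its own height), so drop self-pairs
pairs <- pairs[id != nb_id]

# Closest qualifying neighbor per tree; trees with no qualifying
# neighbor (here the tallest one) do not appear in the result
nearest <- pairs[, .SD[which.min(dist)], by = id]
```

Computing `dist` inside `j` avoids relying on the join columns in the output, whose values after a non-equi join can be surprising.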
EDIT:
With additional variables:
If you want to join on additional parameters, this approach lets you do it. (It's probably even faster if you also join on things like plot or species, since the cartesian join will be smaller.)
Here's an example joining on two additional categorical variables, species and plot:
NOTE: the final result has columns height and height.1. These appear to result from data.table's non-equi join and to represent the upper and lower boundaries, respectively.
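As a hedged sketch of the extra constraints (again with invented toy data and the assumed names `trees`, `cand`, and the `nb_` prefix), the equality conditions on species and plot simply go into `on` next to the height inequality. The `species` and `plot` columns keep the same names on both sides so they can be listed bare.

```r
library(data.table)

# Toy data (hypothetical): six trees, two species, two plots
trees <- data.table(
  id      = 1:6,
  x       = c(0, 1, 3, 4, 6, 7),
  y       = c(0, 0, 0, 0, 0, 0),
  height  = c(10, 12, 11, 9, 14, 12),
  species = c("oak", "pine", "oak", "oak", "oak", "pine"),
  plot    = c(1L, 1L, 1L, 2L, 2L, 1L)
)

# Rename only the columns that differ between focal tree and candidate
cand <- copy(trees)
setnames(cand, c("id", "x", "y", "height"),
               c("nb_id", "nb_x", "nb_y", "nb_height"))

# Equi join on species and plot, non-equi join on height
pairs <- cand[trees,
  on = .(species, plot, nb_height >= height),
  .(id = i.id, nb_id = nb_id,
    dist = sqrt((nb_x - i.x)^2 + (nb_y - i.y)^2)),
  allow.cartesian = TRUE
]

# Drop self-pairs, then keep the closest qualifying neighbor per tree
nearest <- pairs[id != nb_id][, .SD[which.min(dist)], by = id]
```

Because candidates are restricted to the same species and plot before distances are computed, the intermediate `pairs` table is smaller than in the height-only join.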
A memory-efficient solution
One issue here for @theforestecologist was that this approach requires a lot of memory
(in that case, an additional 42 columns were being multiplied by the cartesian join, which caused memory problems).
However, I believe we can do this more memory-efficiently by using .EACHI, since the full joined table is then never loaded into memory. That solution follows:
This way we should keep total memory usage low.
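A minimal sketch of the `by = .EACHI` idea, under the same assumptions as before (hypothetical toy data, assumed names `trees`, `cand`, and the `nb_` prefix): `j` is evaluated once per row of `trees`, so only that tree's candidates are in play at a time and only one summary row per tree is kept.

```r
library(data.table)

# Toy data (hypothetical): five trees along a line
trees <- data.table(
  id     = 1:5,
  x      = c(0, 1, 3, 6, 10),
  y      = c(0, 0, 0, 0, 0),
  height = c(10, 12, 8, 12, 15)
)

# Renamed copy used as the candidate side of the join
cand <- copy(trees)
setnames(cand, paste0("nb_", names(cand)))

# by = .EACHI runs j separately for each focal tree (each row of i)
nearest <- cand[trees,
  on = .(nb_height >= height),
  {
    d <- sqrt((nb_x - i.x)^2 + (nb_y - i.y)^2)
    d[nb_id == i.id] <- Inf          # a tree is not its own neighbor
    k <- which.min(d)
    found <- is.finite(d[k])         # FALSE when only the self-match exists
    list(id    = i.id,
         nb_id = if (found) nb_id[k] else NA_integer_,
         dist  = if (found) d[k] else NA_real_)
  },
  by = .EACHI
]
```

The result has one row per tree, with `NA` for trees that have no qualifying neighbor. It also carries the join column `nb_height` as a grouping column. Any extra columns in the real data are never copied into a large intermediate table, because only the columns used in `j` are touched.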