Long time lurker, first time asker.
I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ item dataset. Sample data looks like this:
#serially numbered items
parents <- rep(1:10000)
#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))
#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)) {
  if (numchild[x] > 0) {
    f1 <- sample(1:length(parents), size = numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  } else {
    parent_child <- c(parent_child, list(x + 1))  # if numchild <= 0, make up something
  }
}
Here is what I want to do: say parent item #1 has 5 children items -- 1, 2, 3, 4, 5 -- and parent item #2 has 3 children items -- 4, 10, 22.
I want to compute the length of the intersection of every (parent_i, parent_j) combination. In the above case it would be 1 common item -- 4.
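In R terms, for that example pair:

length(intersect(c(1, 2, 3, 4, 5), c(4, 10, 22)))  # 1 -- the shared item is 4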
I am doing this for 10M+ parent items that on average have 15-20 children items with a (0,100) range. So that's a 10M x 10M item-item matrix.
I have a foreach loop that I am testing out on a smaller subset. It works, but doesn't quite scale to the full dataset (64-core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.
#small subset
library(foreach)
library(doParallel)
registerDoParallel(cores = detectCores())  # a parallel backend must be registered for %dopar%

a <- parent_child[1:1000]
outerresults <- foreach (i = 1:(length(a) - 1), .combine = rbind,
                         .packages = c('foreach', 'doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[(i + 1):length(a)]  # note the parentheses; a[i+1:length(a)] indexes past the end
  foreach (j = 1:length(rest), .combine = rbind) %do% {  # a nested %dopar% would only run sequentially here
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1 = i, u2 = i + j, common)}
  }
}
I've been experimenting with variations on this (using Reduce, storing parent-children in a data frame, etc.) but haven't had much luck.
Is there a way to make this scale?
Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (also simpler). result.3 is identical to result.2 from the previous answer. I reversed the split, so that we have a child-parent relationship.
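A minimal sketch of that reversal, assuming the parent_child list from the question:

# map each child id to the vector of parent ids that contain it
child_parent <- split(
    rep(seq_along(parent_child), lengths(parent_child)),  # parent id, once per child it holds
    unlist(parent_child, use.names = FALSE)               # the child ids themselves
)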
Something like the following constructs a string for each pair of parents sharing a child, across all children, and then tallies how many children each pair has in common; a small variation on the tally runs a little more quickly. (All timings are over the full 10,000 parent-child elements.)
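A sketch of the idea, with f1 as a stand-in name for the pair-building helper and table() doing the tally:

# for every child, paste each (sorted) pair of its parents into a "p1.p2" key
f1 <- function(child_parent) {
    unlist(lapply(child_parent, function(p) {
        if (length(p) >= 2)
            combn(sort(p), 2, paste, collapse = ".")
    }), use.names = FALSE)
}

# tally: how many children each pair of parents has in common
common_counts <- table(f1(child_parent))
head(sort(common_counts, decreasing = TRUE))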
I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.
One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as the 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent, which helps somewhat.
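A sketch of how combn1 and a memoized f2 might look (the bodies are my reconstruction of the description, not the answer's exact code):

# cache the combn() index pairs, keyed on vector length, so combn() runs once
# per distinct number of parents-per-child rather than once per child
combn1 <- local({
    cache <- list()
    function(n) {
        key <- as.character(n)
        if (is.null(cache[[key]]))
            cache[[key]] <<- combn(n, 2)
        cache[[key]]
    }
})

f2 <- function(child_parent) {
    unlist(lapply(child_parent, function(p) {
        if (length(p) >= 2) {
            p <- sort(p)
            idx <- combn1(length(p))                  # cached 2 x choose(n, 2) index matrix
            paste(p[idx[1, ]], p[idx[2, ]], sep = ".")
        }
    }), use.names = FALSE)
}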
The slow part is now paste. We can avoid this by encoding the two parents sharing a child into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21. We then adjust our combn1 and f2 functions to work on these encoded values. The result compares very favorably with jlhoward's revised answer (again over all 10,000 parent-child relations) and scales in a much more reasonable way.
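A sketch of the adjustment: only f2 changes here, returning encoded numbers instead of pasted strings; the rle()-based tally and the decoding step are my own additions:

# adjusted f2: encode a sorted parent pair (p1, p2) as the single number p1 * 2^21 + p2;
# doubles hold integers exactly up to 2^53, so this is safe while parent ids stay below 2^21
f2 <- function(child_parent) {
    unlist(lapply(child_parent, function(p) {
        if (length(p) >= 2) {
            p <- sort(p)
            idx <- combn1(length(p))
            p[idx[1, ]] * 2^21 + p[idx[2, ]]
        }
    }), use.names = FALSE)
}

# tally without going through character strings at all, then decode the pair ids
enc <- sort(f2(child_parent))
runs <- rle(enc)
result <- data.frame(u1 = runs$values %/% 2^21,
                     u2 = runs$values %% 2^21,
                     common = runs$lengths)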
For what it's worth, the data generation routine is in the second circle of Patrick Burns's R Inferno, using the 'copy-and-append' algorithm rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand, or by sampling from a distribution (rpois, rbinom) that generates positive integer values.
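Data generation then reduces to something like this (a sketch; rpois() and the pmax() guard against zero counts are my choices, not necessarily the answer's exact code):

n_parents <- 10000
# rpois() gives non-negative integer counts directly; pmax() guards against the rare zero
numchild <- pmax(rpois(n_parents, lambda = 30), 1)
# build the whole list in one go instead of copy-and-append
parent_child <- lapply(numchild, function(k) sample(n_parents, size = k))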
Well, a small improvement (I think):
Original code (wrapped in function call):
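Roughly, the questioner's loop wrapped in a function (a reconstruction for reference, not necessarily the exact code that was benchmarked):

original <- function(a) {
    foreach (i = 1:(length(a) - 1), .combine = rbind) %dopar% {
        b <- a[[i]]
        rest <- a[(i + 1):length(a)]
        foreach (j = 1:length(rest), .combine = rbind) %do% {
            common <- length(intersect(b, rest[[j]]))
            if (common > 0) data.frame(u1 = i, u2 = i + j, common)
        }
    }
}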
Modified code:
Benchmarks:
The numbering for u2 is a little different because of the different approaches, but both produce the same vector of matches:
I tried this with data.table joins replacing intersect(...) and it was actually much slower(!!)