What is the most efficient way to find the size of the intersection of two non-sparse Sets in Java? This is an operation I will be calling on large sets a very large number of times, so optimisation is important. I cannot modify the original sets.
I have looked at Apache Commons CollectionUtils.intersection which appears to be quite slow. My current approach is to take the smaller of the two sets, clone it, and then call .retainAll on the larger of the two sets.
public static int getIntersection(Set<Long> set1, Set<Long> set2) {
boolean set1IsLarger = set1.size() > set2.size();
Set<Long> cloneSet = new HashSet<Long>(set1IsLarger ? set2 : set1);
cloneSet.retainAll(set1IsLarger ? set1 : set2);
return cloneSet.size();
}
If both sets can be sorted, like
TreeSet
running both iterators could be a faster way to count the number of shared objects.If you do this operation often, it might bring a lot if you can wrap the sets so that you can cache the result of the intersection operation keeping a
dirty
flag to track validity of the cached result, calculating again if needed.Can the members of the sets be easily mapped into a relatively small range of integers? If so, consider using BitSets. Intersection then is just bitwise and's - 32 potential members at a time.
Using Java 8 stream:
You can avoid all manual work by using the Set method retainAll().
From docs:
Run some tests with the posted approach and versus constructing a new HashSet. That is, let
A
be the smaller of the sets andB
be the larger set and then, for each item inA
, if it also exists in B then add it to C (a new HashSet) -- for just counting, the intermediate C set can be skipped.Just as the posted approach, this should be a
O(|A|)
in cost as the iteration isO(|A|)
and probe into B isO(1)
. I have no idea how it will compare vs. the clone-and-remove approach.Happy coding -- and post some results ;-)
Actually, on further thinking, I believe this has slightly better bounds than the method in the post:
O(|A|)
vsO(|A| + |B|)
. I have no idea if this will make any difference (or improvement) in actuality and I would only expect it to be relevant when|A| <<< |B|
.Okay, so I was really bored. At least on JDK 7 (Windows 7 x64), it appears the method in presented in the post is slower than the above approach -- by a good (albeit what appears to be mostly constant) factor. My eye-ball guesstimate says it is about four times as slow than the above suggestion that just uses a counter and twice as slow of when creating a new HashSet. This seems to be "roughly consistent" across the different initial set sizes.
(Please keep in mind that, as Voo pointed out, the numbers above and this micro-benchmark assume a HashSet is being used! And, as always, there are dangers with micro-benchmarks. YMMV.)
Here are the ugly results (times in milliseconds):
And here is the ugly (and possibly flawed) micro-benchmark:
Just use Google Guava's
Sets#intersection(Set, Set)
method.