I am trying to optimize a piece of code which compares the elements of two sets. E.g.:
    public void compare(Set<Record> firstSet, Set<Record> secondSet) {
        for (Record firstRecord : firstSet) {
            for (Record secondRecord : secondSet) {
                // comparing logic
            }
        }
    }
Please take into account that the number of records in the sets will be high.
Thanks
Shekhar
There's an O(N) solution for very specific cases, where both sets are sorted and use the same ordering. The following code assumes that both sets are sorted based on the records being Comparable, as sketched below. A similar method could be based on a Comparator.
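Something like this, where SortedSetComparer is just an illustrative name and Record is assumed to implement Comparable<Record>:

    import java.util.Iterator;
    import java.util.SortedSet;

    public class SortedSetComparer {

        // Walks both sorted sets in lockstep; O(N) total because each
        // iterator is advanced exactly once per element.
        public static <T extends Comparable<T>> boolean sameElements(
                SortedSet<T> first, SortedSet<T> second) {
            if (first.size() != second.size()) {
                return false;
            }
            Iterator<T> it1 = first.iterator();
            Iterator<T> it2 = second.iterator();
            while (it1.hasNext()) {
                if (it1.next().compareTo(it2.next()) != 0) {
                    return false;
                }
            }
            return true;
        }
    }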
If you simply want to know if the sets are equal, the equals method on AbstractSet is implemented roughly as below. Note how it optimizes the common cases where the two sets are the same object, and where the two sets have different sizes.
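Here is a rough paraphrase of the JDK code (exact details vary between versions):

    public boolean equals(Object o) {
        if (o == this) {
            return true;                 // same object: trivially equal
        }
        if (!(o instanceof Set)) {
            return false;                // not a Set at all
        }
        Collection<?> c = (Collection<?>) o;
        if (c.size() != size()) {
            return false;                // different sizes: cannot be equal
        }
        try {
            return containsAll(c);       // the expensive part
        } catch (ClassCastException | NullPointerException unused) {
            return false;
        }
    }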
After that, containsAll(...) will return false as soon as it finds an element in the other set that is not also in this set. But if all elements are present in both sets, it will need to test all of them. The worst-case performance therefore occurs when the two sets are equal but not the same objects. That cost is typically O(N) or O(NlogN) depending on the implementation of this.containsAll(c). And you get close-to-worst-case performance if the sets are large and only differ in a tiny percentage of the elements.
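For reference, the default containsAll(...) inherited from AbstractCollection is essentially the following loop, so the overall cost is N calls to contains(e): O(1) each for a HashSet, O(logN) each for a TreeSet:

    public boolean containsAll(Collection<?> c) {
        for (Object e : c) {
            if (!contains(e)) {
                return false;    // short-circuits on the first missing element
            }
        }
        return true;             // equal sets pay for all N contains(e) calls
    }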
UPDATE
If you are willing to invest time in a custom set implementation, there is an approach that can improve the "almost the same" case.
The idea is that you pre-calculate and cache a hash for the entire set, so that you can get the set's current hashcode value in O(1). Then you can compare the hashcodes of the two sets as an acceleration. How could you implement a hashcode like that? Well, if the set hashcode was the XOR of the hashcodes of its elements,
then you could cheaply update the set's cached hashcode each time you added or removed an element. In both cases, you simply XOR the element's hashcode with the current set hashcode.
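A minimal sketch of that idea as a wrapper class (CachedHashSet is a made-up name, and null elements and bulk operations such as addAll are not handled):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: a set that maintains its hashcode incrementally via XOR.
    public class CachedHashSet<E> {
        private final Set<E> elements = new HashSet<>();
        private int cachedHash = 0;

        public boolean add(E e) {
            boolean added = elements.add(e);
            if (added) {
                cachedHash ^= e.hashCode();   // fold the new element in
            }
            return added;
        }

        public boolean remove(Object o) {
            boolean removed = elements.remove(o);
            if (removed) {
                cachedHash ^= o.hashCode();   // XOR is its own inverse
            }
            return removed;
        }

        // O(1), unlike recomputing a hash over all N elements.
        public int cachedHashCode() {
            return cachedHash;
        }

        // Fast path: different cached hashes mean definitely unequal;
        // equal hashes still require the O(N) element-by-element check.
        public boolean fastEquals(CachedHashSet<E> other) {
            return this.cachedHash == other.cachedHash
                    && this.elements.equals(other.elements);
        }
    }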
Of course, this assumes that element hashcodes are stable while the elements are members of sets. It also assumes that the element class's hashCode function gives a good spread. That is because when the two set hashcodes are the same, you still have to fall back to the O(N) comparison of all elements. You could take this idea a bit further ... at least in theory.
WARNING - This is highly speculative. A "thought experiment" if you like.
Suppose that your set element class has a method to return a crypto checksum for the element. Now implement the set's checksum by XORing the checksums returned for the elements.
What does this buy us?
Well, if we assume that nothing underhand is going on, the probability that any two unequal set elements have the same N-bit checksum is 2^-N. And the probability that two unequal sets have the same N-bit checksum is also 2^-N. So my idea is that you can implement equals as:
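For example, something along these lines, where ChecksummedSet and its checksum field are hypothetical, and the checksum is assumed to be maintained on every add and remove:

    import java.math.BigInteger;

    // Hypothetical sketch: equality decided purely by cached crypto checksums.
    public class ChecksummedSet {
        // XOR of the N-bit crypto checksums of the elements,
        // updated on every add and remove (not shown).
        private BigInteger checksum = BigInteger.ZERO;

        @Override
        public boolean equals(Object o) {
            if (o == this) {
                return true;
            }
            if (!(o instanceof ChecksummedSet)) {
                return false;
            }
            ChecksummedSet other = (ChecksummedSet) o;
            // No O(N) fallback: equal checksums are treated as equal sets,
            // accepting a 2^-N probability of a false positive.
            return this.checksum.equals(other.checksum);
        }

        @Override
        public int hashCode() {
            return checksum.hashCode();
        }
    }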
Under the assumptions above, this will only give you the wrong answer with probability 2^-N. If you make N large enough (e.g. 512 bits), the probability of a wrong answer becomes negligible (e.g. roughly 10^-150).
The downside is that computing the crypto checksums for elements is very expensive, especially as the number of bits increases. So you really need an effective mechanism for memoizing the checksums. And that could be problematic.
And the other downside is that a non-zero probability of error may be unacceptable no matter how small the probability is. (But if that is the case ... how do you deal with the case where a cosmic ray flips a critical bit? Or if it simultaneously flips the same bit in two instances of a redundant system?)