I was asked to write my own implementation to remove duplicated values in an array. Here is what I have created. But after tests with 1,000,000 elements it took very long time to finish. Is there something that I can do to improve my algorithm or any bugs to remove ?
I need to write my own implementation - not to use Set
, HashSet
etc. Or any other tools such as iterators. Simply an array to remove duplicates.
public static int[] removeDuplicates(int[] arr) {
int end = arr.length;
for (int i = 0; i < end; i++) {
for (int j = i + 1; j < end; j++) {
if (arr[i] == arr[j]) {
int shiftLeft = j;
for (int k = j+1; k < end; k++, shiftLeft++) {
arr[shiftLeft] = arr[k];
}
end--;
j--;
}
}
}
int[] whitelist = new int[end];
for(int i = 0; i < end; i++){
whitelist[i] = arr[i];
}
return whitelist;
}
Okay, so you cannot use
Set
or other collections. One solution I don't see here so far is one based on the use of a Bloom filter, which essentially is an array of bits, so perhaps that passes your requirements.The Bloom filter is a lovely and very handy technique, fast and space-efficient, that can be used to do a quick check of the existence of an element in a set without storing the set itself or the elements. It has a (typically small) false positive rate, but no false negative rate. In other words, for your question, if a Bloom filter tells you that an element hasn't been seen so far, you can be sure it hasn't. But if it says that an element has been seen, you actually need to check. This still saves a lot of time if there aren't too many duplicates in your list (for those, there is no looping to do, except in the small probability case of a false positive --you typically chose this rate based on how much space you are willing to give to the Bloom filter (rule of thumb: less than 10 bits per unique element for a false positive rate of 1%).
There are many implementations of Bloom filters, see e.g. here or here, so I won't repeat that in this answer. Let us just assume the api described in that last reference, in particular, the description of
put(E e)
:An implementation using such a Bloom filter would then be:
Obviously, if you can alter the incoming array in place, then there is no need for an
ArrayList
: at the end, when you know the actual number of unique elements, justarraycopy()
those.}
Slight modification to the original code itself, by removing the innermost for loop.
Here is my solution. The time complexity is o(n^2)
For a sorted Array, just check the next index:
There exists many solution of this problem.
The sort approach
The set approach
You create a boolean array that represent the items all ready returned, (this depend on your data in the array).
If you deal with large amount of data i would pick the 1. solution. As you do not allocate additional memory and sorting is quite fast. For small set of data the complexity would be n^2 but for large i will be n log n.