Here is a common interview question I came across; however, I failed to improve my solution in the way it demands.
Assume we have an int array int[] A, and we want to find the first duplicate entry.
Almost everyone can think of using a HashSet and adding to it while scanning; this leads to O(n) time and O(n) space. After this I was asked to solve it without other data structures. I said the dumbest idea would be comparing every pair, in O(n^2) time. Then I was asked to improve on the O(n^2) time.
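For reference, a minimal sketch of that HashSet baseline (returning -1 when there is no duplicate is my convention):

    import java.util.HashSet;
    import java.util.Set;

    // O(n) time, O(n) extra space: remember everything seen so far.
    static int firstDuplicateHashSet(int[] A) {
        Set<Integer> seen = new HashSet<>();
        for (int value : A) {
            if (!seen.add(value)) {   // add() returns false if value was already present
                return value;         // first duplicate in scan order
            }
        }
        return -1;                    // no duplicate
    }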
To improve it, I thought of using a fixed-size array (assuming the maximum number is n): boolean[] b = new boolean[n]; however, I was not allowed to use this method.
Then I thought of using an int variable with bit manipulation: if the maximum value is smaller than 32, then for each entry we can shift 1 left by that value and OR it into a checker; ANDing the checker with the next entry's bit tells us whether that value was seen before. E.g.:

    int c = A[i]; if ((check & (1 << c)) != 0) return c; check |= 1 << c;

However, this was not allowed either.
So there was a hint that I can use the array itself as a hash set/hash table, with "linear hashing"?
Any help? Thanks.
This does not use linear hashing, but works faster than O(N^2). In its simplest form: hash elements N+1 .. 3*N/2 into a hash table (using the first N elements of the array as storage) and search for all other elements of the array in this hash table; then do the same for elements 3*N/2+1 .. 2*N. This is about twice as slow as the fully optimized version, but still O(N log N) on average.
An alternative is to use the first N free elements to construct a binary search tree for elements N+1 .. 3*N/2 and search for all other elements of the array in this tree; then do the same for elements 3*N/2+1 .. 2*N. (This works only if the array is small enough and its elements may be indexed by integer values.)
The algorithm described above is probabilistic and runs in O(N log N) time on average; its worst-case complexity is O(N^2). The alternative with a binary search tree may have O(N log^2 N) worst-case complexity if the tree is self-balancing, but this is complicated. It is possible to do the task in O(N log^2 N) worst-case time with a simpler algorithm.
This algorithm sequentially iterates through the array and keeps the following invariant: the largest possible sub-array with power-of-two size that fits to the left of the current position starts at index 0 and is sorted; the next such sub-array follows it and is also sorted; and so on. In other words, the binary representation of the current index describes the sizes of the sorted sub-arrays preceding it. For example, for index 87 (1010111 in binary) we have a single element at index 86, a sorted pair at index 84, a sorted sub-array of 4 elements at index 80, a sorted sub-array of 16 elements at index 64, and a sorted sub-array of 64 elements at the beginning of the array.
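A sketch of this invariant in Java (my reconstruction: Arrays.sort stands in for an in-place merge of the trailing runs, which keeps the idea visible; a true O(1)-extra-space merge would be needed to fully honor the no-extra-memory constraint):

    import java.util.Arrays;

    // Reorders the array, but detects the duplicate during the left-to-right
    // scan, so the first duplicate in the original scan order is reported.
    // Returns -1 as a sentinel when there is no duplicate (adjust if -1 can occur).
    static int firstDuplicateSortedRuns(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int v = a[i];
            // The set bits of i, highest first, are the sizes of the sorted
            // runs to the left of position i; the largest run starts at 0.
            int start = 0;
            for (int bit = Integer.highestOneBit(i | 1); bit > 0; bit >>= 1) {
                if ((i & bit) != 0) {
                    if (Arrays.binarySearch(a, start, start + bit, v) >= 0) {
                        return v;     // v occurs among the first i elements
                    }
                    start += bit;
                }
            }
            // Restore the invariant for index i + 1: the run ending there has
            // length equal to the lowest set bit of i + 1. Sorting it is where
            // a true in-place merge of the trailing runs would go.
            int len = (i + 1) & -(i + 1);
            if (len > 1) {
                Arrays.sort(a, i + 1 - len, i + 1);
            }
        }
        return -1;                    // no duplicate
    }

Each element is binary-searched in at most log N runs, and a run of length L is rebuilt once per 2L elements, so even with an O(L log L) sort in place of the merge the total stays within the O(N log^2 N) bound claimed above.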
Linear hashing as defined by Wikipedia has the advantage that resizing occurs incrementally, because buckets are split one by one in round-robin fashion, retaining constant amortized time complexity for insertion with resize. The idea of the hint, therefore, is to iterate over the array, reusing the elements already iterated over as storage for linear hashing.
While I am far from an expert on linear hashing, I don't see any way to fit the hash table in the array. Granted, to store n elements with linear hashing, you might get by with using n buckets. However, the number of elements in a bucket being unbounded, you'd need something like a linked list to implement each bucket, which costs an additional O(n) memory for pointers.
As such, this algorithm does not yield a better asymptotic space complexity than an ordinary HashSet. It does reduce memory consumption by a constant factor, though. Its time complexity is on par with the ordinary HashSet.
I would ask the interviewer(s) why they don't want you using "other data structures" when there is clearly a built-in structure designed for this purpose: HashSet. From the Java documentation on the Collections Framework, the stated benefits include:

1. Reduces programming effort
2. Increases program speed and quality
3. Allows interoperability among unrelated APIs
4. Fosters software reuse
Addendum
Most of the comments below argue that this is merely an exercise to determine the skills of the programmer. My counterargument to this is simple:
This "interview" is for a Java programming position. Java, being an object-oriented language, has the ability to perform tasks such as these without needing to design a process from scratch (like in C and various other low level languages). In addition, Java is not the best choice when space complexity is a concern. That said, read entry one in my list above again.
Here is an O(n) average-time algorithm:
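One way such an algorithm can look (a sketch under my added assumption that every value lies in [0, a.length)): move each value to its "home" index by swapping, and treat a collision at an occupied home slot as a duplicate. Note this reorders the array, so it reports a duplicate value, not necessarily the first one in the original scan order:

    // Assumes all values are in [0, a.length). Each swap permanently fills a
    // home slot (a[v] == v), so there are at most n swaps: O(n) overall.
    static int firstDuplicateBySwapping(int[] a) {
        for (int i = 0; i < a.length; i++) {
            while (a[i] != i) {
                if (a[a[i]] == a[i]) {
                    return a[i];      // home slot already holds an equal value
                }
                int t = a[a[i]];      // swap a[i] into its home slot
                a[a[i]] = a[i];
                a[i] = t;
            }
        }
        return -1;                    // no duplicate: the array was a permutation
    }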
Here are the test cases:
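A few example cases for the sketch above (my examples; expected output in comments, with the method assumed to be in scope):

    public static void main(String[] args) {
        System.out.println(firstDuplicateBySwapping(new int[] {0, 1, 2, 1}));  // 1
        System.out.println(firstDuplicateBySwapping(new int[] {3, 3, 0, 1}));  // 3
        System.out.println(firstDuplicateBySwapping(new int[] {2, 0, 1}));     // -1 (no duplicate)
    }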
I believe this is the "linear hashing" solution that your interviewers were looking for. We first need to assume two additional constraints:

1. All values in the array are positive (greater than zero).
2. Every value is less than the length of the array, so it can be used as an index.
With these additional constraints we can solve the problem using less time and space.
Ok, let's get to the code:
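A minimal sketch of what this can look like (the method shape and the -1 "no duplicate" return are my choices; the two constraints above are assumed):

    // Uses the array itself as the "seen" set: visiting value v flips the
    // sign at index v. A negative entry at index v means v was seen before.
    static int firstDuplicate(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int value = Math.abs(a[i]);   // the entry may already have been negated
            if (a[value] < 0) {
                return value;             // seen before: first duplicate in scan order
            }
            a[value] = -a[value];         // mark value as seen
        }
        return -1;                        // no duplicate
    }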
What I am doing here is using the array itself to store some additional information. As I iterate through the array, each time I come across a value, I will use that value as an index. At this index I will check the value. If the value is negative, I know that I have been here before (because of the all positive constraint). Therefore I have found my first duplicate, and can exit. Otherwise, I will negate the value at that index.
Pseudo code:
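Along the lines of the description above (n is the array length):

    for i = 0 .. n-1:
        value = abs(A[i])
        if A[value] < 0: return value    // visited before -> first duplicate
        A[value] = -A[value]             // mark value as visited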
Merge sort: worst case O(n log n)
Binary search: worst case O(log n)
n binary searches: worst case O(n log n)
Total: O(n log n)
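A sketch of this budget in code (assuming it is acceptable to reorder the array; sorting destroys the original order, so this reports a duplicate value rather than the first occurrence in scan order):

    import java.util.Arrays;

    // Sort, then look for equal neighbours. Arrays.sort on int[] is a
    // dual-pivot quicksort; substitute a hand-written merge sort for the
    // guaranteed O(n log n) worst case budgeted above.
    static int duplicateBySorting(int[] a) {
        Arrays.sort(a);
        for (int i = 1; i < a.length; i++) {
            if (a[i] == a[i - 1]) {
                return a[i];          // smallest duplicated value
            }
        }
        return -1;                    // no duplicate
    }

After sorting, equal elements sit next to each other, so a single linear scan suffices; the n binary searches in the budget above are an upper bound rather than a necessity.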