Finding the first duplicate in an int array, java

2020-05-16 04:20发布

Here is a common interview question i came across, however i failed to improve it in the way it demands.

assume we have an int array int[] A, we want to find the first duplicate entry. 
  1. almost everyone can think of using a HashSet, and add to it while parsing.this will lead to O(n) time and O(n) space. After this I was asked to solve it without other data structures. I said the dumbest idea will be comparing each one in O(n^2) time. And then I was asked to improve the O(n^2) time.

  2. And to improve it, i thought of using a fixed size array(assuming the maximum number is n), boolean[] b = new boolean[n]; however I was not allowed to use this method.

  3. Then I thought of using an int variable, using bit manipulation, if the maximum number is smaller than 32, then for n we can push 1 to n bits left and | to a checker, then & the checker to the next entry in the array to check if it's > 0. e.g.:

    int c = A[i];
    if(check & (1 << c) > 0) return false;
    check |= 1 << c;
    

however this is not allowed either.

So there was an hint that I can use the array itself as hashset/hashtable, and "linear hashing"?

any help ? thanks

9条回答
▲ chillily
2楼-- · 2020-05-16 04:49

This does not use linear hashing, but works faster than O(N2):

  1. Choose some small number C and use a brute-force algorithm to find first duplicate for the first C elements of the array. Clear first C elements if nothing found yet.
  2. Perform the remaining steps with first N elements empty. Initially, N=C. After each iteration, N is doubled.
  3. Sequentially add numbers from indexes N+1 .. 3*N/2 to the hash table in first N array elements. Use open addressing. After all N/2 elements moved, hash load factor should be 1/2. Clear space, occupied by N/2 elements we just moved. For the next N/4 elements, search each of them in the hash table(s), constructed so far, then hash them to the space which is always twice as much as number of elements. Continue this until N-C array elements are hashed. Search the remaining C elements in the hash tables and compare them to each other.
  4. Now we have N array elements without duplicates, occupying 2*N space. Rehash them in-place.
  5. Sequentially search all other elements of the array in this hash table. Then clear these 2*N elements, set N=2*N, and continue with step 3.

Steps 3..5 may be simplified. Just hash elements N+1 .. 3*N/2 and search all other elements of the array in this hash table. Then do the same for elements 3*N/2+1 .. 2*N. This is twice as slow as the original algorithm, but still O(N log N) on average.

Other alternative is to use first N empty elements to construct a binary search tree for elements N+1 .. 3*N/2 and search all other elements of the array in this tree. Then do the same for elements 3*N/2+1 .. 2*N. (This works only if array is small enough and its elements may be indexed by integer values).


Algorithm, described above, is probabilistic and works in O(N log N) time on average. Its worst case complexity is O(N2). The alternative with a binary search tree may have O(N log2 N) worst case complexity if tree is self-balancing. But this is complicated. It is possible to do the task in O(N log2 N) worst case time with simpler algorithm.

This algorithm sequentially iterates through the array and keeps the following invariant: largest possible sub-array with size that is a power-of-two, that fits to the left of current position, starts at index 0 and is sorted; next such sub-array follows it and is also sorted; etc. In other words, binary representation of current index describes how many sorted sub-arrays are preceding it. For example, for index 87 (1010111) we have a single element at index 86, a sorted pair at index 84, a sorted sub-array of 4 elements at 80, a sorted sub-array of 16 elements at 64, and a sorted sub-array of 64 elements at the beginning of the array.

  1. Iterate through the array
  2. Search current element in all preceding sub-arrays using binary search.
  3. Sort current element together with those preceding sub-arrays, that correspond to trailing "ones" in the binary representation of current index. For example, for index 87 (1010111), we need to sort current element together with 3 sub-arrays (1+1+2+4=8 elements). This step allows adding current element to sub-arrays while keeping algorithm's invariant.
  4. Continue with next iteration of step 1.
查看更多
甜甜的少女心
3楼-- · 2020-05-16 04:55

Linear hashing as defined by Wikipedia has the advantage that resizing occurs incrementally, because buckets are split one-by-one in round-robin fashion, retaining constant amortized time complexity for insertion with resize. Their idea therefore is to iterate over the array, reusing the elements already iterated over as storage for linear hashing.

While I am far from an expert on linear hashing, I don't see any way to fit the hash table in the array. Granted, to store n elements with linear hashing, you might get by with using n buckets. However, the number of elements in a bucket being unbounded, you'd need something like a linked list to implement each bucket, which costs an additional O(n) memory for pointers.

As such, this algorithm does not yield a better asymptotic space complexity than an ordinary HashSet. It does reduce memory consumption by a constant factor, though.

Its time complexity is on par with the ordinary HashSet.

Edit: It appears to me this answer is being ignored (no votes, no comments). Is it not useful? Please comment so I know what to improve.

查看更多
我想做一个坏孩纸
4楼-- · 2020-05-16 04:58

I would ask the interviewer(s) why they don't want you using "other data structures" when there is clearly a built-in structure designed for this purpose - HashSet.

  1. It is O(n). You probably won't do much better than this using other methods, unless you do something really clever and get it down to O(log n).
  2. This is Java - not C. There are readily available data structures to do this, painlessly, with almost no additional effort on the programmer's part.

From the Java Documentation on the Collections Framework:

The collections framework is a unified architecture for representing and manipulating collections, allowing them to be manipulated independently of the details of their representation. It reduces programming effort while increasing performance. It allows for interoperability among unrelated APIs, reduces effort in designing and learning new APIs, and fosters software reuse.

Addendum

Most of the comments below argue that this is merely an exercise - to determine the skills of the programmer. My counterargument to this is simple:

This "interview" is for a Java programming position. Java, being an object-oriented language, has the ability to perform tasks such as these without needing to design a process from scratch (like in C and various other low level languages). In addition, Java is not the best choice when space complexity is a concern. That said, read entry one in my list above again.

查看更多
再贱就再见
5楼-- · 2020-05-16 04:58

Here is O(n) Time on an Average algorithm

public static int firstRepeatingElement(int[] elements) {
    int index = -1;
    Set<Integer> set = new HashSet<Integer>();

    for (int i = elements.length - 1; i >=0; i--) {
        if (set.contains(elements[i])) {
            index = i;
        }
        set.add(elements[i]);
    }
    if (index != -1) {
        return elements[index];
    }
    throw new IllegalArgumentException("No repeating elements found");
}

Here is the test cases

@Test
public void firstRepeatingElementTest() {
    int [] elements = {1,2,5,7,5,3,10,2};
    int element = ArrayUtils.firstRepeatingElement(elements);
    assertThat(element, is(2));
}

@Test(expected=IllegalArgumentException.class)
public void firstRepeatingElementTestWithException() {
    int [] elements = {1,2,5,7,3,10};
    int element = ArrayUtils.firstRepeatingElement(elements);
    assertThat(element, is(2));
}
查看更多
老娘就宠你
6楼-- · 2020-05-16 04:58

I believe this is the "linear hashing" solution that your interviewers were looking for. We first need to assume two additional constraints:

  1. length of A is >= max value of A
  2. All values of A are positive

With these additional constraints we can solve the problem using less time and space.

Ok, let's get to the code:

int findFirstDuplicateEntry(int[] A) {
    for (int i=0; i<A.length; i++) {
        if (A[Math.abs(A[i])-1]<0)
            return Math.abs(A[i]);
        else {
            A[Math.abs(A[i])-1] = -A[Math.abs(A[i])-1];
        }
    }
    return -1;
}

What I am doing here is using the array itself to store some additional information. As I iterate through the array, each time I come across a value, I will use that value as an index. At this index I will check the value. If the value is negative, I know that I have been here before (because of the all positive constraint). Therefore I have found my first duplicate, and can exit. Otherwise, I will negate the value at that index.

查看更多
小情绪 Triste *
7楼-- · 2020-05-16 05:04

Pseudo code:

res = -1;
startArray = [...];
sortedArray = mergeSort(startArray);
for i = 1 to n
     x = bynary_search(sortedArray, startArray[i]); //array, element
     if ((sorted_array[x] == sortedArray[x-1])    ||   (sorted_array[x] == sortedArray[x+1]))
           res = i;
           break;
if (res != -1)
     print('First duplicate is ',startArray[res]);
else
     print('There are no duplicates');

Merge sort worst case O(n log n)

Binary search worst case O(log n)

n times Binary search worst case O(n log n)

Total O(n log n)

查看更多
登录 后发表回答