Algo to find duplicates in a very large array

Published 2020-07-20 03:44

Question:

During a technical interview I got this question. I know how to solve this problem using a HashSet (in Java).

But I could not understand why the interviewer stressed the phrase "a very large array, let's say 10 million elements in the given array".

Do I need to change the approach? If not, what would be the efficient way to achieve this?

PS: The algorithm or implementation is language-agnostic.

Thank you.

Answer 1:

One thing to keep in mind is that O-notation doesn't necessarily tell you which algorithm is fastest. If one algorithm is O(n log n) and another is O(n²), then there is some value M such that the first algorithm is faster for all n > M. But M could be much larger than the amount of data you'll ever have to deal with.

The reason I'm bringing this up is that I think a HashSet is probably still the best answer, although I'd have to profile it to find out for sure. Assuming that you aren't allowed to set up a hash table with 10 million buckets, you may still be able to set up a reasonably sized table. Say you can create a HashSet with a table size of 100,000. The buckets will then be sets of objects. If n is the size of the array, the average bucket size will be n / 100,000. So to see whether an element is already in the HashSet, and to add it if not, takes a fixed amount of time to compute the hash value, plus O(n) to search the elements in the bucket if they're stored in a linear list (*). Technically, this means that the algorithm to find all duplicates is O(n²). But since one of the n's in n² is for a linear list that is so much smaller than the array (by a factor of 100,000), it seems likely to me that it will still take much less time than an O(n log n) sort for 10 million items. The value of M, the point at which the O(n log n) sort becomes faster, is likely to be much, much larger than that. (I'm just guessing, though; to find out for certain would require some profiling.)

I'd tend to lean against using a sort anyway, because if all you need to do is find duplicates, a sort does more work than you need. You shouldn't need to put the elements in order just to find duplicates. That suggests to me that a sort is not likely to be the best answer.

(*) Note that in Java 8, the elements in a large bucket are kept in some kind of search tree, probably a red-black tree, instead of a linear list. In that case each lookup is O(log n), so the whole algorithm is O(n log n), and still probably a lot faster than a sort.
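The hash-set approach described above can be sketched in a few lines (shown in Python here since the question is language-agnostic; a Java `HashSet<Integer>` would be used the same way):

```python
def find_duplicates(arr):
    """Return the set of values that appear more than once in arr."""
    seen = set()         # hash-based membership test, O(1) on average
    duplicates = set()
    for value in arr:
        if value in seen:
            duplicates.add(value)
        else:
            seen.add(value)
    return duplicates

print(sorted(find_duplicates([3, 1, 3, 5, 4, 3, 4, 2, 1])))  # [1, 3, 4]
```

Each element is looked up and inserted once, so the expected running time is linear in the array size, at the cost of O(n) extra memory for the set.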



Answer 2:

There were some key things the interviewer expected you to ask back, like: if you cannot load the whole array into memory, then how much can you load? These are the steps to solve the problem:

  1. Divide the array according to how much memory is available to you.
  2. Let's say you can load 1M numbers at a time, so you split the data into k parts. Load the first 1M numbers and build a Min Heap of them. Then repeatedly remove the top and re-heapify, writing the numbers out in sorted order.
  3. Repeat the same for the other parts of the data.
  4. Now you have k sorted splits.
  5. Fetch the first number from each of the k splits and again build a Min Heap.
  6. Remove the top of the Min Heap and store the value in a temporary variable, to compare with the next number for finding duplicates.
  7. Fetch the next number from the same split whose number was removed last time. Put that number on top of the Min Heap and heapify.
  8. Now the top of the Min Heap is your next sorted number; compare it with the temporary variable to find duplicates. Update the temporary variable if the number is not a duplicate.
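The steps above can be sketched as follows. This is a simplified in-memory simulation: real external sorting would read and write the splits from disk, and the `chunk_size` here just stands in for "how much memory is available".

```python
import heapq

def duplicates_external(arr, chunk_size):
    """Simulate the external approach: sort fixed-size chunks (steps 1-4),
    then k-way merge them with a min-heap while watching for repeats (5-8)."""
    # Steps 1-4: split into k parts and sort each one (sorted() stands in
    # for the build-heap / remove-top loop described above).
    splits = [sorted(arr[i:i + chunk_size])
              for i in range(0, len(arr), chunk_size)]

    # Step 5: seed a min-heap with the first number of every split.
    heap = [(split[0], k, 0) for k, split in enumerate(splits)]
    heapq.heapify(heap)

    duplicates = []
    previous = None  # the "temporary variable" of steps 6 and 8
    while heap:
        value, k, idx = heapq.heappop(heap)   # step 6: next smallest overall
        if value == previous:
            duplicates.append(value)          # step 8: same as its predecessor
        else:
            previous = value
        # Step 7: refill the heap from the split the popped value came from.
        if idx + 1 < len(splits[k]):
            heapq.heappush(heap, (splits[k][idx + 1], k, idx + 1))
    return duplicates

print(duplicates_external([3, 1, 3, 5, 4, 3, 4, 2, 1], chunk_size=3))
# [1, 3, 3, 4] -- one entry per extra occurrence
```

Note that this reports every occurrence beyond the first; deduplicate the result if you only want each repeated value once.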


Answer 3:

You can do it in O(n log n):

  • Sort the array
  • Find the duplicates (they will be next to each other) in one pass.

I think that is what the interviewer wanted to hear.

If you used a merge sort or a quick sort, finding the duplicates could be done during the merge step at no extra cost. These sorts can be implemented "in place", or "by parts" if the array is too large to fit in memory.
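A minimal sketch of the sort-then-scan idea (Python's built-in sort stands in for the merge/quick sort):

```python
def duplicates_by_sorting(arr):
    """Sort in O(n log n), then report each value equal to its predecessor."""
    arr = sorted(arr)  # use arr.sort() instead for a true in-place version
    return [arr[i] for i in range(1, len(arr)) if arr[i] == arr[i - 1]]

print(duplicates_by_sorting([3, 1, 3, 5, 4, 3, 4, 2, 1]))  # [1, 3, 3, 4]
```

The scan itself is a single O(n) pass, so the sort dominates the running time.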



Answer 4:

In short, you have to find all the unique elements of the array.

So you can create an object and add each element of the array as a property of that object.

function uniqueArray(arr){
    var obj = {},                    // property names act as a hash set
        uniqueElementArray = [],
        length = arr.length - 1;     // start at the last valid index
    while(length >= 0){
        obj[arr[length]] = true;
        length--;
    }
    for(var i in obj){               // note: the keys come back as strings
        uniqueElementArray.push(i);
    }
    return uniqueElementArray;
}


Answer 5:

So assuming that the very large array fits into memory but leaves little additional memory to play with (i.e. no room for another data structure of similar size to the array), then with some assumptions you can do this in O(n) time, in place, with no additional memory.

Assumption 1: every value in the array satisfies 0 <= value < array length (10,000,000)
Assumption 2: you can modify the array

>>> arr = [3, 1, 3, 5, 4, 3, 4, 2, 1]
>>> for i, v in enumerate(arr):
...     while arr[v] != arr[i]:
...         arr[i], arr[v] = arr[v], arr[i]
...         v = arr[i]
...
>>> arr
[3, 1, 2, 3, 4, 5, 4, 3, 1]

Duplicates are in positions where the value doesn't equal the index.

>>> [v for i, v in enumerate(arr) if i != v]
[3, 4, 3, 1]