Let's assume that we have a very large file which contains billions of integers , and we want to find k
largest elements of these values ,
the tricky part is that k
itself is very large too , which means we cannot keep k
elements in the memory (for example we have a file with 100 billon elements and we want to find 10 billion largest elements)
How can we do this in O(n)
?
What I thought :
We start reading the file and we check it with another file which keeps the k
largest elements (sorted in increasing order) , if the read element is larger than the first line of the second file we delete the first line and we insert it into the second file , the time complexity would be of O(NlogK)
(if we have random access to that file , otherwise it would be 'O(Nk)'
Any idea to do this in O(n)
, I guess if we have external version of Selection algorithm
(the partitioning algorithm in quicksort) we would be able to do this in O(n)
but I couldn't find it anywhere
PS: My definition of K is different. It is a smallish number say 2 or 100 or 1000. Here m corresponds to OPS's definition of k. Sorry about this.
Depends on how many reads you can do of the original data and how much more space you have. This approach assumes you have extra space equivalent to the original data.
Step 1: Pick K random numbers across the whole data
Step 2: Sort the K numbers (assume index are from 1 to K)
Step 3: Create K+1 separate files and name them 0 to K
Step 4: For every element in the data, if it is between ith and i+th element put it in ith file.
Step 5: Based on the size of each file, choose the file that is going to have mth number.
Step 6: Repeat everything with the new file and new m (new_m = m - sum_of_size_of_all_lower_files)
Regarding the last step, if K=2, m=1000 and size of file 0 is 800, 1 is 900 and 2 is 200, new_m = m-800 = 200 and work through file 1 iteratively.
You can do this pretty easily with a standard merge type algorithm.
Say you have 100 billion numbers and you want the top 10 billion. We'll say you can hold 1 billion numbers in memory at any time.
So you make a pass:
while not end of input
read 1 billion numbers
sort them in descending order
save position of output file
write sorted numbers to output file
You then have a file that contains 100 blocks of 1 billion numbers each. Each block is sorted in descending order.
Now create a max heap. Add the first number of each block to the heap. You'll also have to add the block number or the number's position in the file so that you can read the next number.
Then:
while num_selected < 10 billion
selected = heap.remove()
++num_selected
write selected to output
read next number from the selected block and place on heap
There's a small bit of complexity involved, keeping track of which block the number came from, but it's not too bad.
The max heap never contains more than 100 items (basically, one item per block), so memory isn't an issue in the second pass. With a bit of work, you can avoid a lot of reads by creating a smallish buffer for each block so that you don't incur the cost of a disk read for every number that's selected.
It's basically just a disk merge sort, but with an early out.
Complexity of the first pass is b * (m log m)
, where b is the number of blocks and m is the number of items in a block. N, the total number of items in the file, is equal to b * m
. Complexity of the second pass is k log b
, where k
is the number of items to select and b is the number of blocks.
you can do this by maintaining a min heap of max size k
.
Every time a new number arrives - check if the heap is smaller then k
, if it is - add it.
If it is not - check if the minimum is smaller then the new element, and if it is, pop it out and insert the new element instead.
When you are done - you have a heap containing k largest elements.
This solution is O(nlogk) complexity, where n is the number of elements and k is the number of elements you need .
- It can be done also in O(n) using selection algorithm - store all the elements, and then find the
(k+1)th
largest element, and return everything bigger then it. But it is harder to implement and for reasonable size input - might not be better. Also, if stream contains duplicates, more processing is needed
If all values are distinct or we can ignore doublets and we have 32bit integers, I would simply use one bit per possible value (needs 2^32 bits = 2^29 bytes = 512 megabytes (should fit in your RAM)).
- Initialize the 512MB with 0
- While linear reading the file ( O(n) ) set the corresponding bit for each read value.
- At the end look for the first k set bits to get the k largest values. ( O(2^32) bit tests)
If the values are not distinct and you want to know how often the values occur, you can add a 4th step where you read the file again and count the number of occurrences of the values found in the first 3 steps. That is still O(n).
Use randomised selection to find the kth largest element in the file. You can do this in linearly many passes over the input as long as it's not too ridiculously many times larger than memory. Then just dump out everything that's at least as large as it.