You have an array of size n and a constant k (any constant). You can assume the array is of int type (although it could be of any type).

Describe an algorithm that finds whether there is an element that repeats itself at least n/k times, and if there is, returns one. Do so in linear time, O(n).

The catch: do this (an algorithm, or even pseudocode) using constant memory and running over the array only twice.
This is my implementation of the Jerky algorithm described above:
And the output is:
Repeating more than n/4 : 1
Repeating more than n/4 : 2
Repeating more than n/4 : 3
There are two common (theoretical) approaches to this problem in O(n).

I) The first idea is the simplest.
Step 1) While there are more than k distinct elements, select k distinct elements and erase them all.
Step 2) Test each of the remaining (at most k) distinct elements for its frequency.
Proof of correctness: Note that the while loop in step 1 will be executed at most n/k - 1 times. Suppose there is an element that repeats itself at least n/k times. In the worst case it could be chosen in all n/k - 1 iterations and it would still be in the final array afterwards; once tested, it will be found.
Implementation: Step 1 can be implemented by keeping an associative array (mapping a key to a value) of size k-1 (constant). Sweep the array from left to right: if you find an element that is already in the map, increase its counter by 1; if the element is not in the map and the map is not full yet (fewer than k-1 elements), add the new element with an initial count of 1; if the map is full, subtract 1 from the counter of every element, and remove any element whose counter reaches 0. In the end, the elements in this map are exactly the remaining elements you need to test. If the map becomes empty during the last iteration, you need to test all the elements before erasing them, to cover the case where the frequency is exactly n/k.
Complexity: Even taking the worst-case view of this map, O(n * k) = O(n), since k is constant.
Step 2 can be implemented by counting the frequency of each of the (at most) k-1 remaining elements. Complexity: O(k * n) = O(n).
Overall complexity: O(n) + O(n) = O(n). (There is a small detail that differs from the implementation, by one element: in the pseudocode we also want to cover the case where the frequency is exactly n/k; otherwise one more iteration would be possible when there are exactly k distinct elements, not necessarily more than k.)
II) The second algorithm uses the linear-time selection algorithm (http://en.wikipedia.org/wiki/Selection_algorithm) and the partition algorithm, which also runs in linear time. Using them, you break your array into k-1 buckets, in O(n), with the invariant that any element in the i-th bucket is less than or equal to any element in the j-th bucket for j > i. Note that the elements are not sorted inside each bucket.
Now use the fact that each bucket has n/(k-1) elements and you are looking for an element that repeats itself at least n/k times, where n/k > n/(2*(k-1)) (this holds for k > 2). This suffices to apply the majority theorem, which states that if an element is a majority (more frequent than the number of elements divided by 2), then it is also the median of the array. You can get each median, again, with the selection algorithm.
So you just test all the bucket medians and all the partition pivots; the pivots must be tested because they may split equal values across two different buckets. There are (k-1) + k such values, so the complexity is O((2*k-1)*n) = O(n).
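A sketch of approach II using std::nth_element (expected linear time per call; a worst-case linear bound would need median-of-medians) might look like this. The function name and sentinel value are illustrative:

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Sketch of approach II: partition a copy of the array into k-1 buckets
// with std::nth_element, then test every bucket boundary (pivot) and
// every bucket median as a candidate for repeating at least n/k times.
int findRepeating(const std::vector<int>& input, int k) {
    std::vector<int> a = input;
    int n = a.size(), b = k - 1;                   // b buckets
    std::vector<int> cut(b + 1);                   // bucket start indices
    for (int i = 0; i <= b; ++i) cut[i] = (long long)i * n / b;
    std::vector<int> candidates;
    // Fix boundaries left to right; everything left of a fixed boundary
    // is <= it, so each later call only needs the remaining suffix.
    for (int i = 1; i < b; ++i) {
        std::nth_element(a.begin() + cut[i - 1], a.begin() + cut[i], a.end());
        candidates.push_back(a[cut[i]]);           // pivot candidate
    }
    for (int i = 0; i < b; ++i) {                  // median of each bucket
        int mid = cut[i] + (cut[i + 1] - cut[i]) / 2;
        std::nth_element(a.begin() + cut[i], a.begin() + mid,
                         a.begin() + cut[i + 1]);
        candidates.push_back(a[mid]);
    }
    for (int c : candidates)                       // verify real frequency
        if ((long long)std::count(input.begin(), input.end(), c) * k >= n)
            return c;
    return INT_MIN;  // sentinel: no element repeats at least n/k times
}
```

Each nth_element call touches only the suffix to the right of the previous boundary, so the already-fixed buckets are never disturbed.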
I don't know if you are restricted on which additional data structures you can use.
What about creating a hash map with an element <--> count mapping? Insertion and lookup are O(1) on average. For each element, look it up in the hash map: if it does not exist, insert it with count 1; if it exists, increment its count and check whether it has reached n/k. The whole thing stays O(n).
EDIT:
I forgot the constant memory restriction. Is preallocating a hash map with N entries permitted?
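Setting the memory restriction aside, that single-pass counting idea could be sketched as follows (the function name is illustrative):

```cpp
#include <unordered_map>
#include <vector>

// One pass: count every element in a hash map (O(1) average per operation),
// then report the elements whose count reached at least n/k. Note this
// uses O(n) extra memory, so it ignores the constant-memory catch.
std::vector<int> countingApproach(const std::vector<int>& a, int k) {
    std::unordered_map<int, int> count;
    for (int x : a) ++count[x];
    std::vector<int> result;
    for (const auto& kv : count)
        if ((long)kv.second * k >= (long)a.size())
            result.push_back(kv.first);
    return result;
}
```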
Create a temporary array temp[] of size (k-1) to store elements and their counts (the output elements are going to be among these k-1 elements).
Traverse the input array and update temp[] (add/remove an element or increase/decrease a count) for every traversed element. temp[] stores the (k-1) potential candidates at every step. This step takes O(nk) time.
The main step is step 2: how do we maintain (k-1) potential candidates at every point? The steps used there are like the famous game Tetris. We treat each number as a piece that falls down into our temporary array temp[], and our task is to try to keep the same number stacked in the same column (incrementing its count in the temporary array).
Finally, we have at most k-1 numbers in temp[]. In the example walk-through (n = 9, k = 4), the elements in temp[] are {3, 1, 2}. Note that the counts in temp[] are useless now; the counts were needed only in step 2. Now we need to check whether the actual count of each element in temp[] is more than n/k (9/4) or not. The elements 3 and 2 have counts more than 9/4, so we print 3 and 2.
Note that the algorithm doesn't miss any output element. There are two possibilities: the occurrences of an element are together, or they are spread across the array. If the occurrences are together, its count gets high and won't drop to 0. If they are spread out, the element will come back into temp[] again. A C++ implementation of the above algorithm follows.
I'm not 100% sure, but it sounds like you want to solve the Britney Spears problem: finding an item that makes up a certain fraction of a sample using constant memory.
Here is a statement of the problem in English, with a sketch of the solution:
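The linked sketch is not reproduced here. As a stand-in, here is the k = 2 core of that solution, the Boyer-Moore majority vote, which finds the only possible candidate with a single counter in O(1) memory and then verifies it; the generalization to n/k keeps k-1 candidate/counter pairs instead of one:

```cpp
#include <vector>

// Boyer-Moore majority vote: pass 1 finds the only possible majority
// candidate using O(1) memory, pass 2 verifies it actually occurs more
// than n/2 times. Returns true and sets out if a majority exists.
bool majority(const std::vector<int>& a, int& out) {
    int candidate = 0, votes = 0;
    for (int x : a) {                  // pass 1: pair off distinct values
        if (votes == 0) { candidate = x; votes = 1; }
        else if (x == candidate) ++votes;
        else --votes;
    }
    long count = 0;
    for (int x : a) if (x == candidate) ++count;   // pass 2: verify
    if (count * 2 > (long)a.size()) { out = candidate; return true; }
    return false;
}
```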
A simple O(n) algorithm would be to keep a hash map from each number found to the number of its occurrences. Using a hash map is important for maintaining O(n). Then a final pass over the map reveals the answers. This pass is also O(n), since in the worst case every element appeared only once and the map is the same size as the original array.