I have a list of integer array, where each array have some numbers sorted. Here I want to find the most commonly occurring combination of sequence of integers based on all the array. For example if the list of array is as follows
A1 - 1 2 3 5 7 8
A2 - 2 3 5 6 7
A3 - 3 5 7 9
A4 - 1 2 3 7 9
A5 - 3 5 7 10
Here
{3,5,7} - {A1,A3,A5}
{2,3} - {A1,A2,A4}
So we can take {3,5,7} or {2,3} as the most commonly occurring combinations.
Now the algorithm i used is as following
Find intersection of a set with all others. And store the resulting set. Increment a resulting set occurrence in case if its already exist. for eg : Find intersections of all the below
A1 intersection A2
A1 intersection A3
A1 intersection A4
A1 intersection A5
A2 intersection A3
A2 intersection A4
A2 intersection A5
A3 intersection A4
A3 intersection A5
A4 intersection A5
Here A1 intersection A3 is same as A3 intersection A5 , hence set-{3,5,7} occurrence can be set as 2. Similarly each resulting set occurrence can be determined.
But this algorithm demands O(n^2) complexity. Assuming each set is sorted , am pretty sure that we can find a better algorithm with O(n) complexity which i am not able to pen down. Can anyone suggest a O(n) algorithm for the same.
This example in Haskell does not scan intersections. Rather, it lists the sub-sequences for each list and aggregates them into an array indexed by sub-sequence. To look up the most commonly occurring sub-sequence, simply show the longest element in the array. The output is filtered to show sub-sequences greater than length 1. Output is a list of tuples showing the sub-sequence and indexes of the lists where the sub-sequence appears:
*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[4,2,1,0]),([5,7],[4,2,0]),([3,5,7],[4,2,0]),([2,3],[3,1,0]),([7,9],[3,2]),([2,3,5],[1,0]),([1,2,3],[3,0]),([1,2],[3,0])]
If you have a sequence of length n, then its prefix is of length n-1 and occurs at least as often - a degenerate case is the most common character, which is a sequence of length 1 that occurs at least as often as any longer sequence. Do you have a minimum suffix length you are interested in?
Regardless of this, one idea is to concatenate all of the sequences, separating them by different integers which appear nowhere else, and then compute the http://en.wikipedia.org/wiki/Suffix_array in linear time. One pass through the suffix array should allow you to find the most common subsequence of any given length - and it shouldn't cross the gap between two different arrays, because each such sequence of length n is unique, because the characters separating the arrays are unique. (see also the http://en.wikipedia.org/wiki/LCP_array)