Most commonly occurring combination

2019-07-17 02:48发布

I have a list of integer array, where each array have some numbers sorted. Here I want to find the most commonly occurring combination of sequence of integers based on all the array. For example if the list of array is as follows

A1 - 1 2 3 5 7 8
A2 - 2 3 5 6 7
A3 - 3 5 7 9
A4 - 1 2 3 7 9
A5 - 3 5 7 10

Here

{3,5,7} - {A1,A3,A5}
{2,3}   - {A1,A2,A4}

So we can take {3,5,7} or {2,3} as the most commonly occurring combinations.

Now the algorithm i used is as following

Find intersection of a set with all others. And store the resulting set. Increment a resulting set occurrence in case if its already exist. for eg : Find intersections of all the below

A1 intersection A2 
A1 intersection A3
A1 intersection A4
A1 intersection A5
A2 intersection A3  
A2 intersection A4 
A2 intersection A5  
A3 intersection A4
A3 intersection A5  
A4 intersection A5  

Here A1 intersection A3 is same as A3 intersection A5 , hence set-{3,5,7} occurrence can be set as 2. Similarly each resulting set occurrence can be determined.

But this algorithm demands O(n^2) complexity. Assuming each set is sorted , am pretty sure that we can find a better algorithm with O(n) complexity which i am not able to pen down. Can anyone suggest a O(n) algorithm for the same.

标签: algorithm set
2条回答
闹够了就滚
2楼-- · 2019-07-17 03:10

This example in Haskell does not scan intersections. Rather, it lists the sub-sequences for each list and aggregates them into an array indexed by sub-sequence. To look up the most commonly occurring sub-sequence, simply show the longest element in the array. The output is filtered to show sub-sequences greater than length 1. Output is a list of tuples showing the sub-sequence and indexes of the lists where the sub-sequence appears:

*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[4,2,1,0]),([5,7],[4,2,0]),([3,5,7],[4,2,0]),([2,3],[3,1,0]),([7,9],[3,2]),([2,3,5],[1,0]),([1,2,3],[3,0]),([1,2],[3,0])]

import Data.List
import qualified Data.Map as M
import Data.Function (on)

sInt xs = concat $ zipWith (\x y -> zip (subs x) (repeat y)) xs [0..]
  where subs = filter (not . null) . concatMap inits . tails

res xs = foldl' (\x y -> M.insertWith (++) (fst y) [snd y] x) M.empty (sInt xs)

combFreq xs = reverse $ sortBy (compare `on` (length . snd)) 
            . filter (not . null . drop 1 . snd)
            . filter (not . null . drop 1 . fst)
            . M.toList
            . res $ xs
查看更多
再贱就再见
3楼-- · 2019-07-17 03:26

If you have a sequence of length n, then its prefix is of length n-1 and occurs at least as often - a degenerate case is the most common character, which is a sequence of length 1 that occurs at least as often as any longer sequence. Do you have a minimum suffix length you are interested in?

Regardless of this, one idea is to concatenate all of the sequences, separating them by different integers which appear nowhere else, and then compute the http://en.wikipedia.org/wiki/Suffix_array in linear time. One pass through the suffix array should allow you to find the most common subsequence of any given length - and it shouldn't cross the gap between two different arrays, because each such sequence of length n is unique, because the characters separating the arrays are unique. (see also the http://en.wikipedia.org/wiki/LCP_array)

查看更多
登录 后发表回答