Finding patterns in list

2020-05-23 02:35发布

I am currently searching for a way to find patterns in a list of integers, but the method I am going to use would be applicable to strings and other lists with different elements of course. Now let me explain what I am looking for.

I want to find the longest repeating pattern in a list of integers. For example,

[1, 2, 3, 4, 1, 2, 3]
# This list would give 1, 2, 3

Overlapping patterns should be discarded. ( Not certain )

[1, 1, 1, 1, 1]
# Should give 1, 1  Not 1, 1, 1, 1

Here is what did not help me.

Finding patterns in a list (Did not understand the logic behind the first answer, very little explanation. And second answer solves the problem only if the pattern is known before solving.)

Finding integer pattern from a list (Pattern is given and number of occurrence is wanted. Different than my question.)

Longest common subsequence problem (Most people dealed with this problem however it is not close to mine. I need consecutive elements while searching for a pattern. However in this, seperate elements also counted as subsequences.)

Here what I tried.

def pattern(seq):
    n = len(seq)
    c = defaultdict(int) # Counts of each subsequence
    for i in xrange(n):
        for j in xrange(i + 1, min(n, n / 2 + i)): 
            # Used n / 2 because I figured if a pattern is being searched
            # It cant be longer that the half of the list.
            c[tuple(seq[i:j])] += 1      
    return c

As you see, It finds all the sublists and check for repeats. I found this approach a bit naive(and inefficient) and I am in need of a better way. Please help me. Thanks in advance.

Note1: The list is predetermined but because of my algorithms failure, I can only check some parts of the list before freezing the computer. So the pattern I am trying to find can very well be longer than the half of the search list, It can even be longer than the search list itself because I am searching only a part of the original list.If you present a better method than I am using, I can search a larger part of the original list so I will have a better chance at finding the pattern. (If there is one.)

Note2: Here is a part of the list if you want to test it yourself. It really seems like that there is a pattern, but I cannot be sure before I test it with a reliable code. Sample List

Note3: I approach this as a serious problem of data mining. And will try to learn if you make a long explanation. This feels like a much more important problem than LCS, however LCS is much more popular :D This method I am trying to find feels like the methods scientists use to find DNA patterns.

2条回答
兄弟一词,经得起流年.
2楼-- · 2020-05-23 02:42

The Code

Ignoring the "no overlapping" requirement, here's the code I used:

def pattern(seq):
        storage = {}
        for length in range(1,len(seq)/2+1):
                valid_strings = {}
                for start in range(0,len(seq)-length+1):
                        valid_strings[start] = tuple(seq[start:start+length])
                candidates = set(valid_strings.values())
                if len(candidates) != len(valid_strings.values()):
                        print "Pattern found for " + str(length)
                        storage = valid_strings
                else:
                        print "No pattern found for " + str(length)
                        return set(filter(lambda x: storage.values().count(x) > 1, storage.values()))
        return storage

Using that, I found 8 distinct patterns of length 303 in your dataset. The program ran pretty fast, too.

Pseudocode Version

define patterns(sequence):
    list_of_substrings = {}
    for each valid length:  ### i.e. lengths from 1 to half the list's length
        generate a dictionary my_dict of all sub-lists of size length
        if there are repeats:
            list_of_substrings = my_dict
        else:
            return all repeated values in list_of_substrings
    return list_of_substrings  #### returns {} when there are no patterns
查看更多
欢心
3楼-- · 2020-05-23 02:47

I have an answer. It works. (without overlapping) but it is for python3

      def get_pattern(seq):
        seq2=seq
        outs={}
        l=0
        r=0
        c=None
        for end in range(len(seq2)+1):
          for start in range(end):
            word=chr(1).join(seq2[start:end])
            if not word in outs:
              outs[word]=1
            else:
              outs[word]+=1
        for item in outs:
          if outs[item]>r or (len(item)>l and outs[item]>1):
            l=len(item)
            r=outs[item]
            c=item
        return c.split(chr(1))
查看更多
登录 后发表回答