-->

A more complex version of “How can I tell if a str

2020-08-12 11:57发布

问题:

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.

For example, find all the repetitive motifs in

string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

Here the repetitive motifs: 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

So, the output should be something like this:

output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}

This example comes from a typical biological phenomena termed microsatellite which is present into the DNA.

UPDATE 1: Asterisks were removed from the string variable. It was a mistake.

UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.

回答1:

you can use a recursion function as following :

Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)

import re
def finder(st,past_ind=0,result=[]):
   m=re.search(r'(.+)\1+',st)
   if m:
      i,j=m.span()
      sub=st[i:j]
      ind = (sub+sub).find(sub, 1)
      sub=sub[:ind]
      if len(sub)>1:
        result.append([sub,(i+past_ind+1,j+past_ind+1)])
      past_ind+=j
      return finder(st[j:],past_ind)
   else:
      return result



s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)

result:

[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]

answer to previous question for following string :

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

You can use both answers from mentioned question and some extra recipes :

First you can split the string with ** then create a new list contain the repeated strings with r'(.+)\1+' regex :

So the result will be :

>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']

Note about 'ACGTACGT' that missed the A at the end!

Then you can use principal_period's function to get the repeated sub strings :

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

So you will have the repeated strings in l and main strings in sub :

>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']

Then you need a the region that you can do it with span method :

>>> for t in sub:
...    regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]

And at last you can zip the 3 list regon,sub,l and use a dict comprehension to create the expected result :

>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

The main code :

>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

>>> for t in sub:
...    regons.append(re.search(t,s).span())
... 
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}


回答2:

If you can bound your query then you can use a single pass of the string. The number of comparisons will be length of string * (max_length - min_length) so will scale linearly.

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

def find_repeats(s, max_length, min_length=2):

    for i in xrange(len(s)):
        for j in xrange(min_length, max_length+1):
            count = 1
            while s[i:i+j] == s[i+j*count:i+j*count+j]: count += 1
            if count > 1:
                yield s[i:i+j], i, count

for pattern, position, count in find_repeats(s, 6, 2):
    print "%6s at region (%d, %d), %d repeats" % (pattern, position, position + count*len(pattern), count)

Output:

    AC at region (2, 6), 2 repeats
  ACGT at region (4, 12), 2 repeats
  CGTA at region (5, 13), 2 repeats
    GT at region (18, 24), 3 repeats
    TG at region (19, 23), 2 repeats
    GT at region (20, 24), 2 repeats
    CC at region (24, 28), 2 repeats
    TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
    TA at region (34, 38), 2 repeats

Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?

Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.



回答3:

This simple while loop detects all repeated patterns:

def expand():
    global hi
    hi += 1

def shrink():
    global lo
    lo += 1

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

motifs = set()

lo = 0
hi = 0

f = expand

while hi <= len(s):
    sub = s[lo : hi+1]
    if s.count(sub) > 1:
        motifs.add(sub)
        if lo==hi: f = expand
        f()
    else:
        f = shrink if lo<=hi else expand
        f()

At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:

minlen = 3
for m in filter(lambda m: len(m)>=minlen and s.count(2*m)>=1, motifs):
    print(m)

'''
ATACGT
ACGT
TATACG
CGTA
'''


回答4:

You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:

>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']

Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can only get a maximum one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we can use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.

Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:

(?=((.{2,})\2+))

In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:

def repeated_motifs(s):
    import re
    from collections import defaultdict
    rmdict = defaultdict(list)
    for match in re.finditer(r'(?=((.{2,})\2+))', s):
        motif = match.group(2)
        span1, span2 = match.span(1), match.span(2)
        startindex = span1[0]
        repetitions = (span1[1] - startindex) // (span2[1] - startindex)
        others = rmdict[motif]
        if not others or startindex > others[-1]['region'][1]:
            others.append({'repeat': repetitions, 'region': span1})
    return rmdict


s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found sorted by first region
print(sorted(list(d.keys()), key=lambda k: d[k][0]['region']))

Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.