So I have these lines of code:
import re
seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
OR_0 = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)', seq)
Let me explain what I am attempting to do first. I'm sorry if this is confusing, but I am going to try my best to explain it.
I'm looking for sequences that START with 'ATG', followed by units of 3 of any word character (e.g. 'GGG', 'GTT', 'TTA', etc.) until it encounters a 'TAA', 'TAG' or 'TGA'. I also want them to be at least 30 characters long, hence the {9,}?
This works to some degree, but notice that seq, read in units of 3, is: ATG GAA GTT GGA TGA AAG TGG AGG TAA AGA GAA GAC GTT TGA
So in this case it should find 'ATGGAAGTTGGATGA' if it starts at the first 'ATG' and goes until the next 'TAA', 'TAG' or 'TGA'.
HOWEVER
when you run the OR_0 line of code, it returns the entire seq string. I don't know how to make it only consider the first 'TAA', 'TAG' or 'TGA' that follows the first 'ATG'.
If an 'ATG' is followed by another 'ATG' when read in units of 3, that is alright and it should NOT start over, but if it encounters a 'TAA', 'TAG' or 'TGA' when read in units of 3, it should stop.
My question: why is re.findall finding the longest sequence of 'ATG'-xxx-xxx-['TAA', 'TAG' or 'TGA'] instead of stopping at the first occurrence of 'TAA', 'TAG' or 'TGA' after an 'ATG', separated by word characters in units of 3?
Once again, I apologize if this is confusing, but it's messing with multiple data sets that I have based on this initial line of code, and I'm trying to find out why.
If you want your regex to stop matching at the first TAA|TAG|TGA, but still only succeed if there are at least nine three-letter chunks, the following may help:
>>> import re
>>> regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAATAGAAAAAAAAAAAAAAAAAAAAATAG')
[]
This uses a negative lookahead, (?!TAA|TAG|TGA), to ensure that a three-character chunk is not a TAA|TAG|TGA before it matches the three-character chunk.
Note though that a TAA|TAG|TGA that does not fall on a three-character boundary will still successfully match:
>>> re.findall(regexp, 'ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG']
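As a rough sketch of how this behaves on the sequence from the question, the lookahead pattern can be wrapped in a small helper with a configurable minimum. The name find_orfs and the min_codons parameter are made up here for illustration; they are not from the question or from the re module.

```python
import re

STOPS = "TAA|TAG|TGA"

def find_orfs(seq, min_codons=9):
    """Hypothetical helper: 'ATG', then at least min_codons non-stop
    three-letter chunks, then the first stop chunk."""
    # The negative lookahead refuses to consume a chunk that is a stop codon.
    pattern = r"ATG(?:(?!%s)...){%d,}?(?:%s)" % (STOPS, min_codons, STOPS)
    return re.findall(pattern, seq)

seq = "ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA"

# Only three chunks sit between the first ATG and the first TGA,
# so nothing satisfies a minimum of nine.
print(find_orfs(seq, min_codons=9))   # []

# With no minimum, the match ends at the first stop chunk.
print(find_orfs(seq, min_codons=0))   # ['ATGGAAGTTGGATGA']
```

So with the lookahead in place, a sequence that hits a stop chunk too early is rejected outright instead of being extended to a later stop.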
If the length is not a requirement then it's pretty easy:
>>> import re
>>> seq= 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
>>> regex = re.compile(r'ATG(?:...)*?(?:TAA|TAG|TGA)')
>>> regex.findall(seq)
['ATGGAAGTTGGATGA']
Anyway, I believe that, according to your explanation, your previous regex is actually doing what you want: searching for matches of at least 30 characters that start with ATG and end in TGA.
In your question you first state that you need matches of at least 30 characters, and hence you put the {9,}?, but after that you expect it to return a match of any length. You cannot have both; choose one. If length is important, then keep the regex you already have: the result you are getting is correct.
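To make that concrete, here is a small sketch (the helper name first_match_length is just for illustration) that runs the original pattern with different minimum counts. The lazy ? only means "take as few chunks as possible", but never fewer than the minimum, so any stop chunk inside the mandatory part is consumed by ... like any other chunk.

```python
import re

seq = "ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA"

def first_match_length(min_codons):
    """Illustrative helper: length of the first match of the original
    (no-lookahead) pattern for a given minimum number of chunks."""
    pattern = r"ATG(?:...){%d,}?(?:TAA|TAG|TGA)" % min_codons
    m = re.search(pattern, seq)
    return len(m.group(0)) if m else None

for n in (0, 4, 9):
    print(n, first_match_length(n))
# 0 15   (stops at the first TGA)
# 4 27   (the first TGA is swallowed, stops at TAA)
# 9 42   (TGA and TAA are both swallowed, stops at the final TGA)
```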
You don't need regular expressions.
def chunks(l, n):
    """Yield successive n-sized chunks from l.

    from: http://stackoverflow.com/a/312464/1561176
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

def method(sequence, start=('ATG',), stop=('TAA', 'TAG', 'TGA'), min_len=30):
    response = ''
    started = False
    for x in chunks(sequence, 3):
        if x in start:
            # A start chunk opens a sequence (or extends one already open).
            started = True
            response += x
        elif x in stop and started:
            if len(response) >= min_len:
                # Long enough: emit the sequence including its stop chunk.
                yield response + x
                response = ''
                started = False
            else:
                # Too short: treat the stop chunk as an ordinary chunk.
                response += x
        elif started:
            response += x
    if response:
        # Emit what is left if the input ended before a usable stop chunk.
        yield response

for result in method('ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'):
    print(result)
If I use the min_len of 30, the return is:
ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA
If I use a min_len of 0, the return is:
ATGGAAGTTGGATGA
Try this:
seq= 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
OR_0 = re.findall(r'ATG(?:.{3})*?(?:TAA|TAG|TGA)',seq)