I want to be able to search a Seq object for a subsequnce Seq object accounting for ambiguity codes. For example, the following should be true:
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
amb = IUPACAmbiguousDNA()
s1 = Seq("GGAAAAGG", amb)
s2 = Seq("ARAA", amb) # R = A or G
print s1.find(s2)
If ambiguity codes were taken into account, the answer should be
>>> 2
But the answer i get is that no match is found, or
>>> -1
Looking at the biopython source code, it doesnt appear that ambiguity codes are taken into account, as the subseqeunce is converted to a string using the private _get_seq_str_and_check_alphabet method, then the built in string method find() is used. Of course if this is the case, the "R" ambiguity code will be taken as a literal "R", not an A or G.
I could figure out how to do this with a home made method, but it seems like something that should be taken care of in the biopython packages using its Seq objects. Is there something I am missing here.
Is there a way to search for sub sequence membership accounting for ambiguity codes?
From what I can read from the documentation for
Seq.find
here:http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#find
It appears that this method works similar to the
str.find
method in that it looks for exact match. So, while the dna sequence can contain ambiguity codes, theSeq.find()
method will only return a match when the exact subsequence matches.To do what you want maybe the
ntsearch
function will work:Search for motifs with degenerate positions