I have a file (mydata.txt) that contains many exon sequences with fasta format. I want to find start ('atg') and stop ('taa','tga','tag') codons for each DNA sequence (considering the frame). I tried using matchPattern
( a function from the Biostrings
R package) to find theses amino acids:
As an example mydata.txt could be:
>a
atgaatgctaaccccaccgagtaa
>b
atgctaaccactgtcatcaatgcctaa
>c
atggcatgatgccgagaggccagaataggctaa
>d
atggtgatagctaacgtatgctag
>e
atgccatgcgaggagccggctgccattgactag
file=read.fasta(file="mydata.txt")
matchPattern( "atg" , file)
Note: read.fasta is a function in seqinr
package that used to import fasta format files.
But this commands didn't work! How can I use this function to find start and stop codons in each exon sequence? (without frame shifting)
The 'subject' argument for
matchPattern
is a special object (e.g. XString). You can convert your sequences to XStrings by collapsing them with paste and using?BString
.So, with your data:
For a simple example, finding the number and locations of 'atg's in a sequence:
Also, check out
?DNAString
and?RNAString
. They are similar toBString
only they are limited to nucleotide characters, and allow for quick comparisons between DNA and RNA sequences.Edit to address frame shifting concern mentioned in the comments: You can subset the result to get those 'atg's that are in frame using the modulo trick mentioned by @DWin.
It's hard for me to believe this hasn't yet been done by one of the BioC packages, but if you want to do it with base R functionality, then consider using
gregexpr
You check to see if the stop codons are "in frame" reads by checking the distance being evenly divisible by 3: