How to use matchPattern() to find certain aminoacid i

I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons in each DNA sequence (taking the reading frame into account). I tried using matchPattern (a function from the Biostrings R package) to find these codons.

As an example mydata.txt could be:

>a
atgaatgctaaccccaccgagtaa
>b
atgctaaccactgtcatcaatgcctaa
>c
atggcatgatgccgagaggccagaataggctaa
>d
atggtgatagctaacgtatgctag
>e
atgccatgcgaggagccggctgccattgactag

file=read.fasta(file="mydata.txt") 
matchPattern( "atg" , file)

Note: read.fasta is a function from the seqinr package that imports FASTA-format files.

But these commands didn't work! How can I use this function to find the start and stop codons in each exon sequence (without frame shifting)?

Tags: r bioinformatics fasta bioconductor

2 Answers

小情绪 Triste *

#2 · 2019-04-02 14:54

The 'subject' argument of matchPattern must be a special object (e.g. an XString). You can convert your sequences to XStrings by collapsing each one with paste and passing the result to BString (see ?BString).

So, with your data:

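library(seqinr)      # read.fasta
library(Biostrings)  # BString, matchPattern

file <- read.fasta(file = "mydata.txt")

# find 'atg' locations
atg <- lapply(file, function(x) {
  # read.fasta returns one character per base, so collapse to a single string
  string <- BString(paste(x, collapse = ""))
  matchPattern("atg", string)
})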

atg[1:2]
# $a
#   Views on a 24-letter BString subject
# subject: atgaatgctaaccccaccgagtaa
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     5   7     3 [atg]
#
# $b
#   Views on a 27-letter BString subject
# subject: atgctaaccactgtcatcaatgcctaa
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]    20  22     3 [atg]

For a simple example, finding the number and locations of 'atg's in a sequence:

sequence <- BString("atgatgccatgcccccatgcatgatatg")
result <- matchPattern("atg", sequence)
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]
# [3]     9  11     3 [atg]
# [4]    17  19     3 [atg]
# [5]    21  23     3 [atg]
# [6]    26  28     3 [atg]

# Find out how many 'atg's were found
length(result)
# [1] 6

# Get the start site of each 'atg' (start() is the accessor for the ranges slot)
start(result)
# [1]  1  4  9 17 21 26

Also, check out ?DNAString and ?RNAString. They are similar to BString only they are limited to nucleotide characters, and allow for quick comparisons between DNA and RNA sequences.
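As a rough sketch of the DNAString route (it is not part of the code above, and it assumes Biostrings is installed): DNAString takes the lowercase input and stores it as uppercase, and start() returns the match positions. reverseComplement() is included only as one example of a nucleotide-specific operation.

```r
library(Biostrings)

# Same toy sequence as above, stored as DNA rather than as a generic BString
dna <- DNAString("atgatgccatgcccccatgcatgatatg")

# The pattern is matched against the DNA alphabet
hits <- matchPattern("ATG", dna)
atg_starts <- start(hits)
atg_starts
# [1]  1  4  9 17 21 26

# A nucleotide-only operation that a plain BString does not offer
rc <- reverseComplement(dna)
```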

Edit to address frame shifting concern mentioned in the comments: You can subset the result to get those 'atg's that are in frame using the modulo trick mentioned by @DWin.

# assuming the first 'atg' sets the frame
in.frame.result <- result[(start(result) - start(result)[1]) %% 3 == 0]
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]

# There are two 'atg's in frame in this result
length(in.frame.result)
# [1] 2

# With your data:
file = read.fasta(file = "mydata.txt")
atg <- lapply(file, function(x) {
  string <- BString(paste(x, collapse = ""))
  result <- matchPattern("atg", string)
  result[(start(result) - start(result)[1]) %% 3 == 0]
})
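The same modulo filter can be applied to the stop codons. The sketch below is one possible way to do it, not the answer's own code: in_frame_stops is a hypothetical helper name, it assumes the first 'atg' sets the frame, and strsplit stands in for the per-base character vector that read.fasta returns.

```r
library(Biostrings)

# Hypothetical helper: start positions of stop codons that are in frame
# with the first 'atg' of one sequence (given as a vector of single bases)
in_frame_stops <- function(chars) {
  string <- BString(paste(chars, collapse = ""))
  atg <- start(matchPattern("atg", string))
  if (length(atg) == 0) return(integer(0))  # no start codon, so no frame
  # collect the start positions of all three stop codons
  stops <- unlist(lapply(c("taa", "tga", "tag"), function(codon) {
    start(matchPattern(codon, string))
  }))
  # keep those whose distance from the first 'atg' is a multiple of 3
  sort(stops[(stops - atg[1]) %% 3 == 0])
}

# Sequence 'c' from the question, split into single bases
seq_c <- strsplit("atggcatgatgccgagaggccagaataggctaa", "")[[1]]
stops_c <- in_frame_stops(seq_c)
stops_c
# [1]  7 31
```

With the question's file this could be applied as lapply(file, in_frame_stops).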


【Aperson】

#3 · 2019-04-02 14:54

It's hard for me to believe this hasn't already been done by one of the BioC packages, but if you want to do it with base R functionality, consider using gregexpr:

x <- c(a='atgaatgctaaccccaccgagtaa', 
  b='atgctaaccactgtcatcaatgcctaa', 
  c='atggcatgatgccgagaggccagaataggctaa', 
  d='atggtgatagctaacgtatgctag', 
  e='atgccatgcgaggagccggctgccattgactag')

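# 'atg's in frame with the first one (parentheses matter: %% binds tighter than -)
starts <- lapply(gregexpr("atg", x), function(pos) pos[(pos - pos[1]) %% 3 == 0])
# stop codons in frame with the first 'atg' of each sequence
stops <- mapply(function(strg, first) {
  poss <- gregexpr("taa|tga|tag", strg)[[1]]
  poss[(poss - first) %% 3 == 0]
}, x, sapply(starts, `[`, 1), SIMPLIFY = FALSE)
stops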
#--------------
$a
[1] 22

$b
[1] 25

$c
[1]  7 31

$d
[1] 22

$e
[1] 31

You can verify that a stop codon is "in frame" by checking that its distance from the start codon is evenly divisible by 3:

> (25-1)%%3
[1] 0

