How to find specific frequency of a codon?

I am trying to make a function in R which could calculate the frequency of each codon. We know that methionine is an amino acid which could be formed by only one set of codon ATG so its percentage in every set of sequence is 1. Where as Glycine could be formed by GGT, GGC, GGA, GGG hence the percentage of occurring of each codon will be 0.25. The input would be in a DNA sequence like-ATGGGTGGCGGAGGG and with the help of codon table it could calculate the percentage of each occurrence in an input.

please help me by suggesting ways to make this function.

for example, if my argument is ATGTGTTGCTGG then, my result would be

ATG=1
TGT=0.5
TGC=0.5
TGG=1

Data for R:

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
    ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
    AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
    CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
    CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
    CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
    GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
    GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
    GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
    TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
    TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

标签： r bioinformatics dna-sequence

3条回答

干净又极端

2楼-- · 2019-07-10 04:16

A slightly different path leads to this solution:

f0 <- function(dna, weight) {
    codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
    tibble(id = seq_along(codons), codons = codons) %>%
        unnest() %>%
        mutate(weight = as.vector(wt[codons]))
}

First, codon is just a named vector, not list; here are the weights

codon <- unlist(codon)
weight <- setNames(1 / table(codon)[codon], names(codon))

Second, probably there is a vector of DNA sequences, rather than one.

dna <- c("ATGTGTTGCTGG", "GGTCGTTGTGTA")

To develop the solution, codons can be found by searching for any nucleotide [ACGT] repeated {3} times

codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))

It seems like it is then convenient to do operations in the tidyverse, creating a tibble (data.frame) where id indicates which sequence the codon is from

library(tidyverse)
tbl <- tibble(id = seq_along(codons), codon = codons) %>% unnest()

and then add the weights

tbl <- mutate(tbl, weight = as.vector(weight[codon]))

so we have

> tbl
# A tibble: 8 x 3
     id codon weight
  <int> <chr>  <dbl>
1     1 ATG    1    
2     1 TGT    0.5  
3     1 TGC    0.5  
4     1 TGG    1    
5     2 GGT    0.25 
6     2 CGT    0.167
7     2 TGT    0.5  
8     2 GTA    0.25

Standard tidyverse operations could be used for further summary, in particular when the same codon appears multiple times

tbl %>% group_by(id, codon) %>%
    summarize(wt = sum(weight))

0人赞添加讨论(0) 举报

老娘就宠你

3楼-- · 2019-07-10 04:20

First, I get my lookup list and sequence.

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
              ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
              AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
              CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
              CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
              CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
              GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
              GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
              GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
              TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
              TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

MySeq <- "ATGTGTTGCTGG"

Next, I load the stringi library and break the sequence into chunks of three characters.

# Load library
library(stringi)

# Break into 3 bases
seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)

Then, I count the letters that these three base chunks correspond to using table.

# Get associated letters
letter_count <- table(unlist(codon[seq_split]))

Finally, I bind the sequences together with the reciprocal of the count and rename my data frame columns.

# Bind into a data frame
res <- data.frame(seq_split,
                  1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])

# Rename columns
colnames(res) <- c("Sequence", "Letter", "Percentage")

#  Sequence Letter Percentage
#1      ATG      M        1.0
#2      TGT      C        0.5
#3      TGC      C        0.5
#4      TGG      W        1.0

0人赞添加讨论(0) 举报

forever°为你锁心

4楼-- · 2019-07-10 04:30

Two things to solve here:

convert codon to the fractions for each letter

( fracs <- 1/table(unlist(codon)) )
#         A         C         D         E         F         G         H         I 
# 0.2500000 0.5000000 0.5000000 0.5000000 0.5000000 0.2500000 0.5000000 0.3333333 
#         K         L         M         N         P         Q         R         S 
# 0.5000000 0.1666667 1.0000000 0.5000000 0.2500000 0.5000000 0.1666667 0.1666667 
#      stop         T         V         W         Y 
# 0.3333333 0.2500000 0.2500000 1.0000000 0.5000000 
codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))
str(head(codonfracs))
# List of 6
#  $ ATA: num 0.333
#  $ ATC: num 0.333
#  $ ATT: num 0.333
#  $ ATG: num 1
#  $ ACA: num 0.25
#  $ ACC: num 0.25

convert the sequence string to a vector of length-3 substrings

s <- 'ATGTGTTGCTGG'

strsplit3 <- function(s, k=3) {
  starts <- seq.int(1, nchar(s), by=k)
  stops <- c(starts[-1] - 1, nchar(s))
  mapply(substr, s, starts, stops, USE.NAMES=FALSE)
}
strsplit3(s)
# [1] "ATG" "TGT" "TGC" "TGG"

From here, it's just a lookup:

codonfracs[ strsplit3(s) ]
# $ATG
# [1] 1
# $TGT
# [1] 0.5
# $TGC
# [1] 0.5
# $TGG
# [1] 1

EDIT

Since you want the status of the other codons, try this:

x <- codonfracs
x[ ! names(x) %in% strsplit3(s) ] <- 0
str(x)
# List of 64
#  $ ATA: num 0
#  $ ATC: num 0
#  $ ATT: num 0
#  $ ATG: num 1
#  $ ACA: num 0
#  $ ACC: num 0
#  $ ACG: num 0
# ...snip...
#  $ TAT: num 0
#  $ TAA: num 0
#  $ TAG: num 0
#  $ TGC: num 0.5
#  $ TGT: num 0.5
#  $ TGA: num 0
#  $ TGG: num 1

0人赞添加讨论(0) 举报

How to find specific frequency of a codon?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间