I am trying to write a function in R that calculates the frequency of each codon. Methionine is an amino acid encoded by only one codon, ATG, so its fraction in every sequence is 1. Glycine, on the other hand, can be encoded by GGT, GGC, GGA, or GGG, so the fraction for each of those codons is 0.25. The input would be a DNA sequence such as ATGGGTGGCGGAGGG, and with the help of a codon table the function should calculate the fraction for each codon occurring in the input.
Please help me by suggesting ways to write this function.
For example, if my argument is ATGTGTTGCTGG, then my result would be
ATG=1
TGT=0.5
TGC=0.5
TGG=1
Data for R:
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
A slightly different path leads to this solution:
First, `codon` is just a named vector, not a list; here are the weights:
Second, there is probably a vector of DNA sequences rather than just one.
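A minimal sketch of both points, assuming an abbreviated codon table (only M, C, W and G are listed here to keep the example short; the full table from the question would be used in practice). The names `weights` and `dna` are illustrative, not taken from the original answer.

```r
# Abbreviated codon table as a named character vector (assumption: the full
# table from the question would be used in practice).
codon <- c(ATG = "M", TGT = "C", TGC = "C", TGG = "W",
           GGT = "G", GGC = "G", GGA = "G", GGG = "G")

# table(codon) counts how many codons encode each amino acid; indexing the
# table by the amino-acid letters gives that count for every codon.
n_synonymous <- table(codon)[codon]

# Weight of a codon = 1 / (number of codons encoding the same amino acid).
weights <- setNames(1 / as.vector(n_synonymous), names(codon))

# A vector of DNA sequences rather than a single string.
dna <- c("ATGTGTTGCTGG", "ATGGGTGGCGGAGGG")

print(weights)
```

Because the weights depend only on the codon table, they are computed once and can then be reused for every sequence in `dna`.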
To develop the solution, codons can be found by searching for any nucleotide `[ACGT]` repeated `{3}` times.
It then seems convenient to do the operations in the tidyverse, creating a tibble (data.frame) where `id` indicates which sequence the codon is from, and then adding the weights, so we have
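The steps described above could be sketched roughly as follows. This assumes the stringr, tibble and dplyr packages are installed and uses an abbreviated codon table. The column name `triplet` is a hypothetical choice made here to avoid clashing with the `codon` table.

```r
library(stringr)
library(tibble)
library(dplyr)

# Abbreviated codon table (assumption: use the full table in practice).
codon <- c(ATG = "M", TGT = "C", TGC = "C", TGG = "W",
           GGT = "G", GGC = "G", GGA = "G", GGG = "G")
weights <- setNames(1 / as.vector(table(codon)[codon]), names(codon))

dna <- c("ATGTGTTGCTGG", "ATGGGTGGCGGAGGG")

# Non-overlapping runs of three nucleotides; one character vector per sequence.
codons <- str_extract_all(dna, "[ACGT]{3}")

tbl <- tibble(
  id = rep(seq_along(codons), lengths(codons)),  # which sequence the codon is from
  triplet = unlist(codons)
) %>%
  mutate(weight = unname(weights[triplet]))      # look up each codon's weight

print(tbl)
```

Note that the regular expression reads codons in frame from the first base only while the sequence contains nothing but A, C, G and T; any other character would shift the frame.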
Standard tidyverse operations could be used for further summary, in particular when the same codon appears multiple times.
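As one possible summary, repeated codons can be collapsed into a count per sequence with `dplyr::count()`. The small tibble below is made-up illustration data in the shape described above (columns `id`, `triplet`, `weight`, where `triplet` is a hypothetical column name), not output from the original answer.

```r
library(tibble)
library(dplyr)

# Illustration data: one sequence in which TGT occurs twice.
tbl <- tibble(
  id = c(1L, 1L, 1L, 1L, 1L),
  triplet = c("ATG", "TGT", "TGC", "TGT", "TGG"),
  weight = c(1, 0.5, 0.5, 0.5, 1)
)

# One row per (sequence, codon), with the number of occurrences in column n.
summary_tbl <- tbl %>%
  count(id, triplet, weight, name = "n")

print(summary_tbl)
```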
First, I get my lookup list and sequence.
Next, I load the `stringi` library and break the sequence into chunks of three characters.
Then, I count the letters that these three-base chunks correspond to using `table`.
Finally, I bind the sequences together with the reciprocal of the count and rename my data frame columns.
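A rough sketch of these four steps, assuming the stringi package is installed and using an abbreviated lookup list. The variable names (`chunks`, `aa_counts`, `result`) are illustrative, and the exact calls in the original answer may have differed.

```r
library(stringi)

# Step 1: lookup list (abbreviated; assumption: the full table in practice)
# and the input sequence.
codon <- list(ATG = "M", TGT = "C", TGC = "C", TGG = "W")
dna <- "ATGTGTTGCTGG"

# Step 2: break the sequence into chunks of three characters.
# Assumes the length is a multiple of 3; otherwise the last chunk is shorter.
chunks <- stri_sub(dna, from = seq(1, stri_length(dna), by = 3), length = 3)

# Step 3: translate each chunk and count how often each letter occurs
# among the chunks of this sequence.
aa <- unlist(codon[chunks], use.names = FALSE)
aa_counts <- table(aa)

# Step 4: bind the chunks to the reciprocal of the count and rename columns.
result <- data.frame(chunks, 1 / as.vector(aa_counts[aa]))
names(result) <- c("codon", "fraction")

print(result)
```

In this reading, the counts come from the codons present in the input sequence rather than from the whole codon table, so a synonymous codon that never appears in the input does not lower the fraction.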
Two things to solve here:
convert `codon` to the fractions for each letter
convert the sequence string to a vector of length-3 substrings
From here, it's just a lookup:
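Both conversions and the final lookup might look something like this in base R. It is a sketch with an abbreviated `codon` list; `fractions` and `triplets` are names chosen here for illustration.

```r
# Abbreviated codon list (assumption: use the full list from the question).
codon <- list(ATG = "M", TGT = "C", TGC = "C", TGG = "W",
              GGT = "G", GGC = "G", GGA = "G", GGG = "G")

# 1. Convert codon to the fraction for each letter:
#    1 / (number of codons that map to the same amino acid).
aa <- unlist(codon)
fractions <- setNames(1 / as.vector(table(aa)[aa]), names(aa))

# 2. Convert the sequence string to a vector of length-3 substrings.
#    Assumes at least one full codon; a trailing partial codon is dropped.
dna <- "ATGTGTTGCTGG"
starts <- seq(1, nchar(dna) - 2, by = 3)
triplets <- substring(dna, starts, starts + 2)

# 3. The lookup itself: index the named fractions by the triplets.
result <- fractions[triplets]
print(result)
```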
EDIT
Since you want the status of the other codons, try this: