In the bioinformatics and microbial ecology literature, a fairly common practice is to concatenate the multiple sequence alignments of several genes before building a phylogenetic tree. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but an example should explain it better.
Say these are two multiple sequence alignments.
library(Biostrings)
set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")
set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")
set1
A AAStringSet instance of length 3
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
set2
A AAStringSet instance of length 3
width seq names
[1] 3 VRT org_2
[2] 3 RKG org_3
[3] 3 AST org_4
The correct concatenation of these sequences would be
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
The "-" denotes a 'gap' (lack of an amino acid) at that position, or in this case the lack of a gene to concatenate.
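To make the target behaviour concrete, here is a minimal sketch of a merge by name with gap padding. `concat_by_name` is a hypothetical helper written for illustration, not a function from Biostrings, MSA, or DECIPHER, and it assumes every input set is an alignment (all sequences in a set have the same width) with unique names.

```r
library(Biostrings)

set1 <- AAStringSet(c("IVR", "RDG", "LKS"))
names(set1) <- paste("org", 1:3, sep = "_")
set2 <- AAStringSet(c("VRT", "RKG", "AST"))
names(set2) <- paste("org", 2:4, sep = "_")

# Hypothetical helper (assumption, not a package function): concatenate
# alignments by sequence name, filling a missing gene with gap characters.
concat_by_name <- function(..., gap = "-") {
  sets <- list(...)
  # Union of all organism names, in order of first appearance
  all_names <- unique(unlist(lapply(sets, names)))
  padded <- lapply(sets, function(s) {
    w <- unique(width(s))
    stopifnot(length(w) == 1)          # each input must be an alignment
    seqs <- as.character(s)            # named character vector
    out <- seqs[all_names]             # NA where the organism is absent
    out[is.na(out)] <- strrep(gap, w)  # pad absent genes with gaps
    out
  })
  # Paste the per-gene columns together, one string per organism
  merged <- do.call(paste0, unname(padded))
  AAStringSet(setNames(merged, all_names))
}

res <- concat_by_name(set1, set2)
res
```

Because the helper takes `...`, it should also accept three or more alignments, as long as each one has a single common width.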
I thought there would be a function to do this in Biostrings, MSA, DECIPHER, or other related packages, but have been unable to find one.
I found the following Q&As, but none of them produces the desired output described above.
1: https://support.bioconductor.org/p/38955/
output
A AAStringSet instance of length 6
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
[4] 3 VRT org_2
[5] 3 RKG org_3
[6] 3 AST org_4
This is better described as 'appending' the sequences (it joins the two sets vertically).
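For reference, the same appended result can be reproduced with `c()` on the two sets. That the linked post uses `c()` is an assumption on my part; the sketch only shows that this kind of call stacks the sets rather than merging them by name.

```r
library(Biostrings)

set1 <- AAStringSet(c("IVR", "RDG", "LKS"))
names(set1) <- paste("org", 1:3, sep = "_")
set2 <- AAStringSet(c("VRT", "RKG", "AST"))
names(set2) <- paste("org", 2:4, sep = "_")

# c() stacks the two sets; org_2 and org_3 each appear twice
appended <- c(set1, set2)
appended
```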
2: https://support.bioconductor.org/p/39878/
output
A AAStringSet instance of length 2
width seq
[1] 9 IVRRDGLKS
[2] 9 VRTRKGAST
Concatenates the sequences within each set, producing a complete chimera of each set (certainly not desired).
3: How to concatenate two DNAStringSet sequences per sample in R?
output
A AAStringSet instance of length 3
width seq
[1] 6 IVRVRT
[2] 6 RDGRKG
[3] 6 LKSAST
Creates chimeras by pairing sequences according to their position in each set, ignoring their names. It is even worse when the sets contain different numbers of sequences (it loops over and concatenates the shorter set...)
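The positional pairing can be illustrated with `xscat()` from Biostrings. Whether the linked answer uses exactly this call is an assumption; the point of the sketch is only that pasting by position joins org_1 from the first set to org_2 from the second.

```r
library(Biostrings)

set1 <- AAStringSet(c("IVR", "RDG", "LKS"))
names(set1) <- paste("org", 1:3, sep = "_")
set2 <- AAStringSet(c("VRT", "RKG", "AST"))
names(set2) <- paste("org", 2:4, sep = "_")

# xscat() pastes element i of set1 to element i of set2, regardless of names
by_position <- xscat(set1, set2)

# First element is "IVR" (org_1) joined to "VRT" (org_2): a chimera
first_chimera <- unname(as.character(by_position))[1]
first_chimera
```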
4: https://www.biostars.org/p/115192/
output
A AAStringSet instance of length 2
width seq
[1] 3 IVR
[2] 3 VRT
Only appends the first sequence from each set; I'm not sure why anyone would want this.
I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!