How to concatenate (merge) AAStringSets by name?

2019-08-29 23:37发布

问题:

In bioinformatics/microbial ecology literature a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better.

Say these are two multiple sequence alignments.

library(Biostrings)

set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")

set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")

set1

A AAStringSet instance of length 3
    width seq    names               
[1]     3 IVR    org_1
[2]     3 RDG    org_2
[3]     3 LKS    org_3

set2

A AAStringSet instance of length 3
    width seq    names               
[1]     3 VRT    org_2
[2]     3 RKG    org_3
[3]     3 AST    org_4

The correct concatenation of these sequences would be

A AAStringSet instance of length 4
    width seq    names               
[1]     6 IVR--- org_1
[2]     6 RDGVRT org_2
[3]     6 LKSRKG org_3
[4]     6 ---AST org_4

The "-" notes a 'gap' (lack of amino acid) in that position, or in this case a lack of a gene to concatenate.

I thought there would be a function to do this in BioStrings, MSA, DECIPHER, or other related packages, but have been unable to find one.

I found the following Q&As, each does not provide the desired output as described.

1: https://support.bioconductor.org/p/38955/

output

  A AAStringSet instance of length 6
    width seq names               
[1]     3 IVR org_1
[2]     3 RDG org_2
[3]     3 LKS org_3
[4]     3 VRT org_2
[5]     3 RKG org_3
[6]     3 AST org_4

May be better described as 'appending' the sequences (joins the two sets vertically).

2: https://support.bioconductor.org/p/39878/

output

  A AAStringSet instance of length 2

        width seq
    [1]     9 IVRRDGLKS
    [2]     9 VRTRKGAST

Concatenates sequences in each set, a complete chimera of each set (certainly not desired).

3: How to concatenate two DNAStringSet sequences per sample in R?

output

  A AAStringSet instance of length 3
    width seq
[1]     6 IVRVRT
[2]     6 RDGRKG
[3]     6 LKSAST

Creates chimeras of sequences by the order they are in. Even worse with different number of sequences (loops and concatenates shorter set...)

4: https://www.biostars.org/p/115192/

Output

  A AAStringSet instance of length 2
    width seq
[1]     3 IVR
[2]     3 VRT

Only appends the first sequence from each set, not sure why anyone wants this...

I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!

回答1:

So I am a somewhat fanatical user of data.table in R, among many things it is great to merge datasets by names. I found Biostrings::AAStringSets can be converted to matrices using as.matrix and these can be converted to data.table and merged.

set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
    set12.dt
      rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1    I    V    R <NA> <NA> <NA>
2: org_2    R    D    G    V    R    T
3: org_3    L    K    S    R    K    G
4: org_4 <NA> <NA> <NA>    A    S    T

This is the correct merge, but needs more work to get the final result.

Need to replace "NA" with "-". I always need to look up this question to remember the best way to do this with a data.table.

Fastest way to replace NAs in a large data.table

#slightly modified from original, added arg "x"
f_dowle = function(dt, x) {     # see EDIT later for more elegant solution
      na.replace = function(v,value=x) { v[is.na(v)] = value; v }
      for (i in names(dt))
        eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
    }

f_dowle(set12.dt, "-")

Concatenate the sequences (not included the names with !"rn")

set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")

Convert back to AAStringSet and add back names

set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn

Desired output

set12
 A AAStringSet instance of length 4
    width seq names               
[1]     6 IVR--- org_1
[2]     6 RDGVRT org_2
[3]     6 LKSRKG org_3
[4]     6 ---AST org_4

This works, but seems quite cumbersome, especially converting between different data formats. Obviously can wrap it into a function to use more easily, but again seems like this should already be a function in some Bioconductor package...