
Nucleotides separator in the pairwise sequence ali

Posted 2019-09-09 22:36

Question:

I have RNA sequences that contain various modified nucleotides and residues, for example N79, 8XU, SDG, and I.

I want to align them pairwise using Biopython's pairwise2.align.localms. Is it possible to pass the input as a list rather than as a string, so that these modified bases are accounted for accurately?

What is the correct technique?

Answer 1:

Biopython's pairwise2 module works on strings of letters, which can be anything - for example:

>>> from Bio import pairwise2
>>> from Bio.pairwise2 import format_alignment
>>> for a in pairwise2.align.localms("ACCGTN97CT", "ACCG8DXCT", 2, -1, -.5, -.1):
...     print(format_alignment(*a))
... 
ACCG--TN97CT
||||||||||||
ACCG8DX---CT
  Score=9.7

ACCGTN97--CT
||||||||||||
ACCG---8DXCT
  Score=9.7

You can set the match/mismatch scores according to your needs. However, this assumes each letter is a separate element.
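To see what "each letter is a separate element" means in practice, here is a minimal sketch in plain Python (no Biopython needed). The helper `tokenize` and the set `MODIFIED_CODES` are hypothetical names introduced only for illustration, and the sketch assumes you already know which multi-character codes occur in your data.

```python
# Hypothetical set of multi-character residue codes; replace with your own.
MODIFIED_CODES = {"N97", "8DX"}


def tokenize(seq, codes=MODIFIED_CODES):
    """Split seq into tokens, keeping known multi-character codes whole."""
    # Try longer codes first so a long code wins over a shorter prefix.
    ordered = sorted(codes, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(seq):
        for code in ordered:
            if seq.startswith(code, i):
                tokens.append(code)
                i += len(code)
                break
        else:
            # No known code starts here: treat the character as one residue.
            tokens.append(seq[i])
            i += 1
    return tokens


# A plain string is seen character by character, so "N97" becomes 3 elements.
as_chars = list("ACCGTN97CT")
# A token list keeps "N97" as a single element.
as_tokens = tokenize("ACCGTN97CT")
print(as_chars)
print(as_tokens)
```

With the string input the aligner scores "N", "9" and "7" independently, which is why the first alignment above is ten columns wide before gaps are added.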

It was not clear from your question whether your example N79 is one modified nucleotide or three. If you want to treat N79 as one base, that does seem to be possible. I don't think it was intentional (so I wouldn't want to depend on this behaviour), but I could trick pairwise2 into working on lists of strings:

>>> for a in pairwise2.align.localms(["A", "C", "C", "G", "T", "N97", "C", "T"], ["A", "C", "C", "G", "8DX", "C", "T"], 2, -1, -.5, -.1, gap_char=["-"]):
...     print(format_alignment(*a))
... 
['A', 'C', 'C', 'G', 'T', 'N97', 'C', 'T']
||||||||
['A', 'C', 'C', 'G', '8DX', '-', 'C', 'T']
  Score=10.5

['A', 'C', 'C', 'G', 'T', 'N97', 'C', 'T']
||||||||
['A', 'C', 'C', 'G', '-', '8DX', 'C', 'T']
  Score=10.5

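Notice that the default format_alignment function does not display this very well: it prints the Python list representation, and the match line no longer lines up with the tokens.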
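If you need a more readable display, one option is a small formatter of your own. The sketch below is plain Python and makes some assumptions: `format_token_alignment` is a hypothetical helper (not part of Biopython), it expects two already-aligned token lists of equal length, and it assumes the gap token is "-".

```python
def format_token_alignment(seq_a, seq_b, gap="-"):
    """Render two aligned token lists in padded columns with a match line."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    top, mid, bottom = [], [], []
    for a, b in zip(seq_a, seq_b):
        # Pad each column to the width of its widest token.
        width = max(len(a), len(b))
        if a == gap or b == gap:
            marker = " "  # gap column
        elif a == b:
            marker = "|"  # match
        else:
            marker = "."  # mismatch
        top.append(a.ljust(width))
        mid.append(marker.ljust(width))
        bottom.append(b.ljust(width))
    return "\n".join(" ".join(row).rstrip() for row in (top, mid, bottom))


# The two aligned lists from the first result above.
aligned_a = ["A", "C", "C", "G", "T", "N97", "C", "T"]
aligned_b = ["A", "C", "C", "G", "8DX", "-", "C", "T"]
text = format_token_alignment(aligned_a, aligned_b)
print(text)
```

Each token occupies one column, so a three-character code such as N97 lines up against a single gap token instead of being spread over three positions.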



Answer 2:

Sorry for adding another answer, but my reputation is not good enough for just adding comments...

To elaborate on peterjc's answer, accepting lists as input is the intended behaviour of pairwise2 (and now I understand what it may be good for...).

And you are right, it's also about the gap_char argument: since you are supplying the sequences as lists, the gap character must also be given as a list (["-"]).
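For intuition about why the gap has to match the sequence type, here is a tiny plain-Python illustration. It only shows general list behaviour and is not a description of how pairwise2 is implemented internally.

```python
tokens = ["A", "C", "N97"]

# A list gap can be spliced into a token list with ordinary concatenation.
with_list_gap = tokens[:2] + ["-"] + tokens[2:]

# A string gap cannot: list + str raises TypeError in Python.
try:
    tokens[:2] + "-" + tokens[2:]
    string_gap_ok = True
except TypeError:
    string_gap_ok = False

print(with_list_gap, string_gap_ok)
```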