I have data in this format:
>abc12
ATCGACAG
>def34
ACCGACG
etc.
I have stored each gene into a dictionary with the lines beginning with > as values. So the dictionary is something like {'abc12':'ATCGACAG', etc.}
Now I want to be able to compare each gene, so that it counts the number of A's, T's, C's, or G's at each site.
The only thing I can come up with is to break the dictionary into lists for each nucleotide site and using zip() with a counter. Is this the best way, and if so, how do I break the dictionary into a list for each site?
Use collections.Counter
:
>>> from collections import Counter
>>> Counter('ATCGACAG')
Counter({'A': 3, 'C': 2, 'G': 2, 'T': 1})
>>> Counter('ACCGACG')
Counter({'C': 3, 'A': 2, 'G': 2})
Is there a reason not to use Biopython?
from Bio import AlignIO
alignment =AlignIO.read("alignment.fas", "fasta")
n=0
while n<len(alignment[0]):
A=alignment[:,n].count('A')
C=alignment[:,n].count('C')
G=alignment[:,n].count('G')
T=alignment[:,n].count('T')
gap=alignment[:,n].count('-')
print "at position %s there are %s A's, %s C's, %s G's, %s T's and %s gaps" % (n, A, C, G, T, gap)
n=n+1
Make sure you have a real alignment (i.e. sequences are the same length).
p.s. I apologize for the ugly formatting of the print statement...
this returns
at position 0 there are 2 A's, 0 C's, 0 G's, 0 T's and 0 gaps
at position 1 there are 0 A's, 1 C's, 0 G's, 1 T's and 0 gaps
at position 2 there are 0 A's, 2 C's, 0 G's, 0 T's and 0 gaps
at position 3 there are 0 A's, 0 C's, 2 G's, 0 T's and 0 gaps
at position 4 there are 2 A's, 0 C's, 0 G's, 0 T's and 0 gaps
at position 5 there are 0 A's, 2 C's, 0 G's, 0 T's and 0 gaps
at position 6 there are 1 A's, 0 C's, 0 G's, 0 T's and 1 gaps
at position 7 there are 0 A's, 0 C's, 2 G's, 0 T's and 0 gaps
s1 = "ATCGACAG"
s2 = "ACCGACG"
alignment = [s1[i] if s1[i] == s2[i] else "-" for i in range(len(min([s1,s2],key=len)))]
print "".join(alignment)
A-CGAC-