I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to count how many of each letter there is, so for a single sequence I count like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
cat input.txt | grep -v ">" | fold -w1 | sort | uniq -c
6 A
9 C
10 D
1 E
7 F
18 G
5 H
4 I
7 K
21 L
15 N
7 P
6 Q
11 R
16 S
18 T
7 V
8 W
7 Y
However, I want to do this for all sequences in a better, more efficient way, especially when the data is huge.
Counting characters in strings can easily be done with awk, using the function gsub:

gsub(ere, repl[, in])
    Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

sub(ere, repl[, in])
    Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.

source: Awk POSIX standard
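The key point for counting is that gsub returns the number of substitutions it performed, so deleting every copy of a character also tells you how many there were. A minimal illustration:

```shell
# gsub() returns how many replacements it made, so deleting every "S"
# from the record simultaneously counts the occurrences of "S"
echo "MISSISSIPPI" | awk '{ n = gsub(/S/, "", $0); print n, $0 }'
# prints: 4 MIIIPPI
```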
The following two functions perform the counting in this way:
function countCharacters(str) {
    while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c, "", str) }
}
or, if runs of equal consecutive characters are common, the following variant might shave off a couple of seconds:
function countCharacters2(str) {
    n = length(str)
    while (str != "") {
        c = substr(str,1,1); gsub(c "+", "", str)
        m = length(str); a[toupper(c)] += n - m; n = m
    }
}
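As a quick sanity check, here is a minimal, self-contained run of the first function on a made-up input (the string AABBC is just an example):

```shell
# drive countCharacters with a toy input; the trailing sort makes the
# output deterministic, since "for (c in a)" iterates in unspecified order
printf 'AABBC\n' | awk '
function countCharacters(str) {
    while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c, "", str) }
}
{ countCharacters($0) }
END { for (c in a) print c, a[c] }' | sort
# prints:
# A 2
# B 2
# C 1
```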
Below are four implementations based on the first function. The first two run on a standard awk; the latter two use bioawk, a version optimized for FASTA files:
1. Read the sequences and process them line by line:
awk '!/^>/{s=$0; while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
END {for(c in a) print c,a[c]}' file
2. Concatenate all sequences and process the result at the end:
awk '!/^>/{s=s $0 }
END {while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
3. Same as 1, but using bioawk:
bioawk -c fastx '{while ($seq!=""){ c=substr($seq,1,1);a[c]+=gsub(c,"",$seq) } }
END{ for(c in a) print c,a[c] }' file
4. Same as 2, but using bioawk:
bioawk -c fastx '{s=s $seq}
END { while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
Here are some timing results based on this FASTA file:
OP : grep,sort,uniq : 47.548 s
EdMorton 1 : awk : 39.992 s
EdMorton 2 : awk,sort,uniq : 53.965 s
kvantour 1 : awk : 18.661 s
kvantour 2 : awk : 9.309 s
kvantour 3 : bioawk : 1.838 s
kvantour 4 : bioawk : 1.838 s
karafka : awk : 38.139 s
stack0114106 1: perl : 22.754 s
stack0114106 2: perl : 13.648 s
stack0114106 3: perl (zdim) : 7.759 s
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
With any awk in any shell on any UNIX box:
$ cat tst.awk
!/^>/ {
for (i=1; i<=length($0); i++) {
cnt[substr($0,i,1)]++
}
}
END {
for (char in cnt) {
print char, cnt[char]
}
}
$ awk -f tst.awk file
A 107
N 67
P 107
C 41
Q 88
D 102
E 132
R 101
F 65
S 168
G 140
T 115
H 52
I 84
V 101
W 27
K 114
Y 30
L 174
M 39
or if you prefer:
$ awk -v ORS= '!/^>/{gsub(/./,"&\n"); print}' file | sort | uniq -c
107 A
41 C
102 D
132 E
65 F
140 G
52 H
84 I
114 K
174 L
39 M
67 N
107 P
88 Q
101 R
168 S
115 T
101 V
27 W
30 Y
Try this Perl solution for better performance:
$ perl -lne '
if( ! /^>/ ) { while(/./g) { $kv{$&}++} }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' learner.txt
A 107
C 41
D 102
E 132
F 65
G 140
H 52
I 84
K 114
L 174
M 39
N 67
P 107
Q 88
R 101
S 168
T 115
V 101
W 27
Y 30
$
One more solution using Perl, optimized for performance:
$ time perl -lne '
if( ! /^>/ ) { for($i=0;$i<length($_);$i++)
{ $x=substr($_,$i,1); $kv{$x}++ } }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
A 2994088
C 1876822
G 1889305
N 30812232
T 3002884
a 4892104
c 3408967
g 3397589
n 140
t 4953284
real 0m15.922s
user 0m15.750s
sys 0m0.108s
$
Edit with further performance optimizations
All timings reported below are averages over 3-5 runs on a desktop, done at around the same time but interleaved to avoid pronounced caching effects.
Changing the C-style for loop to for my $i (0..length($_)) speeds the second solution from 9.2 seconds to 6.8 seconds.
Then, also removing a scalar ($x) at each operation, with

if (not /^>/) { for $i (0..length($_)) { ++$kv{ substr($_,$i,1) } } }

speeds this up to 5.3 seconds.
Further reducing variable use, by copying $_ and thus freeing up the loop to use $_,

if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }

only helps a little, running at 5.2 seconds.
This compares with the awk solution, given as kvantour 2 in the nice comparisons in kvantour's answer, at 6.5 seconds (on this system).
Of course, none of this can be compared to the optimized bioawk (C code?) program. For that we'd need to write this in C, which is not very hard using Inline::C.
Note that removing a sub call (to substr) for every character, by using

if (not /^>/) { ++$kv{$_} for split //; }

results in "only" a 6.4-second average, not as good as the above tweaks; this was a surprise.
These times are on a desktop with v5.16. On v5.24, on the same machine, the best case (substr with no extra variables in the loop) runs at 4.8 seconds, while the one without substr (but with split) takes 5.8 seconds. It's nice to see that newer versions of Perl perform better, at least in these cases.
For reference and easy timing by others, here is the complete code for the best run:
time perl -lne'
if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
Not sure how much faster this would be, but if you try it, please post your timings:
$ awk '!/^>/ {n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++}
END {for(k in c) print k,c[k]}' file | sort
A 6
C 9
D 10
E 1
F 7
G 18
H 5
I 4
K 7
L 21
N 15
P 7
Q 6
R 11
S 16
T 18
V 7
W 8
Y 7
This reports counts for the whole file, not line by line. Note that not all awk implementations support splitting on the empty string.
Here are the timings of the three approaches:
$ time grep -v ">" filey | fold -w1 | sort | uniq -c >/dev/null
real 0m11.470s
user 0m11.746s
sys 0m0.260s
$ time awk '{n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m7.441s
user 0m7.334s
sys 0m0.060s
$ time awk '{n=length($0); for(i=1;i<=n;i++) c[substr($0,i,1)]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m5.055s
user 0m4.979s
sys 0m0.047s
for the test file
$ wc filey
118098 649539 16828965 filey
It surprised me that substr is faster than split, perhaps due to the array allocation.
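One way to probe that surprise is a small, self-contained micro-benchmark. This is only a sketch: the file name bench.txt and the 100000-line synthetic input are arbitrary choices for illustration, and absolute times will vary by machine and awk implementation.

```shell
# generate a throwaway input file of identical protein-alphabet lines
awk 'BEGIN { for (i = 0; i < 100000; i++) print "ACDEFGHIKLMNPQRSTVWY" }' > bench.txt

# split-based counting (requires an awk that supports empty-string split)
time awk '{ n = split($0, a, ""); for (i = 1; i <= n; i++) c[a[i]]++ }
          END { for (k in c) print k, c[k] }' bench.txt | sort > counts_split.txt

# substr-based counting
time awk '{ n = length($0); for (i = 1; i <= n; i++) c[substr($0, i, 1)]++ }
          END { for (k in c) print k, c[k] }' bench.txt | sort > counts_substr.txt

# whatever the timings say, both approaches must agree on the counts
cmp counts_split.txt counts_substr.txt && echo "counts match"
```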