I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answser this question using the following Erlang code:
-module(golf).
-export([test/0]).
line([],{Sequences,Total}) -> {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.
scanLines(S,Sequences,Total)->
case io:get_line(S,'') of
eof -> {Sequences,Total};
{error,_} ->{Sequences,Total};
Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
end .
test()->
{Sequences,Total}=scanLines(standard_io,0,0),
io:format("~p\n",[Total/(1.0*Sequences)]),
halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
this code seems to work fine for a small fasta file but it takes hours to parse a larger one (>100Mo). Why ? I'm an Erlang newbie, can you please improve this code ?
The call
string:len(string:strip(L))
traverses the list at least twice (I'm unaware of the string:strip implementation). Instead you could write a simple function to count the line length w/0 the spaces:The same method can be applied to binaries as well.
I too am learning Erlang, thanks for the fun question.
I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.
Also, if you don't mind working with a file directly rather than
standard_io
, perhaps you could speed things along by usingfile:open(..., [raw, read_ahead])
.raw
means the file must be on the local node's filesystem, andread_ahead
specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)I'd expect the
read_ahead
to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".EDIT
Using
file:open("uniprot_sprot.fasta", [read, read_ahead])
gets1m31s
on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)Using
file:open(.., [read, read_ahead])
andfile:read_line(S)
, I get0m34s
.Using
file:open(.., [read, read_ahead, raw])
andfile:read_line(S)
, I get0m9s
. Yes, nine seconds.Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:
Did you try Elixir (elixir-lang.org) which is runs on top of Erlang and has a syntax similar to Ruby. Elixir solves String problems in the following way:
Just wonder whether Elixir would be faster?
It looks like your big performance problems have been solved by opening the file in raw mode, but here's some more thoughts if you need to optimise that code further.
Learn and use fprof.
You're using
string:strip/1
primarily to remove the trailing newline. As erlang values are immutable you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count, otherwise I'd try writing a length function the counts the number of relevant characters and ignores irrelevant ones.I'm wary of advice that says binaries are better than lists, but given how little processing you it's probably the case here. The first steps are to open the file in binary mode and using
erlang:size/1
to find the length.It won't affect performance (significantly), but the multiplication by 1.0 in
Total/(1.0*Sequences)
is only necessary in languages with broken division. Erlang division works correctly.If you need really fast IO then you have to do little bit more trickery than usual.
It is fastest IO as I know but note
-noshell -noinput
. Compile just likeerlc +native +"{hipe, [o3]}" g.erl
but with-smp disable
and run:
With
-smp enable
but native it takes:Byte code but with
-smp disable
(almost in par with native because most of work is done in port!):Just for completeness byte code with smp:
For comparison sarnold version gives me wrong answer and takes more on same HW:
EDIT: I have looked at characteristics of
uniprot_sprot.fasta
and I'm little bit surprised. It is 3824397 rows and 232MB. It means that-smp disabled
version can handle 1.18 million text lines per second (71MB/s in line oriented IO).