This is a follow-up to my previous question:
Faster way to do multi dimensional matrix addition?
After following the advice of @Peter Cordes, I vectorized my code and the speed went up by 50x. Then I ran gprof again and found that this function is taking most of the time:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 69.97      1.53     1.53                             cal_score(int, std::string, int const*, int, double)
```
```cpp
double cal_score(int l, string seq, const int *__restrict__ pw, int cluster, double alpha)
{
    const int cols = 4;
    const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
    double score = 0;
    char s;
    string alphabet = "ACGT";
    int count = 0;

    for (int k = 0; k < cols; k++)
        count = count + pwcluster[k];

    for (int i = 0; i < l; i++) {
        long row_offset = cols*i;
        s = seq[i];
        //#pragma omp simd
        for (int k = 0; k < cols; k++) {
            if (s == alphabet[k])
                score = score + log((pwcluster[row_offset+k] + alpha) / (count + 4*alpha));
        }
    }
    return score;
}
```
This is my first time doing code optimization, so I'm not sure how to proceed. Is there a better way to write this function so I can get some more speed? The input `seq` is a sequence over the characters 'ACGT' of length `l`. `pw` is a one-dimensional array of size `2*l*4`, i.e. `[p][q][r]` flattened, and `cluster` is the index `p`.
Here's another way to rewrite it. This translates the string with a lookup table instead of a search, and calls `log()` fewer times by a factor of 10. It also changes `seq` to a `const char*` passed by reference, instead of a `std::string` passed by value (that would copy the whole string). This compiles to fairly good code, but without `-ffast-math` the division can't be replaced by a multiplication. It doesn't auto-vectorize, because we only load one of every four elements of `pwcluster`.
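Something along these lines is a hedged reconstruction of that approach (the table-building style and the batching factor of 10 are illustrative, not the answer's exact code):

```cpp
#include <cmath>

// 256-entry table mapping ASCII 'A','C','G','T' to 0..3, so the inner loop
// does one indexed load instead of searching an "ACGT" alphabet string.
static unsigned char base_lut[256];
static bool lut_ready = [] {
    base_lut['A'] = 0; base_lut['C'] = 1; base_lut['G'] = 2; base_lut['T'] = 3;
    return true;
}();

double cal_score_lut(int l, const char *seq, const int *__restrict__ pw,
                     int cluster, double alpha)
{
    const int cols = 4;
    const int *__restrict__ pwcluster = pw + (long)cluster * l * cols;

    int count = 0;
    for (int k = 0; k < cols; k++)
        count += pwcluster[k];
    const double denom = count + 4 * alpha;

    double score = 0.0, prod = 1.0;
    for (int i = 0; i < l; i++) {
        unsigned char base = base_lut[(unsigned char)seq[i]];
        // Accumulate a product of ratios; sum of logs == log of product.
        prod *= (pwcluster[cols*i + base] + alpha) / denom;
        if (i % 10 == 9) {              // one log() call per 10 elements
            score += std::log(prod);
            prod = 1.0;
        }
    }
    return score + std::log(prod);      // fold in the leftover partial product
}
```

Batching the `log()` calls is safe here because each factor is a probability-like ratio below 1, so ten of them multiplied together stay comfortably inside `double` range.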
.I made some more improvements to Mike's good idea and code.
I also made a vectorized version (requiring SSE4.1). It's more likely to have bugs, but is worth trying because you should get a significant speedup from doing packed multiplies. Porting it to AVX should give another big speedup.
See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).
My changes:
- Don't translate ahead of time. The translation should overlap well with the FP loop's work, instead of forcing the FP work to wait for a tiny translation loop to finish before it can start.
- Simplify the counter variables (gcc makes better code: it was actually keeping `j` in a register, rather than optimizing it away. Or else it was fully unrolling the inner loop into a giant loop.)
- Pull the scaling by `(count + 4*alpha)` out of the loop entirely: instead of dividing by it (or multiplying by the inverse), subtract the logarithm. Since log() grows extremely slowly, we can probably defer this indefinitely without losing too much precision in the final `score`. An alternative would be to only subtract every N iterations, but then the loop would have to figure out whether it terminated early. At the very least, we could multiply by `1.0 / (count + 4*alpha)` instead of dividing. Without `-ffast-math`, the compiler can't do this for you.
- Have the caller calculate `pwcluster` for us: it probably calculates it for its own use anyway, and we can remove one of the function args (`cluster`).
- `row_offset` was making slightly worse code compared to just writing `i*cols`. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop incrementing `pwcluster` directly.
- Rename `l` to `len`: single-letter variable names are bad style except in very small scopes (like a loop, or a very small function that only does one thing), and even then only if there isn't a good short but meaningful name. e.g. `p` is no more meaningful than `ptr`, but `len` does tell you about what it means, not just what it is.

Further observations:
- Storing sequences in translated format throughout your program would be better for this and any other code that wants to use the DNA base as an array index or counter.
- You can also vectorize translating nucleotide numbers (0..3) to/from ASCII using SSSE3 `pshufb`. (See my code on godbolt.)
- Storing your matrix in `float` instead of `int` might possibly be better. Since your code spends most of its time in this function now, it would run faster if it didn't have to keep converting from int to float. On Haswell, `cvtss2sd` (single->double) apparently has better throughput than `cvtsi2sd` (int->double), but not on Skylake. (ss2sd is slower on SKL than HSW.)
- Storing your matrix in `double` format might be faster, but the doubled cache footprint might be a killer. Doing this calculation with `float` instead of `double` would avoid the conversion cost, too. But you could defer `log()` for more iterations with `double`.
- Using multiple `product` variables (`p1`, `p2`, etc.) in a manually unrolled inner loop would expose more parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators.)
- For Skylake or maybe Broadwell, you could vectorize with `VPGATHERDD`. The vectorized translation from ASCII to 0..3 will be helpful here.
- Even without using a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. The packed conversion instructions are faster than the scalar ones. We have a lot of multiplies to do, and can certainly take advantage of doing two or four at once with SIMD vectors. See below.
Simple version with my improvements:
See the full code on godbolt, link at the top of this answer.
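As a stand-in for the godbolt listing, here is a sketch of what that simple version could look like, applying the changes listed above (the function and table names are mine, not the answer's; the caller passes `pwcluster = pw + cluster*len*4` directly):

```cpp
#include <array>
#include <cmath>

// Sketch of the improved scalar version: LUT translation on the fly,
// pointer increment instead of i*cols, and the (count + 4*alpha) scaling
// pulled out of the loop as a single subtracted logarithm.
double cal_score_simple(int len, const char *seq,
                        const int *__restrict__ pwcluster, double alpha)
{
    // ASCII -> 0..3 translation table, built once.
    static const std::array<unsigned char, 256> lut = [] {
        std::array<unsigned char, 256> t{};
        t['A'] = 0; t['C'] = 1; t['G'] = 2; t['T'] = 3;
        return t;
    }();

    int count = 0;
    for (int k = 0; k < 4; k++)
        count += pwcluster[k];

    double score = 0;
    for (int i = 0; i < len; i++) {
        score += std::log(pwcluster[lut[(unsigned char)seq[i]]] + alpha);
        pwcluster += 4;   // pointer increment instead of i*cols indexing
    }
    // Dividing every term by (count + 4*alpha) inside log() is the same as
    // subtracting len * log(count + 4*alpha) once at the end.
    return score - len * std::log(count + 4*alpha);
}
```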
This compiles to quite good asm, with a pretty compact inner loop.

Using a pointer increment instead of indexing with `i*cols` removes one `add` from the loop, getting it down to 10 fused-domain uops (vs. 11 in this loop). So it won't matter for frontend throughput from the loop buffer, but it will be fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the immediate bottleneck.

Manually vectorized SSE version:
Not tested, and not that carefully written. I could easily have made some mistakes here. If you're running this on computers with AVX, you should definitely make an AVX version of this. Use `vextractf128` as a first step in a horizontal product or sum, then the same as what I have here.

With a vectorized `log()` function to compute two (or four with AVX) `log()` results in parallel in a vector, you could just do a horizontal sum at the end, instead of more frequent horizontal products before each scalar `log()`. I'm sure someone's written one, but I'm not going to take the time to search for it right now.

Vectorized ASCII->integer DNA base
Ideally, do the translation once when you read in sequences, and internally store them in 0/1/2/3 arrays instead of A/C/G/T ASCII strings.
It can be manually vectorized with pshufb, if we don't have to check for errors (invalid characters). In Mike's code, where we translate the whole input ahead of the FP loop, this can give a big speedup to that part of the code.
For translating on the fly, we could use a vector to:
Since gcc seems to fully unroll the vector loop, this would replace 16 `movzx` instructions with 6 vector instructions (including the load and store).
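A sketch of the `pshufb` translation itself (assumed details, not the godbolt listing): `pshufb` indexes its table with the low 4 bits of each byte (and zeroes the result if the high bit is set), so 'A'=0x41, 'C'=0x43, 'G'=0x47, 'T'=0x54 can be translated directly, without even masking off the high nibble first.

```cpp
#include <immintrin.h>  // SSSE3 for _mm_shuffle_epi8 (pshufb)

// Translate 16 ASCII nucleotides to 0..3 in one shot.  Table indexed by the
// low nibble: 1->0 ('A'), 3->1 ('C'), 7->2 ('G'), 4->3 ('T'); every other
// slot holds 0xFF as an "invalid character" marker (no error checking).
__attribute__((target("ssse3")))
void ascii_to_base16(const char *in, unsigned char *out)
{
    const __m128i lut = _mm_setr_epi8(-1, 0, -1, 1, 3, -1, -1, 2,
                                      -1, -1, -1, -1, -1, -1, -1, -1);
    __m128i ascii = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(lut, ascii));
}
```

The per-function `target("ssse3")` attribute (gcc/clang) lets this compile without passing `-mssse3` for the whole file.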