I would like to compute how many FLOPs each layer of LeNet-5 (paper) needs. Some papers give the total FLOPs for other architectures (1, 2, 3). However, those papers don't give details on how to compute the number of FLOPs, and I have no idea how many FLOPs are necessary for the non-linear activation functions. For example, how many FLOPs are necessary to calculate tanh(x)?
I guess this will be implementation- and probably also hardware-specific. However, I am mainly interested in getting an order of magnitude. Are we talking about 10 FLOPs? 100 FLOPs? 1000 FLOPs? So choose any architecture / implementation you want for your answer. (Although I'd appreciate answers which are close to "common" setups, like an Intel i5 / Nvidia GPU / TensorFlow.)
Note: This answer is not Python-specific, but I don't think that something like tanh is fundamentally different across languages.
Tanh is usually implemented by defining an upper and a lower bound, for which 1 and -1 are returned, respectively. The intermediate part is approximated with different functions as follows: there are polynomial approximations that are accurate up to single precision floating point, and others for double precision. This approach is called the Cody-Waite algorithm.
Citing this description (you can find more information about the mathematics there as well, e.g. how to determine x_medium), Cody and Waite’s rational form requires four multiplications, three additions, and one division in single precision, and seven multiplications, six additions, and one division in double precision.
For negative x, you can compute |x| and flip the sign. So you need comparisons to determine which interval x is in, and then evaluate the corresponding approximation. That's a total of roughly 8 FLOPs (4 multiplications, 3 additions, 1 division) plus a few comparisons in single precision, and roughly 14 FLOPs plus comparisons in double precision.
Now, this is a report from 1993, but I don't think much has changed here.
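To make that structure concrete, here is a minimal C++ sketch of a clamped rational approximation. The coefficients are just the simple Padé [3/2] approximant x(15 + x^2)/(15 + 6x^2), not Cody and Waite's actual ones, and a real implementation splits the intermediate range into several intervals, each with its own polynomial; the point is only the operation count (a few multiplications and additions plus one division, plus the comparisons):

```cpp
#include <cmath>

// Illustrative sketch only: the general shape of a clamped rational
// approximation of tanh(x). The coefficients are the simple Pade [3/2]
// approximant tanh(x) ~ x*(15 + x^2) / (15 + 6*x^2), NOT Cody and Waite's
// actual coefficients, and the single intermediate interval is a crude
// simplification. The point is the operation count, not the accuracy.
double tanh_rational_sketch(double x) {
    const double saturation = 20.0;      // assumed cut-off: beyond it, return +/-1
    double ax = std::fabs(x);            // tanh(-x) = -tanh(x), so work with |x|
    if (ax >= saturation)
        return x > 0 ? 1.0 : -1.0;       // saturated region: no arithmetic at all
    double x2  = ax * ax;                // 1 multiplication
    double num = ax * (15.0 + x2);       // 1 addition, 1 multiplication
    double den = 15.0 + 6.0 * x2;        // 1 addition, 1 multiplication
    double t   = num / den;              // 1 division -> ~6 FLOPs plus comparisons
    return x < 0 ? -t : t;
}
```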
If we look at the glibc implementation of tanh(x), we see that:

- for x values greater than 22.0 (in double precision), tanh(x) can safely be assumed to be 1.0, so there are almost no costs;
- for very small x (let's say x < 2^(-55)), another cheap approximation is possible: tanh(x) = x(1+x), so only two floating point operations are needed;
- for the values in between, one rewrites tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)). However, one must be careful, because 1 - exp(t) is very problematic for small t-values due to loss of significance, so one uses expm1(x) = exp(x) - 1 and calculates tanh(x) = -expm1(-2x) / (expm1(-2x) + 2).

So basically, the worst case is about 2 times the number of operations needed for expm1, which is a pretty complicated function. The best way is probably just to measure the time needed to calculate tanh(x) and compare it with the time needed for a simple multiplication of two doubles.
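For illustration, here is a rough C++ sketch of that branch structure (not the actual glibc source; the thresholds 22.0 and 2^-55 are the ones mentioned above):

```cpp
#include <cmath>

// Rough sketch of the branch structure described above (not the actual glibc code).
double tanh_glibc_style_sketch(double x) {
    const double tiny = std::ldexp(1.0, -55);   // 2^-55
    double ax = std::fabs(x);
    if (ax > 22.0)                  // saturation: 1.0 is exact to double precision
        return x > 0 ? 1.0 : -1.0;
    if (ax < tiny)                  // tiny arguments: two floating point operations
        return x * (1.0 + x);
    // General case: tanh(x) = -expm1(-2x) / (expm1(-2x) + 2),
    // which avoids the cancellation in 1 - exp(-2x).
    double e = std::expm1(-2.0 * x);
    return -e / (e + 2.0);
}
```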
My (sloppy) experiments on an Intel processor yielded the following result, which gives a rough idea: for very small numbers and for numbers > 22 there are almost no costs, for numbers up to about 0.1 we pay roughly 6 FLOPs, and then the costs rise to about 20 FLOPs per tanh-calculation.

The key takeaway: the costs of calculating tanh(x) depend on the parameter x, and the maximal costs are somewhere between 10 and 100 FLOPs.

There is an Intel instruction called F2XM1 which computes 2^x - 1 for -1.0 < x < 1.0, which could be used for computing tanh, at least for some range. However, if Agner's tables are to be believed, this operation costs about 60 FLOPs.
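To see how that could work: with t = 2x * log2(e) one has e^(2x) - 1 = 2^t - 1, and tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) = (2^t - 1) / ((2^t - 1) + 2). A small C++ sketch of that route, with 2^t - 1 emulated by std::exp2 where an x87 implementation would use F2XM1 (the restriction t in (-1, 1) limits it to |x| < ln(2)/2, roughly 0.35):

```cpp
#include <cmath>

// Sketch of the F2XM1 route: tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) and
// e^(2x) - 1 = 2^t - 1 with t = 2*x*log2(e). Here 2^t - 1 is emulated with
// std::exp2; a real F2XM1 returns 2^t - 1 directly, which also avoids the
// cancellation that std::exp2(t) - 1.0 suffers from for tiny t.
double tanh_via_f2xm1_sketch(double x) {
    const double log2e = 1.4426950408889634;   // log2(e)
    double t = 2.0 * x * log2e;                // must stay in (-1, 1) for F2XM1
    double em1 = std::exp2(t) - 1.0;           // the quantity F2XM1 would compute
    return em1 / (em1 + 2.0);
}
```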
Another problem is vectorization: the normal glibc implementation is not vectorized, as far as I can see. So if your program uses vectorization and has to fall back to an unvectorized tanh implementation, it will slow the program down even more. For this, the Intel compiler has the MKL library, which vectorizes tanh among other functions. As you can see in the tables, the maximal costs are about 10 clocks per operation (the cost of a float operation is about 1 clock).
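A minimal sketch of the two variants, with the MKL call left as a comment because it assumes MKL headers and link flags are set up (vdTanh is MKL's vectorized double-precision tanh):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: a plain loop over std::tanh usually stays scalar unless the tool
// chain provides a vectorized math library; with MKL one would hand the whole
// array to the VML routine instead.
void tanh_array(const std::vector<double>& in, std::vector<double>& out) {
    out.resize(in.size());

    // Scalar version: one libm call per element.
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::tanh(in[i]);

    // MKL/VML alternative (assumes MKL is installed and linked):
    // #include <mkl.h>
    // vdTanh(static_cast<MKL_INT>(in.size()), in.data(), out.data());
}
```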
I guess there is a chance you could win some FLOPs by using the -ffast-math compiler option, which results in a faster but less precise program (that is an option for CUDA or C/C++; I'm not sure whether this can be done for Python/NumPy).

The data above was produced by C++ code compiled with g++ -std=c++11 -O2; its intent is not to give exact numbers, but a first impression of the costs.
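A minimal sketch of that kind of measurement, timing tanh(x) for a few magnitudes of x against a plain multiplication of two doubles (this is an assumed reconstruction, not necessarily the exact program used for the measurements above):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Compares the time of N tanh(x) evaluations with N plain double
// multiplications, for several magnitudes of x. Very rough: it measures
// latency chains, not peak throughput, but it gives a first impression.
int main() {
    const int N = 10000000;
    const double magnitudes[] = {1e-60, 1e-20, 1e-3, 0.1, 1.0, 5.0, 25.0};
    using clock = std::chrono::steady_clock;

    // Baseline: a chain of plain multiplications.
    auto t0 = clock::now();
    double prod = 1.0;
    for (int i = 0; i < N; ++i)
        prod *= 1.0000001;
    auto t1 = clock::now();
    double mul_time = std::chrono::duration<double>(t1 - t0).count();
    std::printf("baseline product: %g\n", prod);   // keeps the loop from being optimized away

    for (double m : magnitudes) {
        volatile double vx = m;                    // prevents hoisting tanh out of the loop
        auto t2 = clock::now();
        double sum = 0.0;
        for (int i = 0; i < N; ++i)
            sum += std::tanh(vx);
        auto t3 = clock::now();
        double tanh_time = std::chrono::duration<double>(t3 - t2).count();
        std::printf("x ~ %-8g : tanh costs about %5.1f multiplications (sum=%g)\n",
                    m, tanh_time / mul_time, sum);
    }
    return 0;
}
```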