I'm making an intrinsics-optimized matrix wrapper in Java (with the help of JNI). I'm looking for some validation of this design; can you give some hints about matrix optimizations? What I'm going to implement is:
A matrix can be represented as four sets of buffers/arrays: one for horizontal access, one for vertical access, one for diagonal access, and a command buffer used to compute the elements of the matrix only when needed. Here is an illustration.
Matrix contents:
0 1 2 3
4 5 6 7
8 9 1 3
3 5 2 9
First (horizontal) set:
horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}
Second (vertical) set:
verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}
Third (optional) set, the diagonal:
diagS={0,5,1,9} //just in case some calculation needs this
Fourth set (a calculation list, in a "one operation, one operand" fashion):
calc = {0,2, 1,3, 2,5} ---> 0 means multiply by the next element
1 means add the next element
2 means divide by the next element
so this list means
( (a[i]*2) + 3 ) / 5, evaluated only when a[i] is actually needed.
Example for the fourth set:
A.mult(2), A.sum(3), A.div(5), A.mult(B)
(to list) (to list) (to list) (calculate *, +, / just in time, when A is needed)
so there is only one memory access for four operations:
loop start
a[i] = b[i] * ( ( (a[i]*2) + 3 ) / 5 )   <-- only for A.mult(B)
loop end
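A minimal sketch of how this command buffer could look in Java, assuming hypothetical names (LazyMatrix, elem, multInPlace) and the opcode encoding above (0 = multiply, 1 = add, 2 = divide):

    // Hypothetical sketch of the deferred command buffer (names are illustrative).
    // calc holds (opcode, operand) pairs: 0 = multiply, 1 = add, 2 = divide.
    final class LazyMatrix {
        final double[] data;   // one of the row/column sets, flattened
        final double[] calc;   // pending scalar ops, (opcode, operand) pairs
        int calcLen;           // number of valid entries in calc

        LazyMatrix(double[] data, int maxOps) {
            this.data = data;
            this.calc = new double[2 * maxOps];
        }

        void mult(double s) { calc[calcLen++] = 0; calc[calcLen++] = s; }
        void sum (double s) { calc[calcLen++] = 1; calc[calcLen++] = s; }
        void div (double s) { calc[calcLen++] = 2; calc[calcLen++] = s; }

        // Applies all pending scalar ops to one element just in time.
        double elem(int i) {
            double v = data[i];
            for (int k = 0; k < calcLen; k += 2) {
                switch ((int) calc[k]) {
                    case 0: v *= calc[k + 1]; break;
                    case 1: v += calc[k + 1]; break;
                    case 2: v /= calc[k + 1]; break;
                }
            }
            return v;
        }

        // Element-wise A.mult(B): one pass over memory, pending ops fused in.
        void multInPlace(LazyMatrix b) {
            for (int i = 0; i < data.length; i++) {
                data[i] = b.elem(i) * elem(i);
            }
            calcLen = 0; // pending ops are now materialized in data
        }
    }

All pending scalar ops are fused into the single pass of multInPlace, so the four operations cost one read and one write per element.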
As seen above, when one needs to access column elements, the second set provides contiguous accesses with no strided jumps. The same is achieved with the first set for horizontal access.
This should make some things easier and some things harder:
Easier:
** Matrix transpose operation.
Just swapping the pointers horSet[x] and verSet[x] is enough.
** Matrix * matrix multiplication.
One matrix provides one of its horizontal sets and the other matrix provides a vertical set.
The dot product of these should be highly parallelizable with intrinsics/multithreading.
If the multiplication order is reversed, the horizontal and vertical sets are simply switched (see the sketch after this list).
** Matrix * vector multiplication.
Same as above; a vector can freely be treated as horizontal or vertical.
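A rough plain-Java sketch of the two points above (pointer-swap transpose and row-dot-column multiplication); DualMatrix and its fields are illustrative names, and the inner dot-product loop is where the AVX kernel would slot in:

    // Sketch: transpose as a pointer swap, and C = A * B as row-dot-column.
    // Assumes horSet/verSet are both valid (in sync) when multiply() is called.
    final class DualMatrix {
        double[][] horSet; // horSet[r] = row r
        double[][] verSet; // verSet[c] = column c
        final int n;

        DualMatrix(int n) {
            this.n = n;
            horSet = new double[n][n];
            verSet = new double[n][n];
        }

        // Transpose: swap the two set references, no data moved.
        void transpose() {
            double[][] t = horSet;
            horSet = verSet;
            verSet = t;
        }

        // C = this * other. Each dot product reads two contiguous arrays,
        // which is exactly the access pattern an AVX kernel would vectorize.
        DualMatrix multiply(DualMatrix other) {
            DualMatrix c = new DualMatrix(n);
            for (int r = 0; r < n; r++) {
                double[] row = this.horSet[r];
                for (int col = 0; col < n; col++) {
                    double[] colArr = other.verSet[col];
                    double acc = 0;
                    for (int k = 0; k < n; k++) acc += row[k] * colArr[k];
                    c.horSet[r][col] = acc; // fills C's horizontal set
                    c.verSet[col][r] = acc; // ...and its vertical set
                }
            }
            return c;
        }
    }

Filling both horSet and verSet of the result during the multiply avoids the vertical-->horizontal update described under "Harder" below.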
Harder:
** Doubling the memory requirement is bad in many cases.
** Initializing a matrix takes longer.
** When a matrix is multiplied from the left, the vertical sets need to be propagated to the horizontal sets if it is going to be multiplied from the right afterwards (and the same for the opposite case; a sync sketch follows this list).
(If a transposition is taken in between, this update is not needed.)
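The update itself is just a scatter from one set into the other; a minimal sketch, assuming the same horSet/verSet layout as above:

    // Sketch: rebuild the horizontal set from the vertical set after a
    // left-multiplication that only updated verSet.
    static void syncVerticalToHorizontal(double[][] horSet, double[][] verSet, int n) {
        for (int c = 0; c < n; c++) {
            double[] col = verSet[c];      // contiguous read
            for (int r = 0; r < n; r++) {
                horSet[r][c] = col[r];     // strided write across rows
            }
        }
    }

The strided writes make this an O(n^2) pass that the dual-set design pays whenever the multiplication side changes.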
Neutral:
** The same matrix can be multiplied with two other matrices to get two different
results, e.g. A=A*B (saved in the horizontal sets) and A=C*A (saved in the vertical
sets); then A=A*A yields A*B*C*A (in the horizontal sets) and C*A*A*B (in the
vertical sets) without copying A.
** If a matrix is always multiplied from the left, or always from the right, no
access or multiplication ever needs an update, and everything stays contiguous in RAM.
** Using only horizontals before a transposition and only verticals after it
should not break any rules.
The main purpose is having a matrix whose dimensions are multiples of 8 and applying AVX intrinsics with multiple threads (each thread works on one set concurrently).
So far I have only achieved the vector * vector dot product. I will go further into this if you masters of programming give me a direction.
The dot product I wrote (with intrinsics) is 6x faster than a loop-unrolled version (which is itself twice as fast as multiplying element by element), and it hits the memory bandwidth cap when multithreading is enabled in the wrapper (8 threads use nearly 20 GB/s, which is close to my DDR3's limit). I already tried OpenCL; it is rather slow on the CPU but great on the GPU.
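For what it's worth, here is a sketch of how the Java side of the threaded multiply might look; nativeDot is a hypothetical JNI binding standing in for the AVX dot-product kernel described above:

    import java.util.concurrent.*;

    // Sketch: split C = A * B across threads by row blocks. nativeDot is a
    // hypothetical JNI binding to an AVX dot-product kernel (assumption).
    final class ParallelMultiply {
        static native double nativeDot(double[] a, double[] b); // assumed JNI kernel

        static void multiply(double[][] aRows, double[][] bCols, double[][] cRows,
                             int threads) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            int n = aRows.length;
            int chunk = (n + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                final int lo = t * chunk, hi = Math.min(n, lo + chunk);
                pool.execute(() -> {
                    for (int r = lo; r < hi; r++)
                        for (int c = 0; c < bCols.length; c++)
                            cRows[r][c] = nativeDot(aRows[r], bCols[c]);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

Note that one JNI call per dot product adds measurable overhead; handing each native call a whole block of rows would amortize it.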
Thank you.
Edit: How would a "block matrix" buffer perform? When multiplying big matrices, small patches (tiles) are multiplied in a special way so that the cache reduces main-memory accesses. But this would need even more updates between matrix multiplications, across the vertical, horizontal, diagonal, and block representations.
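For reference, a minimal sketch of a cache-blocked multiply over a flat row-major buffer (BLOCK is an illustrative tile size; c is assumed zero-initialized):

    // Sketch: cache-blocked (tiled) multiply over flat row-major arrays.
    // BLOCK would be tuned so three BLOCKxBLOCK tiles fit in L1/L2 cache.
    static void blockedMultiply(double[] a, double[] b, double[] c, int n) {
        final int BLOCK = 64; // illustrative tile size
        for (int i0 = 0; i0 < n; i0 += BLOCK)
            for (int k0 = 0; k0 < n; k0 += BLOCK)
                for (int j0 = 0; j0 < n; j0 += BLOCK)
                    for (int i = i0; i < Math.min(i0 + BLOCK, n); i++)
                        for (int k = k0; k < Math.min(k0 + BLOCK, n); k++) {
                            double aik = a[i * n + k];
                            for (int j = j0; j < Math.min(j0 + BLOCK, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }

Each tile of b and c is reused many times while it is cache-hot, which is where the main-memory traffic reduction comes from.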
Several libraries use Expression Templates to enable the application of very specific, optimized functions for a cascade of matrix operations.
The C++ Programming Language also has a short chapter about "Fused Operations" (§29.5.4, 4th edition).
This enables concatenation of statements à la D = A*B + C, with the actual work deferred until the assignment.
You'll want to have 3 classes in this case: typically the matrix itself, a lightweight proxy type representing the product, and one representing the sum; assigning a proxy to a matrix triggers the fused evaluation (see the sketch below).
You can advance this concept to have a specialized function for every computation that is critical by any measure.
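Expression templates proper are a C++ compile-time technique; a rough Java analogue of the same idea, with illustrative names, builds cheap proxy objects and fuses the work at assignment:

    // Sketch: a Java analogue of expression templates (names are illustrative).
    // A*B+C builds a tree of cheap proxies; assign() runs one fused loop.
    interface MatExpr {
        double at(int r, int c);
    }

    final class Mat implements MatExpr {
        final double[][] m;
        Mat(double[][] m) { this.m = m; }
        public double at(int r, int c) { return m[r][c]; }

        // Assignment point: only here is any arithmetic actually performed.
        void assign(MatExpr e) {
            for (int r = 0; r < m.length; r++)
                for (int c = 0; c < m[r].length; c++)
                    m[r][c] = e.at(r, c);
        }
    }

    final class MatSum implements MatExpr {
        final MatExpr a, b;
        MatSum(MatExpr a, MatExpr b) { this.a = a; this.b = b; }
        public double at(int r, int c) { return a.at(r, c) + b.at(r, c); }
    }

    final class MatProd implements MatExpr {
        final Mat a, b; final int n;
        MatProd(Mat a, Mat b) { this.a = a; this.b = b; this.n = a.m[0].length; }
        public double at(int r, int c) {
            double s = 0;
            for (int k = 0; k < n; k++) s += a.at(r, k) * b.at(k, c);
            return s;
        }
    }

Usage: d.assign(new MatSum(new MatProd(a, b), c)); // D = A*B + C in one fused pass, with no temporary for A*B. In C++ the proxies are resolved at compile time with no virtual-call overhead; this sketch only illustrates the structure.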
This is effectively equivalent to caching the transposition. It sounds like you intend to do this eagerly; I'd compute the transposition only when it is needed and remember it in case it is needed again. That way, if you never need it, it never gets computed.
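A minimal sketch of that lazy caching, with illustrative names:

    // Sketch: compute the transposed (vertical) set lazily and cache it.
    final class LazyTransposeMatrix {
        final double[][] rows;
        double[][] colsCache; // null until the first column access

        LazyTransposeMatrix(double[][] rows) { this.rows = rows; }

        double[] column(int c) {
            if (colsCache == null) {       // first request: build and remember
                int n = rows.length, m = rows[0].length;
                colsCache = new double[m][n];
                for (int r = 0; r < n; r++)
                    for (int j = 0; j < m; j++)
                        colsCache[j][r] = rows[r][j];
            }
            return colsCache[c];
        }

        void set(int r, int c, double v) {
            rows[r][c] = v;
            colsCache = null; // any write invalidates the cached transposition
        }
    }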