I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
I want the result be as follows:
a g m
b h n
c I o
d j p
e k q
f l r
What is the fastest way to do this?
I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
I want the result be as follows:
a g m
b h n
c I o
d j p
e k q
f l r
What is the fastest way to do this?
I think that most fast way should not taking higher than O(n^2) also in this way you can use just O(1) space :
the way to do that is to swap in pairs because when you transpose a matrix then what you do is: M[i][j]=M[j][i] , so store M[i][j] in temp, then M[i][j]=M[j][i],and the last step : M[j][i]=temp. this could be done by one pass so it should take O(n^2)
transposing without any overhead (class not complete):
can be used like this:
of course I didn't bother with the memory management here, which is crucial but different topic.
This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.
Consider each row as a column, and each column as a row .. use j,i instead of i,j
demo: http://ideone.com/lvsxKZ
This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.
First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large
n
). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.
For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is
Here is a paper by Intel explaining that http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use
EDIT:
I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with
block_size=16
(Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).The values
lda
andldb
are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like thisFor 3000x1001 this returns
ldb = 3008
andlda = 1008
Edit:
I found an even faster solution using SSE intrinsics: