What is the fastest way to transpose a matrix in C

2019-01-01 10:06发布

站内文章 / C++

50 0

还给你的自由

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is

a b c d e f
g h i j k l
m n o p q r

I want the result be as follows:

a g m
b h n
c I o
d j p
e k q
f l r

What is the fastest way to do this?

回答1:

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let\'s see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that\'s much more complicated.

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don\'t know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

Smear image horizontally
transpose output 
Smear output horizontally
transpose output

Here is a paper by Intel explaining that http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008

Edit:

I found an even faster solution using SSE intrinsics:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

回答2:

This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.

回答3:

Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It\'s helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.

_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

v  = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v  = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);

This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances. Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.

I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.

I used this same technique for a 8x8 float transpose (see towards the end of that answer). https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks but I manged to convert the 8 shuffles into 4 shuffles and 8 blends.

For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512) so it can only be implemented with unpacks which I don\'t think can be convert to blends (efficiently). With AVX512 vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats so this same technique might be possibly with vshufi32x4 in some cases. With Knights Landing shuffles are four times slower (throughput) than blends.

回答4:

template <class T>
void transpose( std::vector< std::vector<T> > a,
std::vector< std::vector<T> > b,
int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
}

回答5:

Consider each row as a column, and each column as a row .. use j,i instead of i,j

demo: http://ideone.com/lvsxKZ

#include <iostream> 
using namespace std;

int main ()
{
    char A [3][3] =
    {
        { \'a\', \'b\', \'c\' },
        { \'d\', \'e\', \'f\' },
        { \'g\', \'h\', \'i\' }
    };

    cout << \"A = \" << endl << endl;

    // print matrix A
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[i][j];
        cout << endl;
    }

    cout << endl << \"A transpose = \" << endl << endl;

    // print A transpose
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[j][i];
        cout << endl;
    }

    return 0;
}

回答6:

transposing without any overhead (class not complete):

class Matrix{
   double *data; //suppose this will point to data
   double _get1(int i, int j){return data[i*M+j];} //used to access normally
   double _get2(int i, int j){return data[j*N+i];} //used when transposed

   public:
   int M, N; //dimensions
   double (*get_p)(int, int); //functor to access elements  
   Matrix(int _M,int _N):M(_M), N(_N){
     //allocate data
     get_p=&Matrix::_get1; // initialised with normal access 
     }

   double get(int i, int j){
     //there should be a way to directly use get_p to call. but i think even this
     //doesnt incur overhead because it is inline and the compiler should be intelligent
     //enough to remove the extra call
     return (this->*get_p)(i,j);
    }
   void transpose(){ //twice transpose gives the original
     if(get_p==&Matrix::get1) get_p=&Matrix::_get2;
     else get_p==&Matrix::_get1; 
     swap(M,N);
     }
}

can be used like this:

Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)

of course I didn\'t bother with the memory management here, which is crucial but different topic.

回答7:

I think that most fast way should not taking higher than O(n^2) also in this way you can use just O(1) space :
the way to do that is to swap in pairs because when you transpose a matrix then what you do is: M[i][j]=M[j][i] , so store M[i][j] in temp, then M[i][j]=M[j][i],and the last step : M[j][i]=temp. this could be done by one pass so it should take O(n^2)

回答8:

my answer is transposed of 3x3 matrix

 #include<iostream.h>

#include<math.h>


main()
{
int a[3][3];
int b[3];
cout<<\"You must give us an array 3x3 and then we will give you Transposed it \"<<endl;
for(int i=0;i<3;i++)
{
    for(int j=0;j<3;j++)
{
cout<<\"Enter a[\"<<i<<\"][\"<<j<<\"]: \";

cin>>a[i][j];

}

}
cout<<\"Matrix you entered is :\"<<endl;

 for (int e = 0 ; e < 3 ; e++ )

{
    for ( int f = 0 ; f < 3 ; f++ )

        cout << a[e][f] << \"\\t\";


    cout << endl;

    }

 cout<<\"\\nTransposed of matrix you entered is :\"<<endl;
 for (int c = 0 ; c < 3 ; c++ )
{
    for ( int d = 0 ; d < 3 ; d++ )
        cout << a[d][c] << \"\\t\";

    cout << endl;
    }

return 0;
}

标签： c++ algorithm matrix transpose

还给你的自由

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~

What is the fastest way to transpose a matrix in C

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

回答6:

回答7:

回答8:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮