Efficient computation of kronecker products in C

2019-02-16 17:24发布

问题:

I'm fairly new to C, not having much need to anything faster than python for most of my research. However, it turns out that recent work I've been doing required the computation of fairly large vectors/matrices, and there therefore a C+MPI solution might be in order.

Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker Product of selected pairs of these vectors, and then sum these kronecker products.

The question is, how to do this efficiently? Is there anything wrong with the following structure of code, using for loops, or obtain the effect?

The function kron described below passes vectors A and B of lengths vector_size, and computes their kronecker product, which it stores in C, a vector_size*vector_size matrix.

void kron(int *A, int *B, int *C, int vector_size) {

    int i,j;

    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] = A[i] * B[j];
        }
    }
    return;
}

This seems fine to me, and certainly (if I've not made some silly syntax error) produce the right result, but I have a sneaking suspicion that embedded for loops is not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.

I thank you for you patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.

回答1:

For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.



回答2:

Since your loop bodies are all completely independent, there is certainly a way to accelerate this. Easiest would be already to take advantage of several cores before thinking of MPI. OpenMP should do quite fine on this.

#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i][j] = A[i] * B[j];
    }
}

This is supported by many compilers nowadays.

You could also try to drag some common expressions out of the inner loop but decent compilers e.g gcc, icc or clang should do this quite well all by themselves:

#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int * vec = &C[i][0];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}

BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.



回答3:

If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:

void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);

The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.



回答4:

Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding them to some 'running total' vector/matrix C, the following modified version of the above kron function

void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {

    int i,j;

    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] = alpha * A[i] * B[j];
        }
    }
    return;
}

precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 0 and C being an uninitialised (zeroed) matrix/vector of the correct dimensionality.

It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!



回答5:

This is a common enough problem in numerical computational circles, that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).

You could probably even find a python binding to it, so you can get rid of C.

All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:

  1. Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
  2. Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.


回答6:

Another optimisation that would be easy to implement is that if you know that the inner dimension of your arrays will be divisible by n then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.

This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).

Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:

int i, j;

for(i = 0; i < vector_size; i++) {
    int d = *A++;
    int *e = B;

    for (j = 0; j < vector_size; j++) {
        *C++ = *e++ * d;
    }
}

This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)



回答7:

To solve your problem, I think you should try to use Eigen 3, it's a C++ library which use all matrix functions!

If you have time, go to see its documentation! =)

Good luck !