Efficient computation of kronecker products in C

I'm fairly new to C, not having much need to anything faster than python for most of my research. However, it turns out that recent work I've been doing required the computation of fairly large vectors/matrices, and there therefore a C+MPI solution might be in order.

Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker Product of selected pairs of these vectors, and then sum these kronecker products.

The question is, how to do this efficiently? Is there anything wrong with the following structure of code, using for loops, or obtain the effect?

The function kron described below passes vectors A and B of lengths vector_size, and computes their kronecker product, which it stores in C, a vector_size*vector_size matrix.

void kron(int *A, int *B, int *C, int vector_size) {

    int i,j;

    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] = A[i] * B[j];
        }
    }
    return;
}

This seems fine to me, and certainly (if I've not made some silly syntax error) produce the right result, but I have a sneaking suspicion that embedded for loops is not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.

I thank you for you patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.

标签： c linear-algebra

7条回答

叼着烟拽天下

2楼-- · 2019-02-16 17:30

This is a common enough problem in numerical computational circles, that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).

You could probably even find a python binding to it, so you can get rid of C.

All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:

Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.

0人赞添加讨论(0) 举报

放我归山

3楼-- · 2019-02-16 17:32

For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.

0人赞添加讨论(0) 举报

傲

4楼-- · 2019-02-16 17:35

Since your loop bodies are all completely independent, there is certainly a way to accelerate this. Easiest would be already to take advantage of several cores before thinking of MPI. OpenMP should do quite fine on this.

#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i][j] = A[i] * B[j];
    }
}

This is supported by many compilers nowadays.

You could also try to drag some common expressions out of the inner loop but decent compilers e.g gcc, icc or clang should do this quite well all by themselves:

#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int * vec = &C[i][0];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}

BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.

0人赞添加讨论(0) 举报

霸刀☆藐视天下

5楼-- · 2019-02-16 17:42

To solve your problem, I think you should try to use Eigen 3, it's a C++ library which use all matrix functions!

If you have time, go to see its documentation! =)

Good luck !

0人赞添加讨论(0) 举报

放荡不羁爱自由

6楼-- · 2019-02-16 17:44

Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding them to some 'running total' vector/matrix C, the following modified version of the above kron function

void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {

    int i,j;

    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] = alpha * A[i] * B[j];
        }
    }
    return;
}

precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 0 and C being an uninitialised (zeroed) matrix/vector of the correct dimensionality.

It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!

0人赞添加讨论(0) 举报

迷人小祖宗

7楼-- · 2019-02-16 17:55

Another optimisation that would be easy to implement is that if you know that the inner dimension of your arrays will be divisible by n then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.

This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).

Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:

int i, j;

for(i = 0; i < vector_size; i++) {
    int d = *A++;
    int *e = B;

    for (j = 0; j < vector_size; j++) {
        *C++ = *e++ * d;
    }
}

This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)

0人赞添加讨论(0) 举报

1 2 下一页

Efficient computation of kronecker products in C

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间