Template-ing a 'for' loop in C++?

2019-03-10 21:13发布

I have a C++ snippet below with a run-time for loop,

for(int i = 0; i < I; i++)
  for (int j = 0; j < J; j++)
    A( row(i,j), column(i,j) ) = f(i,j);

The snippet is called repeatedly. The loop bounds 'I' and 'J' are known at compile time (I/J are the order of 2 to 10). I would like to unroll the loops somehow using templates. The main bottleneck is the row() and column() and f() functions. I would like to replace them with equivalent metaprograms that are evaluated at compile-time, using row<i,j>::enum tricks.

What I'd really love is something that eventually resolves the loop into a sequence of statements like:

A(12,37) = 0.5;
A(15,23) = 0.25;
A(14,45) = 0.25;

But I'd like to do so without wrecking the for-for structure too much. Something in the spirit of:

TEMPLATE_FOR<i,0,I>
  TEMPLATE_FOR<j,0,J>
     A( row<i,j>::value, column<i,j>::value ) = f<i,j>::value

Can boost::lambda (or something else) help me create this?

10条回答
在下西门庆
2楼-- · 2019-03-10 21:21

I've never tried to do this so take this idea with a grain of salt...

It seems like you could use Boost.Preprocessor to do the loop unrolling (specifically the BOOST_PP_FOR and BOOST_PP_FOR_r macros) and then use templates to generate the actual constant expression.

查看更多
可以哭但决不认输i
3楼-- · 2019-03-10 21:21

I'm not a fan of template meta-programming, so you may want to take this answer with a pinch of salt. But, before I invested any time on this problem, I'd ask myself the following:

  1. Is my for loop a bottleneck?
  2. Is unrolling it going to make any difference? Easily tested by hand-unrolling and measuring.

In many compilers/cpus, the "looped" version can give better performance due to cache effects.

Remember: Measure first, optimise later - if at all.

查看更多
在下西门庆
4楼-- · 2019-03-10 21:24

You could use Boost.Mpl to implement the whole thing at compile-time, but I'm not sure it'll be faster. (Mpl essentially re-implements all the STL algorithms as compile-time metaprogramming templates)

The problem with that approach is that you end up unrolling and inlining a lot of code, which may thrash the instruction cache and eat up memory bandwidth that could have been saved. That may produce huge, bloated and slow code.

I would probably probably rather trust the compiler to inline the functions that make sense. As long as the row and column function definitions are visible from the loop, the compiler can trivially inline the calls and unroll as many iterations as it deems beneficial.

查看更多
不美不萌又怎样
5楼-- · 2019-03-10 21:24

If you are willing to modify the syntax a bit you can do something like this:

template <int i, int ubound>
struct OuterFor {
    void operator()() {
        InnerFor<i, 0, J>()();
        OuterFor<i + 1, ubound>()();
    }
};

template <int ubound>
struct OuterFor <ubound, ubound> {
    void operator()() {
    }
};

In InnerFor, i is the outer loops counter (compile time constant), j is the inner loops counter (initially 0 - also compile time constant), so you can evaluate row as a compile time template.

Its a bit more complicated, but as you say, row(), col(), and f() are your complicated parts anyways. At least try it and see if the performance is worth it. It may be worth it to investigate other options to simplify your row(), etc functions.

查看更多
虎瘦雄心在
6楼-- · 2019-03-10 21:28

A good compiler should do unrolling for you. For instance, in gcc compiling with the -O2 option turns on loop unrolling.

If you try to do it yourself manually, unless you measure things carefully and really know what you are doing, you are liable to end up with slower code. For example, in your case with manual unrolling you are liable to prevent the compiler from being able to do a loop interchange or stripmine optimization (look for --floop-interchange and -floop-strip-mine in the gcc docs)

查看更多
Deceive 欺骗
7楼-- · 2019-03-10 21:33

f would need to return a double - that can't be done at compile time.

查看更多
登录 后发表回答