In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

Posted 2019-03-07 16:50

Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Is it worth optimizing it [..]

Could there be a case where this could yield more efficient results by writing:

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

... or is this trivial for the compiler to optimize?

What would you consider to be "better" code?

Note from editor (Ike): for those wondering about the struck-out text, the original question, as phrased, was dangerously close to off-topic territory and was very nearly closed despite positive feedback. Those parts have been struck out. Please do not punish the answerers who addressed those stricken sections of the question.

13 answers

Answer 2 · 聊天终结者 · 2019-03-07 17:08

Other answers have pointed out that hoisting the read of bitmap->width out of the loop may change defined behaviour: the aliasing rules allow char to alias anything, so it is not an allowable optimisation for the compiler, even though in most cases it is obviously correct to a human programmer.

They have also pointed out that hoisting the operation out of the loop is usually, but not always, an improvement from a performance point of view, and is often a negative from a readability point of view.

I would like to point out that there is often a "third way": rather than counting up to the number of iterations you want, you can count down to zero. This means the iteration count is only needed once, at the start of the loop; it doesn't have to be stored after that. Better still, at the assembler level it often eliminates the need for an explicit comparison, because the decrement operation usually sets flags indicating whether the counter was zero both before (carry flag) and after (zero flag) the decrement.

for (unsigned x = static_cast<unsigned>(bitmap->width); x > 0; x--)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Note that this version of the loop gives x values in the range 1..width rather than 0..(width-1). That doesn't matter in your case, because you aren't actually using x for anything, but it's something to be aware of. If you want a count-down loop with x values in the range 0..(width-1), you can write:

for (unsigned x = static_cast<unsigned>(bitmap->width); x-- > 0;)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

You can also get rid of the casts in the above examples, if you want, without worrying about their impact on comparison rules, since all you are doing with bitmap->width is assigning it directly to a variable.

Answer 3 · 小情绪 Triste * · 2019-03-07 17:15

Ok, guys, so I've measured, with GCC -O3 (using GCC 4.9 on Linux x64).

Turns out, the second version runs 54% faster!

So, I guess aliasing is the thing, I hadn't thought about it.

[Edit]

I've tried the first version again with all pointers defined with __restrict__, and the results are the same. Weird... Either aliasing is not the problem, or, for some reason, the compiler doesn't optimize it well even with __restrict__.

[Edit 2]

Ok, I think I was pretty much able to prove that aliasing is the problem. I repeated my original test, this time using an array rather than a pointer:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

And measured (had to use "-mcmodel=large" to link it). Then I tried:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

The measured results were the same; it seems the compiler was able to optimize it by itself.

Then I tried the original code (with a pointer p), this time with p of type std::uint16_t*. Again, the results were the same, thanks to strict aliasing. Then I tried building with "-fno-strict-aliasing", and once more saw a difference in time.

Answer 4 · Fickle 薄情 · 2019-03-07 17:16

Unless you know exactly how the compiler optimizes the code, it is better to do your own optimizations while keeping the code readable and well designed. In practice it is hard to check the assembly of every function we write against each new compiler version.

Answer 5 · 对你真心纯属浪费 · 2019-03-07 17:20

As a general rule, let the compiler do the optimization for you, until you determine that you should take over. The logic for this has nothing to do with performance, but rather with human readability. In the vast majority of cases, the readability of your program is more important than its performance. You should aim to write code which is easier for a human to read, and then only worry about optimization when you are convinced that performance is more important than the maintainability of your code.

Once you do see that performance matters, you should run a profiler on the code to determine which loops are being inefficient, and optimize those individually. There may indeed be cases where you want to do that optimization (especially if you migrate towards C++, where STL containers get involved), but the cost in terms of readability is great.

In addition, I can think of pathological situations where it could actually slow the code down. For example, consider the case where the compiler cannot prove that bitmap->width is constant through the process. By adding the width variable you force the compiler to maintain a local variable in that scope. If, for some platform-specific reason, that extra variable prevents some stack-space optimization, the compiler may have to reorganize how it emits machine code, and produce something less efficient.

As an example, on Windows x64, one is obliged to call a special function, __chkstk, in the preamble of any function that uses more than one page of local variables. This function gives Windows a chance to manage the guard pages it uses to expand the stack on demand. If your extra variable pushes the stack usage up from below one page to at-or-above one page, your function is now obliged to call __chkstk every time it is entered. If you were to optimize this loop on a slow path, you may actually slow the fast path down by more than you saved on the slow path!

Sure, it's a bit pathological, but the point of that example is that you can actually slow the code down. It just shows that you have to profile your work to determine where the optimizations go. In the meantime, please don't sacrifice readability in any way for an optimization that may or may not matter.

Answer 6 · Rolldiameter · 2019-03-07 17:24

I use the following pattern in situations like this. It is almost as short as your first case, and better than the second, because it keeps the temporary variable local to the loop.

for (unsigned int x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
  *p++ = 0xAA;
  *p++ = 0xBB;
  *p++ = 0xCC;
}

This will be faster with a less-than-smart compiler, a debug build, or certain compilation flags.

Edit 1: Placing a constant operation outside of a loop is a good programming pattern. It shows an understanding of the basics of machine operation, especially in C/C++. I'd argue that the burden of proof should be on people who do not follow this practice. If a compiler punishes a good pattern, that is a bug in the compiler.

Edit 2: I've measured my suggestion against the original code on vs2013 and got a 1% improvement. Can we do better? A simple manual optimization gives a 3x improvement over the original loop on an x64 machine, without resorting to exotic instructions. The code below assumes a little-endian system and a properly aligned bitmap. TEST 0 is the original (9 sec), TEST 1 is faster (3 sec). I bet somebody could make this even faster, and the result will also depend on the size of the bitmap. Sometime soon, compilers will be able to produce consistently fast code by themselves. I'm afraid that will be the future in which the compiler is also a programmer AI, so we would be out of work. But for now, just write code that shows you know the extra operation in the loop is not needed.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <time.h>

struct Bitmap_line
{
  int blah;
  unsigned int width;
  Bitmap_line(unsigned int w)
  {
    blah = 0;
    width = w;
  }
};

#define TEST 0 //define 1 for faster test

int main(int argc, char* argv[])
{
  unsigned int size = (4 * 1024 * 1024) / 3 * 3; //makes it divisible by 3
  unsigned char* pointer = (unsigned char*)malloc(size);
  memset(pointer, 0, size);
  std::unique_ptr<Bitmap_line> bitmap(new Bitmap_line(size / 3));
  clock_t told = clock();
#if TEST == 0
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
    //for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#else
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    unsigned x = 0;
    for (const unsigned n = static_cast<unsigned>(bitmap->width) - 4; x < n; x += 4)
    {
      // Type-punned stores: technically undefined behaviour under strict
      // aliasing, but works on the tested compiler. On a little-endian
      // system this writes the AA BB CC pattern 12 bytes (4 pixels) at a time.
      *(int64_t*)p = 0xBBAACCBBAACCBBAALL;
      p += 8;
      *(int32_t*)p = 0xCCBBAACC;
      p += 4;
    }

    for (const unsigned n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#endif
  double ms = 1000.0 * double(clock() - told) / CLOCKS_PER_SEC;
  printf("time %0.3f\n", ms);

  {
    //verify
    unsigned char* p = pointer;
    for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      if ((*p++ != 0xAA) || (*p++ != 0xBB) || (*p++ != 0xCC))
      {
        printf("EEEEEEEEEEEEERRRRORRRR!!!\n");
        abort();
      }
    }
  }

  return 0;
}
Answer 7 · 祖国的老花朵 · 2019-03-07 17:25

The comparison is wrong, since the two code snippets

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)

and

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x<width ;  ++x)

are not equivalent.

In the first case, width is re-read on each iteration and is not const; the compiler cannot assume it does not change between subsequent iterations. Thus the read cannot be hoisted out, but has to happen on every pass through the loop.

In your optimized case, a local variable is assigned the value of bitmap->width once, at one point during program execution, and the compiler can verify that it does not change afterwards.

Did you think about multithreading? Or perhaps the value is externally dependent, so that it is effectively volatile. How would one expect the compiler to figure all these things out if you do not tell it?

The compiler can only do as well as your code lets it.
