In C++, should I bother to cache variables, or let...

Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

_{Is it worth optimizing it [..]}

Could there be a case where this could yield more efficient results by writing:

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

... or is this trivial for the compiler to optimize?

_{What would you consider to be "better" code?}

Note from editor (Ike): for those wondering about the strikeout text, the original question, as phrased, was dangerously close to off-topic territory and was very close to being closed in spite of positive feedback. These have been stricken out. Yet please do not punish the answerers who addressed these stricken sections of the question.

Tags: c++ performance caching optimization strict-aliasing

13 Answers

[This account has been banned]

Reply #2 · 2019-03-07 17:30

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags enabled. When I checked, I was surprised by the result:

Source `unoptimized.cpp`

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Source `optimized.cpp`

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    const unsigned width = static_cast<unsigned>(bitmap.width);
    for (unsigned x = 0 ; x < width ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Compilation

$ g++ -s -O3 unoptimized.cpp
$ g++ -s -O3 optimized.cpp

Assembly (unoptimized.s)

    .file   "unoptimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    mov %eax, %edx
    addl    $1, %eax
    movq    (%rsi,%rdx,8), %rdx
    movb    $0, (%rdx)
    cmpl    bitmap(%rip), %eax
    jb  .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

Assembly (optimized.s)

    .file   "optimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    subl    $1, %eax
    leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    movq    (%rsi,%rax), %rdx
    addq    $8, %rax
    cmpq    %rcx, %rax
    movb    $0, (%rdx)
    jne .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

diff

$ diff -uN unoptimized.s optimized.s
--- unoptimized.s   2015-11-24 16:11:55.837922223 +0000
+++ optimized.s 2015-11-24 16:12:02.628922941 +0000
@@ -1,4 +1,4 @@
-   .file   "unoptimized.cpp"
+   .file   "optimized.cpp"
    .text
    .p2align 4,,15
 .globl main
@@ -10,16 +10,17 @@
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
+   subl    $1, %eax
+   leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
 .L3:
-   mov %eax, %edx
-   addl    $1, %eax
-   movq    (%rsi,%rdx,8), %rdx
+   movq    (%rsi,%rax), %rdx
+   addq    $8, %rax
+   cmpq    %rcx, %rax
    movb    $0, (%rdx)
-   cmpl    bitmap(%rip), %eax
-   jb  .L3
+   jne .L3
 .L2:
    xorl    %eax, %eax
    ret

In the optimized version, the width is read from memory once, before the loop, and lea precomputes the loop's end offset from it. The unoptimized version instead re-reads bitmap.width from memory on every iteration (cmpl bitmap(%rip), %eax).

When I get time, I will post some benchmarks on that. Good question.


Juvenile、少年°

Reply #3 · 2019-03-07 17:31

There is actually insufficient information in your code snippet to tell, and the one thing that I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that and (because p is of type unsigned char*) the compiler has to make this code work even if p and bitmap overlap.

In this case, that means that if the loop changes bitmap->width through the pointer p, the change has to be seen when bitmap->width is re-read later on, which in turn means that it would be illegal for the compiler to cache it in a local variable on its own.
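To make the aliasing concern concrete, here is a minimal sketch in which p is deliberately pointed at the bytes of the width field itself. The `Bitmap` struct and the function names are hypothetical stand-ins (the real library type is unknown), and the loop writes one zero byte per iteration rather than three so that every store stays inside the `int`.

```cpp
#include <cassert>

// Hypothetical stand-in for the library's bitmap type; the real width type is unknown.
struct Bitmap {
    int width;
};

// Re-reads bitmap->width on every iteration, like the first loop in the question.
// Returns how many iterations actually ran.
unsigned fill_reloading(Bitmap* bitmap, unsigned char* p) {
    assert(bitmap != nullptr && p != nullptr);
    unsigned iterations = 0;
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x) {
        *p++ = 0;  // may legally modify the bytes of bitmap->width
        ++iterations;
    }
    return iterations;
}

// Reads bitmap->width once, like the second loop in the question.
unsigned fill_cached(Bitmap* bitmap, unsigned char* p) {
    assert(bitmap != nullptr && p != nullptr);
    const unsigned width = static_cast<unsigned>(bitmap->width);
    unsigned iterations = 0;
    for (unsigned x = 0; x < width; ++x) {
        *p++ = 0;
        ++iterations;
    }
    return iterations;
}

// Deliberately aliases p with the storage of bitmap.width.
unsigned run_aliased(bool cache_width) {
    Bitmap bitmap{static_cast<int>(sizeof(int))};  // width == sizeof(int), so no write leaves the int
    unsigned char* p = reinterpret_cast<unsigned char*>(&bitmap.width);
    return cache_width ? fill_cached(&bitmap, p) : fill_reloading(&bitmap, p);
}
```

The cached version always runs sizeof(int) iterations, while the reloading version stops as soon as the byte holding the value has been zeroed (after the first store on a little-endian machine). Since the two can legitimately differ, the compiler cannot turn one into the other by itself.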

That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but never directly sought out information on what the compiler is doing in this case), and quickly check if the pointers alias and run the faster code if it determines it's okay to.
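Written out by hand, the "two versions plus a runtime check" idea might look roughly like the sketch below. This only illustrates the concept and is not what any particular compiler emits; `Bitmap`, `ranges_overlap` and `fill_versioned` are hypothetical names, and the overlap test assumes pointers can be compared meaningfully after conversion to `std::uintptr_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the library's bitmap type.
struct Bitmap {
    int width;
};

// True if the byte ranges [a, a + a_len) and [b, b + b_len) share at least one byte.
bool ranges_overlap(const void* a, std::size_t a_len, const void* b, std::size_t b_len) {
    const auto a0 = reinterpret_cast<std::uintptr_t>(a);
    const auto b0 = reinterpret_cast<std::uintptr_t>(b);
    return a0 < b0 + b_len && b0 < a0 + a_len;
}

void fill_versioned(Bitmap* bitmap, unsigned char* p) {
    assert(bitmap != nullptr && p != nullptr);
    const unsigned width = static_cast<unsigned>(bitmap->width);
    const std::size_t bytes = static_cast<std::size_t>(width) * 3;

    if (!ranges_overlap(p, bytes, &bitmap->width, sizeof bitmap->width)) {
        // Fast path: the stores cannot touch bitmap->width, so the cached copy is safe.
        for (unsigned x = 0; x < width; ++x) {
            *p++ = 0xAA;
            *p++ = 0xBB;
            *p++ = 0xCC;
        }
    } else {
        // Slow path: keep the original semantics and re-read the width every iteration.
        for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x) {
            *p++ = 0xAA;
            *p++ = 0xBB;
            *p++ = 0xCC;
        }
    }
}
```

The check runs once per call, so its cost is amortized over the whole loop.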

That being said, I stand by my comment about simply measuring the performance of the two versions. My money is on not seeing any consistent performance difference between them.
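A rough way to do that measurement is sketched below using std::chrono::steady_clock. The `Bitmap` struct, the function names and the `Timing` result type are all assumptions made for illustration. A single run like this is noisy, so a serious comparison would repeat it many times, build with optimizations on, and check the generated assembly as well.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's bitmap type.
struct Bitmap {
    long long width;
};

// Re-reads bitmap->width on every iteration.
void fill_reloading(const Bitmap* bitmap, unsigned char* p) {
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x) {
        *p++ = 0xAA;
        *p++ = 0xBB;
        *p++ = 0xCC;
    }
}

// Reads bitmap->width once before the loop.
void fill_cached(const Bitmap* bitmap, unsigned char* p) {
    const unsigned width = static_cast<unsigned>(bitmap->width);
    for (unsigned x = 0; x < width; ++x) {
        *p++ = 0xAA;
        *p++ = 0xBB;
        *p++ = 0xCC;
    }
}

// Times a single call of fn and returns the elapsed microseconds.
template <typename Fn>
long long time_us(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

struct Timing {
    long long reloading_us;
    long long cached_us;
    bool same_output;  // sanity check that both loops wrote identical bytes
};

Timing compare(long long width) {
    assert(width >= 0);
    Bitmap bitmap{width};
    std::vector<unsigned char> a(static_cast<std::size_t>(width) * 3);
    std::vector<unsigned char> b(a.size());

    Timing t{};
    t.reloading_us = time_us([&] { fill_reloading(&bitmap, a.data()); });
    t.cached_us = time_us([&] { fill_cached(&bitmap, b.data()); });
    t.same_output = (a == b);
    return t;
}
```

Calling `compare` with a realistic width and printing the two fields gives a first impression; the `same_output` flag also keeps the optimizer from discarding the stores entirely.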

In my opinion, questions like these are okay if your purpose is to learn about compiler optimization theories and techniques, but they are a waste of time (a useless micro-optimization) if your end goal is to make the program run faster.


我只想做你的唯一

Reply #4 · 2019-03-07 17:32

The only thing here that can prevent the optimization is the strict aliasing rule. In short:

"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"

[…]

The exception to the rule is a char*, which is allowed to point to any type.

The exception also applies to unsigned and signed char pointers.

This is the case in your code: You're modifying *p through p which is an unsigned char*, so the compiler must assume that it could point to bitmap->width. Hence the caching of bitmap->width is an invalid optimization. This optimization-preventing behavior is shown in YSC's answer.

If and only if p pointed to a type that is neither a char type nor decltype(bitmap->width) would the caching be a possible optimization.
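If p has to stay an unsigned char*, another way to give the compiler the same freedom is to promise explicitly that p does not alias anything else. The sketch below uses `__restrict`, which is a compiler extension accepted by GCC, Clang and MSVC rather than standard C++. The `Bitmap` struct and the function name are hypothetical, and whether the load is actually hoisted out of the loop is up to the compiler.

```cpp
#include <cassert>

// Hypothetical stand-in for the library's bitmap type.
struct Bitmap {
    int width;
};

// __restrict (non-standard) promises that the bytes written through p are not
// accessed through any other pointer in this function, including bitmap.
// Passing a p that overlaps *bitmap would be undefined behavior.
void fill_restrict(const Bitmap* bitmap, unsigned char* __restrict p) {
    assert(bitmap != nullptr && p != nullptr);
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x) {
        *p++ = 0xAA;
        *p++ = 0xBB;
        *p++ = 0xCC;
    }
}
```

The trade-off is that the burden of proof moves from the compiler to the caller: the promise is never checked, so a wrong call silently breaks.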


我欲成王，谁敢阻挡

Reply #5 · 2019-03-07 17:33

The compiler is able to optimize a lot of things. For your example, you should go for readability, maintainability and whatever follows your coding standard. For more information about what can be optimized (with GCC), see this blog post.


狗以群分

Reply #6 · 2019-03-07 17:34

The question originally asked:

Is it worth optimizing it?

And my answer to that (garnering a good mix of both up and down votes):

Let the compiler worry about it.

The compiler will almost certainly do a better job than you. And there's no guarantee that your 'optimization' is any better than the 'obvious' code - have you measured it??

More importantly, have you any proof that the code you're optimizing has any impact on the performance of your program?

Despite the downvotes (and now seeing the aliasing issue), I'm still happy with that as a valid answer. If you don't know if it's worth optimizing something, it probably isn't.

A rather different question, of course, would be this:

How can I tell if it's worth optimizing a fragment of code?

First, does your application or library need to run faster than it currently does? Is the user kept waiting too long? Does your software forecast yesterday's weather instead of tomorrow's?

Only you can really tell this, based on what your software is for and what your users expect.

Assuming your software does need some optimization, the next thing to do is start measuring. Profilers will tell you where your code spends its time. If your fragment isn't showing as a bottleneck, it's best left alone. Profilers and other measuring tools will also tell you if your changes have made a difference. It's possible to spend hours attempting to optimize code, only to find you've made no discernible difference.

What do you mean by 'optimizing', anyway?

If you're not writing 'optimized' code, then your code should be as clear, clean and concise as you can make it. The "Premature optimization is evil" argument isn't an excuse for sloppy or inefficient code.

Optimized code normally sacrifices some of the attributes above for performance. It could involve introducing additional local variables, having objects with wider than expected scope or even reversing normal loop ordering. All of these may be less clear or concise, so document the code (briefly!) about why you're doing this.

But often, with 'slow' code, these micro-optimizations are the last resort. The first place to look is at algorithms and data structures. Is there a way of avoiding doing the work at all? Can linear searches be replaced with binary ones? Would a linked list be faster here than a vector? Or a hash table? Can I cache results? Making good 'efficient' decisions here can often affect performance by an order of magnitude or more!
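As a small illustration of the "linear search versus binary search" point, the sketch below contrasts std::find with std::binary_search. The function names are made up for the example, and the binary version is only valid when the data is already sorted.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Linear scan: O(n) comparisons, works on any ordering.
bool contains_linear(const std::vector<int>& values, int key) {
    return std::find(values.begin(), values.end(), key) != values.end();
}

// Binary search: O(log n) comparisons, but the input must already be sorted.
bool contains_sorted(const std::vector<int>& sorted_values, int key) {
    assert(std::is_sorted(sorted_values.begin(), sorted_values.end()));
    return std::binary_search(sorted_values.begin(), sorted_values.end(), key);
}
```

For a handful of elements the difference hardly matters, but for large, frequently searched collections this kind of change tends to dwarf anything gained by caching a loop bound.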


混吃等死

Reply #7 · 2019-03-07 17:34

There are two things to consider.

A) How often will the optimization run?

If the answer is not very often, like only when a user clicks a button, then don't bother if it makes your code unreadable. If the answer is 1000 times a second then you will probably want to go with the optimization. If it is even a bit complex be sure to put a comment in to explain what is going on to help the next guy that comes along.

B) Will this make the code harder to upkeep/troubleshoot?

If you're not seeing a huge gain in performance then making your code cryptic simply to save a few clock ticks is not a good idea. Lots of people will tell you that any good programmer should be able to look at the code and figure out what is going on. This is true. The problem is that in the business world the extra time figuring that out costs money. So, if you can make it prettier to read then do it. Your friends will thank you for it.

That said, I'd personally use the B example.
