I'm trying to create a local array of POD values (e.g. `double`) with a fixed `max_size` that is known at compile time, then read a runtime `size` value (`size <= max_size`) and process the first `size` elements of that array.

The question is: why doesn't the compiler eliminate stack reads and writes when `arr` and `size` are placed into the same `struct`/`class`, as opposed to the case where `arr` and `size` are independent local variables?

Here's my code:
#include <cstddef>

constexpr std::size_t max_size = 64;

extern void process_value(double& ref_value);

void test_distinct_array_and_size(std::size_t size)
{
    double arr[max_size];
    std::size_t arr_size = size;
    for (std::size_t i = 0; i < arr_size; ++i)
        process_value(arr[i]);
}

void test_array_and_size_in_local_struct(std::size_t size)
{
    struct
    {
        double arr[max_size];
        std::size_t size;
    } array_wrapper;

    array_wrapper.size = size;
    for (std::size_t i = 0; i < array_wrapper.size; ++i)
        process_value(array_wrapper.arr[i]);
}
Assembly output for `test_distinct_array_and_size` from Clang with `-O3`:
test_distinct_array_and_size(unsigned long): # @test_distinct_array_and_size(unsigned long)
push r14
push rbx
sub rsp, 520
mov r14, rdi
test r14, r14
je .LBB0_3
mov rbx, rsp
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov rdi, rbx
call process_value(double&)
add rbx, 8
dec r14
jne .LBB0_2
.LBB0_3:
add rsp, 520
pop rbx
pop r14
ret
Assembly output for `test_array_and_size_in_local_struct`:
test_array_and_size_in_local_struct(unsigned long): # @test_array_and_size_in_local_struct(unsigned long)
push r14
push rbx
sub rsp, 520
mov qword ptr [rsp + 512], rdi
test rdi, rdi
je .LBB1_3
mov r14, rsp
xor ebx, ebx
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov rdi, r14
call process_value(double&)
inc rbx
add r14, 8
cmp rbx, qword ptr [rsp + 512]
jb .LBB1_2
.LBB1_3:
add rsp, 520
pop rbx
pop r14
ret
Latest GCC and MSVC compilers do basically the same thing with the stack reads and writes.
As we can see, reads and writes to the `array_wrapper.size` variable on the stack are not optimized away in the latter case. There is a write of the `size` value into location `[rsp + 512]` before the loop begins, and a read from that location after each iteration.

So the compiler apparently expects that we might modify `array_wrapper.size` from inside the `process_value(array_wrapper.arr[i])` call (by taking the address of the current array element and applying some weird offsets to it?). But if we tried to do that from inside the call, wouldn't it be undefined behavior?
When we rewrite the loop in the following manner:

for (std::size_t i = 0, sz = array_wrapper.size; i < sz; ++i)
    process_value(array_wrapper.arr[i]);

those unnecessary reads at the end of each iteration are gone. But the initial write to `[rsp + 512]` remains, meaning the compiler still expects `process_value` to be able to access the `array_wrapper.size` variable at that location (again by some offset-based magic).

Why?
Is this just a small shortcoming in modern compilers' implementations (one that will hopefully be fixed soon)? Or does the C++ standard indeed require such behavior, leading to less efficient code whenever we put an array and its size into the same class?
P.S. I realize that my code example above might seem a bit contrived. But consider this: I'd like to use a lightweight `boost::container::static_vector`-like class template in my code for safer and more convenient "C++-style" manipulation of pseudo-dynamic arrays of POD elements. So my `PODVector` will contain an array and a `size_t` in the same class:
#include <cassert>
#include <cstddef>
#include <type_traits>

template<typename T, std::size_t MaxSize>
class PODVector
{
    static_assert(std::is_pod<T>::value, "T must be a POD type");

private:
    T _data[MaxSize];
    std::size_t _size = 0;

public:
    using iterator = T *;

    static constexpr std::size_t capacity() noexcept
    {
        return MaxSize;
    }

    constexpr PODVector() noexcept = default;

    explicit constexpr PODVector(std::size_t initial_size)
        : _size(initial_size)
    {
        assert(initial_size <= capacity());
    }

    constexpr std::size_t size() const noexcept
    {
        return _size;
    }

    constexpr void resize(std::size_t new_size)
    {
        assert(new_size <= capacity());
        _size = new_size;
    }

    constexpr iterator begin() noexcept
    {
        return _data;
    }

    constexpr iterator end() noexcept
    {
        return _data + _size;
    }

    constexpr T & operator[](std::size_t position)
    {
        assert(position < _size);
        return _data[position];
    }
};
Usage:
void test_pod_vector(std::size_t size)
{
    PODVector<double, max_size> arr(size);
    for (double& val : arr)
        process_value(val);
}
If the issue described above is indeed mandated by the C++ standard (and is not a shortcoming of compiler implementations), such a `PODVector` will never be as efficient as the raw combination of an array and an "unrelated" size variable. And that would be quite bad for C++ as a language that strives for zero-overhead abstractions.