I'm trying to create a local array of POD values (e.g. `double`) with a fixed `max_size` that is known at compile time, then read a runtime `size` value (`size <= max_size`) and process the first `size` elements of that array.

The question is: why doesn't the compiler eliminate stack reads and writes when `arr` and `size` are placed into the same `struct`/`class`, as opposed to the case where `arr` and `size` are independent local variables?

Here's my code:
#include <cstddef>

constexpr std::size_t max_size = 64;

extern void process_value(double& ref_value);

void test_distinct_array_and_size(std::size_t size)
{
    double arr[max_size];
    std::size_t arr_size = size;
    for (std::size_t i = 0; i < arr_size; ++i)
        process_value(arr[i]);
}

void test_array_and_size_in_local_struct(std::size_t size)
{
    struct
    {
        double arr[max_size];
        std::size_t size;
    } array_wrapper;

    array_wrapper.size = size;
    for (std::size_t i = 0; i < array_wrapper.size; ++i)
        process_value(array_wrapper.arr[i]);
}
Assembly output for `test_distinct_array_and_size` from Clang with `-O3`:
test_distinct_array_and_size(unsigned long): # @test_distinct_array_and_size(unsigned long)
push r14
push rbx
sub rsp, 520
mov r14, rdi
test r14, r14
je .LBB0_3
mov rbx, rsp
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov rdi, rbx
call process_value(double&)
add rbx, 8
dec r14
jne .LBB0_2
.LBB0_3:
add rsp, 520
pop rbx
pop r14
ret
Assembly output for `test_array_and_size_in_local_struct`:
test_array_and_size_in_local_struct(unsigned long): # @test_array_and_size_in_local_struct(unsigned long)
push r14
push rbx
sub rsp, 520
mov qword ptr [rsp + 512], rdi
test rdi, rdi
je .LBB1_3
mov r14, rsp
xor ebx, ebx
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov rdi, r14
call process_value(double&)
inc rbx
add r14, 8
cmp rbx, qword ptr [rsp + 512]
jb .LBB1_2
.LBB1_3:
add rsp, 520
pop rbx
pop r14
ret
Latest GCC and MSVC compilers do basically the same thing with the stack reads and writes.
As we can see, reads and writes to the `array_wrapper.size` variable on the stack are not optimized away in the latter case. There is a write of the `size` value into location `[rsp + 512]` before the loop begins, and a read from that location after each iteration.

So the compiler apparently expects that we might modify `array_wrapper.size` from inside the `process_value(array_wrapper.arr[i])` call (by taking the address of the current array element and applying some weird offsets to it?). But if we tried to do that from inside the call, wouldn't it be undefined behavior?
When we rewrite the loop in the following manner:

for (std::size_t i = 0, sz = array_wrapper.size; i < sz; ++i)
    process_value(array_wrapper.arr[i]);

those unnecessary reads at the end of each iteration are gone. But the initial write to `[rsp + 512]` remains, meaning the compiler still expects `process_value` to be able to access the `array_wrapper.size` variable at that location (again by some offset-based magic).

Why?
Is this just a small shortcoming in modern compilers' implementations (one that will hopefully be fixed soon)? Or does the C++ standard indeed require such behavior, leading to less efficient code whenever we put an array and its size into the same class?
P.S. I realize that my code example above might seem a bit contrived. But consider this: I'd like to use a lightweight `boost::container::static_vector`-like class template in my code for safer and more convenient "C++-style" manipulation of pseudo-dynamic arrays of POD elements. So my `PODVector` will contain an array and a `size_t` in the same class:
#include <cassert>
#include <cstddef>
#include <type_traits>

template<typename T, std::size_t MaxSize>
class PODVector
{
    static_assert(std::is_pod<T>::value, "T must be a POD type");

private:
    T _data[MaxSize];
    std::size_t _size = 0;

public:
    using iterator = T *;

    static constexpr std::size_t capacity() noexcept
    {
        return MaxSize;
    }

    constexpr PODVector() noexcept = default;

    explicit constexpr PODVector(std::size_t initial_size)
        : _size(initial_size)
    {
        assert(initial_size <= capacity());
    }

    constexpr std::size_t size() const noexcept
    {
        return _size;
    }

    constexpr void resize(std::size_t new_size)
    {
        assert(new_size <= capacity());
        _size = new_size;
    }

    constexpr iterator begin() noexcept
    {
        return _data;
    }

    constexpr iterator end() noexcept
    {
        return _data + _size;
    }

    constexpr T & operator[](std::size_t position)
    {
        assert(position < _size);
        return _data[position];
    }
};
Usage:
void test_pod_vector(std::size_t size)
{
    PODVector<double, max_size> arr(size);
    for (double& val : arr)
        process_value(val);
}
If the issue described above is indeed mandated by the C++ standard (and is not a shortcoming of compiler implementations), such a `PODVector` will never be as efficient as the raw combination of an array and an "unrelated" size variable. And that would be quite bad for C++ as a language that strives for zero-overhead abstractions.