I am trying to profile my C++ code using the perf tool. The implementation contains code with SSE/AVX/AVX2 instructions, and it is compiled with the -O3 -mavx2 -march=native flags. I believe __memset_avx2_unaligned_erms is a libc implementation of memset. perf shows that this function has considerable overhead. The function name suggests that the memory is unaligned, but in the code I am explicitly aligning the memory using the GCC attribute __attribute__((aligned (x))).
What might be the reason for this function to have significant overhead, and why is the unaligned version called although the memory is explicitly aligned?
I have attached a sample report as a picture.
No, the name doesn't mean the memory is unaligned. It means the memset strategy chosen by glibc on that hardware is one that doesn't try to avoid unaligned accesses entirely, in the small-size cases. (glibc selects a memset implementation at dynamic linker symbol resolution time, so it gets runtime dispatching with no extra overhead after the first call.)
If your buffer is in fact aligned and the size is a multiple of the vector width, all the accesses will be aligned and there's essentially no overhead. (Using vmovdqu with a pointer that happens to be aligned at runtime is exactly equivalent to vmovdqa on all CPUs that support AVX.)
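To check whether a particular buffer falls into that all-aligned case, you can test the pointer and the size at runtime. This is a minimal sketch, assuming a 32-byte vector width (AVX2); `kVecSize`, `is_vector_aligned` and `clear_aligned_buffer` are hypothetical names for illustration, not anything glibc provides:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed vector width for AVX2 (32 bytes); glibc's source calls this VEC_SIZE.
constexpr std::size_t kVecSize = 32;

// Hypothetical helper: true if the pointer is kVecSize-aligned and the size is
// a whole number of vectors, so no store has to straddle an alignment boundary.
inline bool is_vector_aligned(const void* p, std::size_t n) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return addr % kVecSize == 0 && n % kVecSize == 0;
}

// Usage sketch: clear an alignas(32) buffer whose size is a multiple of 32.
inline void clear_aligned_buffer() {
    alignas(kVecSize) unsigned char buf[4 * kVecSize];
    assert(is_vector_aligned(buf, sizeof buf));  // pointer and size both qualify
    std::memset(buf, 0, sizeof buf);
}
```

If the check fails for a buffer you believed was aligned, the alignment attribute is probably attached to a different object than the one actually passed to memset (for example a pointer variable rather than the storage it points to).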
For large buffers, it still aligns the pointer before the main loop in case it isn't aligned, at the cost of a couple of extra instructions vs. an implementation that only worked for 32-byte aligned pointers. (But it looks like it uses rep stosb without aligning the pointer, if it's going to rep stosb at all.)
gcc+glibc doesn't have a special version of memset that's only called with aligned pointers (or multiple special versions for different alignment guarantees). glibc's AVX2-unaligned implementation works nicely for both aligned and unaligned inputs.
It's defined in glibc/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S, which defines a couple of macros (like defining the vector size as 32) and then #includes "memset-vec-unaligned-erms.S".
The comment in the source code says:
/* memset is implemented as:
1. Use overlapping store to avoid branch.
2. If size is less than VEC, use integer register stores.
3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
4 VEC stores and store 4 * VEC at a time until done. */
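The "overlapping store" idea in points 1 and 3 can be sketched in portable C++. This is only an illustration under stated assumptions, not glibc's code: `store_vec` is a hypothetical stand-in for one unaligned 32-byte vector store, and `fill_vec_to_2vec` is a hypothetical name for the case where the size is between VEC_SIZE and 2 * VEC_SIZE:

```cpp
#include <cassert>
#include <cstddef>

// Assumed vector width for AVX2 (32 bytes); glibc's source calls this VEC_SIZE.
constexpr std::size_t kVecSize = 32;

// Hypothetical stand-in for one unaligned full-width vector store (vmovdqu).
inline void store_vec(unsigned char* dst, unsigned char value) {
    for (std::size_t i = 0; i < kVecSize; ++i) {
        dst[i] = value;
    }
}

// Fill n bytes, where kVecSize <= n <= 2 * kVecSize, with exactly two stores
// and no branch on the exact value of n: one store begins at the first byte,
// the other ends at the last byte.
inline void fill_vec_to_2vec(unsigned char* dst, unsigned char value,
                             std::size_t n) {
    assert(n >= kVecSize && n <= 2 * kVecSize);
    store_vec(dst, value);
    // Overlaps the first store unless n == 2 * kVecSize; rewriting the same
    // bytes with the same value is harmless.
    store_vec(dst + n - kVecSize, value);
}
```

The second store usually starts at an address that is not a multiple of 32, which is why the strategy needs unaligned stores even when the destination pointer itself is aligned.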
The actual alignment before the main loop is done after some vmovdqu vector stores (which have no penalty if used on data that is in fact aligned: https://agner.org/optimize/):
L(loop_start):
leaq (VEC_SIZE * 4)(%rdi), %rcx # rcx = input pointer + 4*VEC_SIZE
VMOVU %VEC(0), (%rdi) # store the first vector
andq $-(VEC_SIZE * 4), %rcx # align the pointer
... some more vector stores
... and stuff, including storing the last few vectors I think
addq %rdi, %rdx # size += start, giving an end-pointer
andq $-(VEC_SIZE * 4), %rdx # align the end-pointer
L(loop): # THE MAIN LOOP
VMOVA %VEC(0), (%rcx) # vmovdqa = alignment required
VMOVA %VEC(0), VEC_SIZE(%rcx)
VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx)
VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx)
addq $(VEC_SIZE * 4), %rcx
cmpq %rcx, %rdx
jne L(loop)
So with VEC_SIZE = 32, it aligns the pointer by 128. This is overkill; cache lines are 64 bytes, and really just aligning to the vector width should be fine.
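The pointer arithmetic done by the leaq/andq and addq/andq pairs can be written out in C++ to see which addresses the aligned main loop covers. This is a sketch assuming VEC_SIZE = 32; `LoopBounds` and `aligned_loop_bounds` are hypothetical names, and the size precondition is taken from point 5 of the quoted comment:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kVecSize = 32;               // assumed VEC_SIZE for AVX2
constexpr std::uintptr_t kLoopStride = 4 * kVecSize;  // 128 bytes per iteration

struct LoopBounds {
    std::uintptr_t first;  // what the asm keeps in rcx: first aligned store address
    std::uintptr_t end;    // what the asm keeps in rdx: aligned end of the loop
};

// Mirrors the asm: first = (start + 128) & -128, end = (start + size) & -128.
inline LoopBounds aligned_loop_bounds(std::uintptr_t start, std::size_t size) {
    assert(size > kLoopStride);  // assumption: this path is for size > 4 * VEC_SIZE
    const std::uintptr_t mask = ~(kLoopStride - 1);   // same bit pattern as -128
    return {(start + kLoopStride) & mask,
            (start + static_cast<std::uintptr_t>(size)) & mask};
}
```

For example, with start = 4112 and size = 1000 this gives first = 4224 and end = 4992. The up to 128 bytes before `first` and the fewer than 128 bytes from `end` onward are left to the unaligned vector stores outside the loop.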
It also has a threshold for using rep stos if enabled and the buffer size is > 2 KiB, on CPUs with ERMSB (Enhanced REP MOVSB and STOSB, the same feature that makes rep movsb attractive for memcpy).