I see people using the -msse -msse2 -mfpmath=sse flags by default, hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in the C code. But do these flags make any difference for regular C code? Does the compiler use SSE to optimize regular C code?
Answer 1:
Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3.
Even at -O1 or -Os, compilers will use SIMD load/store instructions to copy or initialize structs or other objects wider than an integer register. That doesn't really count as auto-vectorization; it's more like part of their default builtin memset / memcpy strategy for small fixed-size blocks. But it does take advantage of and require SIMD instructions to be supported.
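A minimal sketch of that (the struct name is made up; exact code-gen varies by compiler and version): copying an object wider than an integer register typically turns into a couple of 16-byte SSE loads and stores even without the vectorizer running.

// Hypothetical 32-byte struct: gcc and clang typically copy it with
// 16-byte SSE loads/stores (movdqu / movups) even at -O1, as part of
// their inline-memcpy strategy for small fixed-size objects.
struct Blob { long a, b, c, d; };

void copy_blob(struct Blob *dst, const struct Blob *src) {
    *dst = *src;   // usually a pair of 16-byte vector loads + stores on x86-64
}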
SSE2 is baseline / non-optional for x86-64, so compilers can always use SSE1/SSE2 instructions when targeting x86-64. Later instruction sets (SSE4, AVX, AVX2, AVX512, and non-SIMD extensions like BMI2, popcnt, etc.) have to be enabled manually to tell the compiler it's ok to make code that won't run on older CPUs. Or to get it to generate multiple versions of code and choose at runtime, but that has extra overhead and is only worth it for larger functions.
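If you do want runtime dispatch without writing it by hand, gcc (and recent clang) have a target_clones function attribute that builds several versions of a function and picks one via an ifunc resolver at load time. A minimal sketch (the function name is made up for illustration):

__attribute__((target_clones("avx2", "default")))
int sum_clones(const int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[i];
    return sum;   // the resolver picks the AVX2 or baseline clone once, at program load
}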
-msse -msse2 -mfpmath=sse is already the default for x86-64, but not for 32-bit i386. Some 32-bit calling conventions return FP values in x87 registers, so it can be inconvenient to use SSE/SSE2 for computation and then have to store/reload the result to get it into x87 st(0). With -mfpmath=sse, smarter compilers might still use x87 for a calculation that produces an FP return value.
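A minimal sketch of that store/reload annoyance (compile with -m32 -msse2 -mfpmath=sse to see it; exact code-gen varies by compiler):

float add_floats(float a, float b) {
    // With -mfpmath=sse on 32-bit, the add is typically done in an xmm register,
    // then stored to memory and reloaded with fld so the return value ends up in
    // x87 st(0) as the calling convention requires.
    return a + b;
}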
On 32-bit x86, -msse2 might not be on by default; it depends on how your compiler was configured. If you're using 32-bit because you're targeting CPUs so old they can't run 64-bit code, you might want to make sure it's disabled, or enable only -msse.
The best way to make a binary tuned for the CPU you're compiling on is -O3 -march=native -mfpmath=sse, and use link-time optimization + profile-guided optimization (gcc -fprofile-generate / run on some test data / gcc -fprofile-use).
Using -march=native makes binaries that might not run on earlier CPUs, if the compiler does choose to use new instructions. Profile-guided optimization is very helpful for gcc: it never unrolls loops without it. But with PGO, it knows which loops run often / for a lot of iterations, i.e. which loops are "hot" and worth spending more code-size on. Link-time optimization allows inlining / constant-propagation across files. It's very helpful if you have C++ with a lot of small functions that you don't actually define in header files.
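A rough sketch of that build workflow with gcc (file and program names are made up; adjust for your project):

gcc -O3 -march=native -mfpmath=sse -flto -fprofile-generate prog.c -o prog
./prog representative-input.dat        # run on typical data to collect .gcda profiles
gcc -O3 -march=native -mfpmath=sse -flto -fprofile-use prog.c -o prog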
See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output and making sense of it.
Here are some specific examples on the Godbolt compiler explorer for x86-64. Godbolt also has gcc for several other architectures, and with clang you can add -target mips or whatever, so you can also see auto-vectorization for ARM NEON with the right compiler options to enable it. You can use -m32 with the x86-64 compilers to get 32-bit code-gen.
int sumint(int *arr) {
    int sum = 0;
    for (int i=0 ; i<2048 ; i++){
        sum += arr[i];
    }
    return sum;
}
Inner loop with gcc8.1 -O3 (without -march=haswell or anything to enable AVX/AVX2):
.L2: # do {
movdqu xmm2, XMMWORD PTR [rdi] # load 16 bytes
add rdi, 16
paddd xmm0, xmm2 # packed add of 4 x 32-bit integers
cmp rax, rdi
jne .L2 # } while(p != endp)
# then horizontal add and extract a single 32-bit sum
Without -ffast-math, compilers can't reorder FP operations, so the float equivalent doesn't auto-vectorize (see the Godbolt link: you get scalar addss). (OpenMP can enable it on a per-loop basis, or you can use -ffast-math.)
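For example, a per-loop OpenMP sketch (compile with -fopenmp or -fopenmp-simd) that tells the compiler reassociating this particular reduction is fine, so the float sum can vectorize without -ffast-math:

float sumfloat_omp(const float *arr) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // promise: reordering this sum is OK
    for (int i = 0; i < 2048; i++)
        sum += arr[i];
    return sum;
}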
But some FP stuff can safely auto-vectorize without changing order of operations.
// clang won't contract this into an FMA without -ffast-math :/
// but gcc will (if you compile with -march=haswell)
void scale_array(float *arr) {
    for (int i=0 ; i<2048 ; i++){
        arr[i] = arr[i] * 2.1f + 1.234f;
    }
}
  # load constants: xmm2 = {2.1, 2.1, 2.1, 2.1}
  #                 xmm1 = {1.234, 1.234, 1.234, 1.234}
.L9: # gcc8.1 -O3 # do {
movups xmm0, XMMWORD PTR [rdi] # load unaligned packed floats
add rdi, 16
mulps xmm0, xmm2 # multiply Packed Single-precision
addps xmm0, xmm1 # add Packed Single-precision
movups XMMWORD PTR [rdi-16], xmm0 # store back to the array
cmp rax, rdi
jne .L9 # }while(p != endp)
multiplier = 2.0f results in using addps to double (x+x instead of a multiply), cutting throughput by a factor of 2 on Haswell / Broadwell! Because before SKL, FP add only runs on one execution port, but there are two FMA units that can run multiplies. SKL dropped the dedicated adder and runs FP add with the same 2-per-clock throughput and latency as mul and FMA. (http://agner.org/optimize/, and see other performance links in the x86 tag wiki.)
Compiling with -march=haswell lets the compiler use a single FMA for the scale + add. (But clang won't contract the expression into an FMA unless you use -ffast-math. IIRC there's an option, -ffp-contract=fast, to enable FP contraction without the other aggressive fast-math optimizations.)
Answer 2:
That's impossible to answer generally. For some specific C source and compiler, however, you can answer that by looking at the generated assembly. Almost any compiler should have an option to create assembly files. Then you can search for SSE instructions.
For most Unix C compilers, use the -S option. For details, Read The Fine Manual of your compiler.
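For example, with gcc you might do something like this (foo.c is a placeholder file name):

gcc -O3 -S -masm=intel foo.c            # writes assembly to foo.s
grep -E 'xmm|addps|mulps|paddd' foo.s   # look for SSE registers / packed instructions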