I'm trying to optimize a computation-intensive algorithm and am stuck on a cache problem. I have a huge buffer which is written occasionally and at random and read only once at the end of the application. Obviously, writing into the buffer produces lots of cache misses and also pollutes the caches, which are needed again afterwards for computation. I tried to use non-temporal move intrinsics, but the cache misses (reported by valgrind and supported by runtime measurements) still occur. To investigate non-temporal moves further, I wrote a little test program, which you can see below: sequential access, large buffer, only writes.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>

void tim(const char *name, void (*func)()) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_REALTIME, &t1);
    func();
    clock_gettime(CLOCK_REALTIME, &t2);
    printf("%s : %f s.\n", name, (t2.tv_sec - t1.tv_sec) + (float) (t2.tv_nsec - t1.tv_nsec) / 1000000000);
}

const int CACHE_LINE = 64;
const int FACTOR = 1024;
float *arr;
int length;

void func1() {
    for(int i = 0; i < length; i++) {
        arr[i] = 5.0f;
    }
}

void func2() {
    for(int i = 0; i < length; i += 4) {
        arr[i] = 5.0f;
        arr[i+1] = 5.0f;
        arr[i+2] = 5.0f;
        arr[i+3] = 5.0f;
    }
}

void func3() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for(int i = 0; i < length; i += 4) {
        _mm_stream_ps(&arr[i], buf);
    }
}

void func4() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for(int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i], buf);
        _mm_stream_ps(&arr[4], buf);
        _mm_stream_ps(&arr[8], buf);
        _mm_stream_ps(&arr[12], buf);
    }
}

int main() {
    length = CACHE_LINE * FACTOR * FACTOR;

    arr = malloc(length * sizeof(float));
    tim("func1", func1);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func2", func2);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func3", func3);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func4", func4);
    free(arr);

    return 0;
}
Function 1 is the naive approach, function 2 uses loop unrolling. Function 3 uses movntps, which was in fact inserted in the assembly, at least when I checked for -O0. In function 4 I tried to issue several movntps instructions at once to help the CPU do its write combining. I compiled the code with gcc -g -lrt -std=gnu99 -OX -msse4.1 test.c, where X is one of [0..3]. The results are interesting, to say the least:
-O0
func1 : 0.407794 s.
func2 : 0.320891 s.
func3 : 0.161100 s.
func4 : 0.401755 s.
-O1
func1 : 0.194339 s.
func2 : 0.182536 s.
func3 : 0.101712 s.
func4 : 0.383367 s.
-O2
func1 : 0.108488 s.
func2 : 0.088826 s.
func3 : 0.101377 s.
func4 : 0.384106 s.
-O3
func1 : 0.078406 s.
func2 : 0.084927 s.
func3 : 0.102301 s.
func4 : 0.383366 s.
As you can see, _mm_stream_ps is faster than the others when the program is not optimized by gcc, but it significantly fails its purpose once gcc optimization is turned on. Valgrind still reports lots of cache write misses.
So, the questions are: Why do those (L1+LL) cache misses still occur even though I'm using NTA streaming instructions? Why is func4 in particular so slow? Can someone explain or speculate about what is happening here?
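Shouldn't func4 use &arr[i+4], &arr[i+8] and &arr[i+12]? As posted, the last three stores in each iteration always go to the fixed addresses &arr[4], &arr[8] and &arr[12].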
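A corrected loop might look like the following sketch. The name func4_fixed and the choice to pass the buffer and length as parameters are my own assumptions for illustration, not part of the posted program; it also assumes the buffer is 16-byte aligned and the length is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_ps, _mm_set1_ps, _mm_sfence */

/* Hypothetical corrected variant of func4: every store address depends on i,
   so each iteration streams one full 64-byte line (16 floats).
   Assumes dst is 16-byte aligned and n is a multiple of 16. */
void func4_fixed(float *dst, size_t n) {
    assert(((uintptr_t)dst % 16) == 0 && n % 16 == 0);
    const __m128 buf = _mm_set1_ps(5.0f);
    for (size_t i = 0; i < n; i += 16) {
        _mm_stream_ps(&dst[i],      buf);
        _mm_stream_ps(&dst[i + 4],  buf);
        _mm_stream_ps(&dst[i + 8],  buf);
        _mm_stream_ps(&dst[i + 12], buf);
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```

With the posted program it could be called as func4_fixed(arr, length). Whether the four-store version then beats func3 is something only a new measurement can tell.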
1. Your benchmark probably measures memory allocation as much as write performance. The OS may allocate memory pages not in malloc, but on first touch, inside your func* functions. The OS may also do some memory shuffling after a large amount of memory is allocated, so benchmarks performed right after memory allocation may not be reliable.
2. Your code has an aliasing problem: the compiler may not be able to prove that the array pointer stays unchanged while the array is being filled, so it may have to reload the arr value from memory instead of using a register. This may cost some performance. The easiest way to avoid aliasing is to copy arr and length to local variables and use only the local variables to fill the array. There is a lot of well-known advice to avoid global variables, and aliasing is one of the reasons.
3. _mm_stream_ps works better if the array is aligned to 64 bytes. In your code no such alignment is guaranteed (in practice, malloc aligns to 16 bytes). This optimization is noticeable only for short arrays.
4. It is a good idea to call _mm_mfence after you have finished with _mm_stream_ps. This is needed for correctness, not for performance.
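The four points above can be sketched together roughly as follows. The helper names alloc_touched and fill_stream are hypothetical, and using posix_memalign for the 64-byte alignment and memset for the first touch are my assumptions about one reasonable way to do it, not something the posted program contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>     /* posix_memalign, free */
#include <string.h>     /* memset */
#include <emmintrin.h>  /* _mm_mfence; also provides _mm_stream_ps */

/* Hypothetical helper: allocate n floats aligned to 64 bytes and write to
   every page up front, so first-touch page faults are not charged to the
   timed fill. Returns NULL on failure. */
float *alloc_touched(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(float)) != 0)
        return NULL;
    memset(p, 0, n * sizeof(float));
    return p;
}

/* Hypothetical streaming fill that uses only its parameters, so pointer and
   length can stay in registers instead of being reloaded from globals.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value) {
    assert(((uintptr_t)dst % 16) == 0 && n % 4 == 0);
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(&dst[i], v);
    _mm_mfence();  /* fence once, after all non-temporal stores */
}
```

In the benchmark this would mean replacing each malloc with alloc_touched(length) and timing only the fill_stream call; the buffer is still released with free.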