I recently downloaded and installed the Intel C++ compiler, Composer XE 2013 for Linux, which is free for non-commercial development: http://software.intel.com/en-us/non-commercial-software-development
I'm running on an Ivy Bridge system (which has AVX). I have two versions of a function that do the same thing: one does not use SSE/AVX, and the other uses AVX. With GCC the AVX code is about four times faster than the scalar code. However, with the Intel C++ compiler the AVX version is slower and the speedup over scalar is much smaller. With GCC I compile like this:
gcc m6.cpp -o m6_gcc -O3 -mavx -fopenmp -Wall -pedantic
With ICC I compile like this:
icc m6.cpp -o m6_icc -O3 -mavx -fopenmp -Wall -pedantic
I'm only using OpenMP for timing (with omp_get_wtime()) at this point.
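To be concrete, the timing is just omp_get_wtime() calls bracketing the repeat loop. A minimal sketch of the harness (the repeat count and the kernel call are placeholders, not my exact code):

#include <omp.h>
#include <cstdio>

int main() {
    const int repeat = 100000;            // placeholder value
    double t0 = omp_get_wtime();          // wall-clock start (requires -fopenmp)
    for (int r = 0; r < repeat; r++) {
        // ... call the scalar or AVX kernel over all vectors here ...
    }
    double t1 = omp_get_wtime();          // wall-clock end
    printf("time %f\n", t1 - t0);
    return 0;
}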
The strange thing is that if I change -mavx to, say, -msse2, the code fails to compile with GCC but compiles just fine with ICC. In fact, I can drop -mavx altogether and it still compiles. It seems that no matter what options I try, ICC compiles the code but does not make optimal use of the AVX instructions. So I'm wondering whether I'm doing something wrong in enabling/disabling SSE/AVX with ICC.
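For example (a sketch; m6_sse is just an illustrative output name, and the comments describe the behavior I see):

gcc m6.cpp -o m6_sse -O3 -msse2 -fopenmp -Wall -pedantic   # fails: the _mm256_* intrinsics need -mavx
icc m6.cpp -o m6_sse -O3 -msse2 -fopenmp -Wall -pedantic   # compiles without complaint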
Here is the AVX function I'm using.
#include <immintrin.h>

inline void prod_block4_unroll2_AVX(double *x, double *M, double *y, double *result) {
    // Two accumulators so the unrolled iterations don't serialize on one register.
    __m256d sum4_1 = _mm256_set1_pd(0.0);   // 0.0, not 0.0f: these are doubles
    __m256d sum4_2 = _mm256_set1_pd(0.0);

    // Preload the six 4-wide rows of y (y must be 32-byte aligned for _mm256_load_pd).
    __m256d yrow[6];
    for (int i = 0; i < 6; i++) {
        yrow[i] = _mm256_load_pd(&y[4*i]);
    }

    for (int i = 0; i < 6; i++) {
        __m256d x4 = _mm256_load_pd(&x[4*i]);
        for (int j = 0; j < 6; j += 2) {    // inner loop unrolled by 2
            __m256d brod1 = _mm256_set1_pd(M[i*6 + j]);
            sum4_1 = _mm256_add_pd(sum4_1, _mm256_mul_pd(_mm256_mul_pd(x4, brod1), yrow[j]));
            __m256d brod2 = _mm256_set1_pd(M[i*6 + j+1]);
            sum4_2 = _mm256_add_pd(sum4_2, _mm256_mul_pd(_mm256_mul_pd(x4, brod2), yrow[j+1]));
        }
    }

    sum4_1 = _mm256_add_pd(sum4_1, sum4_2);
    _mm256_store_pd(result, sum4_1);
}
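For reference, the scalar version computes the same quantity lane by lane. This is a sketch of the equivalent computation (not my exact scalar code):

// Scalar equivalent of the AVX kernel above: for each of the 4 lanes k,
// result[k] = sum over i,j of x[4*i+k] * M[6*i+j] * y[4*j+k].
inline void prod_block4_scalar(double *x, double *M, double *y, double *result) {
    for (int k = 0; k < 4; k++) {
        double sum = 0.0;
        for (int i = 0; i < 6; i++) {
            for (int j = 0; j < 6; j++) {
                sum += x[4*i + k] * M[6*i + j] * y[4*j + k];
            }
        }
        result[k] = sum;
    }
}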
Here is the timing information in seconds. I run over three working-set sizes chosen to correspond to the L1, L2, and L3 cache (on Ivy Bridge: 32 KB L1d and 256 KB L2 per core, plus a shared L3 of several MB). I only get the ~4x speedup in the L1 range, and only with GCC. Note that ICC produces much faster scalar code but slower AVX code.
GCC:
nvec 2000, repeat 100000
time scalar 5.847293
time SIMD 1.463820
time scalar/SIMD 3.994543
nvec 32000, repeat 10000
time scalar 9.529597
time SIMD 2.616296
time scalar/SIMD 3.642400
difference 0.000000
nvec 5000000, repeat 100
time scalar 15.105612
time SIMD 4.530891
time scalar/SIMD 3.333917
difference -0.000000
ICC:
nvec 2000, repeat 100000
time scalar 3.715568
time SIMD 2.025883
time scalar/SIMD 1.834049
nvec 32000, repeat 10000
time scalar 6.128615
time SIMD 3.509130
time scalar/SIMD 1.746477
nvec 5000000, repeat 100
time scalar 9.844096
time SIMD 5.782332
time scalar/SIMD 1.702444