I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming
I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU.
Based on the article, I attempted the following.
For two arrays of length ARRAY_SIZE, it calculates
fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5
Here is the code:
#include <iostream>
#include <iomanip>
#include <ctime>
#include <stdlib.h>
#include <xmmintrin.h> // Contain the SSE compiler intrinsics
#include <malloc.h>
void myssefunction(
    float* pArray1,   // [in] first source array
    float* pArray2,   // [in] second source array
    float* pResult,   // [out] result array
    int nSize)        // [in] size of all arrays
{
    int nLoop = nSize / 4;
    __m128 m1, m2, m3, m4;
    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;
    __m128 m0_5 = _mm_set_ps1(0.5f);       // m0_5[0, 1, 2, 3] = 0.5
    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);   // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);   // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);           // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);              // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);     // *pDest = m4 + 0.5
        pSrc1++;
        pSrc2++;
        pDest++;
    }
}
int main(int argc, char *argv[])
{
    int ARRAY_SIZE = atoi(argv[1]);
    float* m_fArray1 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
    float* m_fArray2 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
    float* m_fArray3 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
    for (int i = 0; i < ARRAY_SIZE; ++i)
    {
        m_fArray1[i] = ((float)rand())/RAND_MAX;
        m_fArray2[i] = ((float)rand())/RAND_MAX;
    }
    myssefunction(m_fArray1 , m_fArray2 , m_fArray3, ARRAY_SIZE);
    _aligned_free(m_fArray1);
    _aligned_free(m_fArray2);
    _aligned_free(m_fArray3);
    return 0;
}
I get the following compilation error:
[Programming/SSE]$ g++ -g -Wall -msse sseintro.cpp
sseintro.cpp: In function ‘int main(int, char**)’:
sseintro.cpp:41: error: ‘_aligned_malloc’ was not declared in this scope
sseintro.cpp:53: error: ‘_aligned_free’ was not declared in this scope
[Programming/SSE]$
Where am I messing up? Am I missing some header files? I seem to have included all the relevant ones.
_aligned_malloc and _aligned_free are Microsoft-isms. Use posix_memalign or memalign on Linux et al. On Mac OS X you can just use malloc, as its results are always 16-byte aligned. For portable SSE code you generally want to implement wrapper functions for aligned memory allocations, e.g.
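The wrapper this answer refers to is not reproduced here. As a minimal sketch of the idea, assuming the allocating counterpart of free_simd is called malloc_simd (a hypothetical name and signature) and that only Windows and POSIX targets matter, it could look roughly like this:

```cpp
#include <cstddef>
#include <stdlib.h>      // posix_memalign (POSIX), free
#if defined(_WIN32)
#include <malloc.h>      // _aligned_malloc (Microsoft CRT)
#endif

// Hypothetical wrapper (name and signature are assumptions, not from the answer):
// returns a 16-byte-aligned block of `size` bytes, or NULL on failure.
inline void* malloc_simd(std::size_t size)
{
#if defined(_WIN32)
    return _aligned_malloc(size, 16);
#else
    void* p = NULL;
    // posix_memalign returns 0 on success and leaves the block in p.
    return posix_memalign(&p, 16, size) == 0 ? p : NULL;
#endif
}
```

Memory obtained from posix_memalign can be released with free, whereas memory from _aligned_malloc must be released with _aligned_free, which is why a matching free_simd wrapper is needed for portable code.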
Implementation of free_simd is left as an exercise for the reader.
Short answer: use _mm_malloc and _mm_free from xmmintrin.h instead of _aligned_malloc and _aligned_free.
Discussion
You should not use _aligned_malloc, _aligned_free, posix_memalign, memalign, or whatever else when you are writing SSE/AVX code. These are all compiler/platform-specific functions (either MSVC or GCC or POSIX).
Intel introduced the functions _mm_malloc and _mm_free in the Intel compiler specifically for SIMD computations (see this). The other compilers targeting x86 added them too (just as they regularly add Intel intrinsics). In this sense they are the only cross-platform solution: they should be available in every compiler supporting SSE.
These functions are declared in the xmmintrin.h header. Any header for a later SSE/AVX version automatically includes the previous ones, so it would be enough to include only smmintrin.h or emmintrin.h
for example.
This doesn't directly answer your question, but I want to point out that your SSE code is incorrectly written; I would be surprised if it works. You need to use load/store operations on non-SSE types, and that includes aligned non-SSE types like your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you're working with SSE, the SSE data types are supposed to represent data in the SSE registers, while everything else is usually in system memory or non-SSE registers, so you need to load/store between registers and memory. This is how your function should look:
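The function that followed this sentence is not reproduced here. As a hedged sketch of the load/store style described above, assuming all three arrays are 16-byte aligned (for example, allocated with _mm_malloc), it might look like the following. The scalar tail loop for sizes that are not a multiple of 4 is an addition and is not part of the question's code.

```cpp
#include <cmath>
#include <xmmintrin.h>   // SSE intrinsics; also declares _mm_malloc / _mm_free

// Computes pResult[i] = sqrt(pArray1[i]^2 + pArray2[i]^2) + 0.5 for i in [0, nSize).
// Assumption: all three pointers are 16-byte aligned, as _mm_load_ps/_mm_store_ps require.
void myssefunction(const float* pArray1, const float* pArray2,
                   float* pResult, int nSize)
{
    const __m128 m0_5 = _mm_set_ps1(0.5f);         // all four lanes = 0.5
    int i = 0;
    for (; i + 4 <= nSize; i += 4)
    {
        __m128 a = _mm_load_ps(pArray1 + i);       // memory -> register
        __m128 b = _mm_load_ps(pArray2 + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(a, a), _mm_mul_ps(b, b));
        __m128 res = _mm_add_ps(_mm_sqrt_ps(sum), m0_5);
        _mm_store_ps(pResult + i, res);            // register -> memory
    }
    // Scalar tail for the last nSize % 4 elements (added for illustration).
    for (; i < nSize; ++i)
    {
        pResult[i] = std::sqrt(pArray1[i] * pArray1[i] +
                               pArray2[i] * pArray2[i]) + 0.5f;
    }
}
```

If alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps are the unaligned counterparts of the load and store calls used here.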
It is also worth noting that only a limited number of SSE registers are available at any one time (8 XMM registers in 32-bit mode, 16 in 64-bit mode). You can write code that tries to use more than that, but the compiler will then have to spill registers to memory.