Compilation of a simple C++ program using SSE intrinsics

Posted 2019-01-24 16:11

I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming

I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU.

Based on the article, I wrote a program that, for two arrays of length ARRAY_SIZE, calculates

fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

Here is the code:

#include <iostream>
#include <iomanip>
#include <ctime>
#include <stdlib.h>
#include <xmmintrin.h> // Contains the SSE compiler intrinsics
#include <malloc.h>
void myssefunction(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize/ 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;


    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}

int main(int argc, char *argv[])
{
  int ARRAY_SIZE = atoi(argv[1]);
  float* m_fArray1 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray2 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray3 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

  for (int i = 0; i < ARRAY_SIZE; ++i)
    {
      m_fArray1[i] = ((float)rand())/RAND_MAX;
      m_fArray2[i] = ((float)rand())/RAND_MAX;
    }

  myssefunction(m_fArray1 , m_fArray2 , m_fArray3, ARRAY_SIZE);

  _aligned_free(m_fArray1);
   _aligned_free(m_fArray2);
   _aligned_free(m_fArray3);

  return 0;
}

I get the following compilation error:

[Programming/SSE]$ g++ -g -Wall -msse sseintro.cpp 
sseintro.cpp: In function ‘int main(int, char**)’:
sseintro.cpp:41: error: ‘_aligned_malloc’ was not declared in this scope
sseintro.cpp:53: error: ‘_aligned_free’ was not declared in this scope
[Programming/SSE]$ 

Where am I messing up? Am I missing some header files? I seem to have included all the relevant ones.

Tags: c++ x86 sse simd
3 answers
走好不送
#2 · 2019-01-24 16:43

_aligned_malloc and _aligned_free are Microsoft-isms. Use posix_memalign or memalign (declared in <malloc.h>) on Linux et al. On Mac OS X you can just use malloc, as it is always 16-byte aligned. For portable SSE code you generally want to implement wrapper functions for aligned memory allocation, e.g.

void * malloc_simd(const size_t size)
{
#if defined _WIN32          // Windows (_WIN32 is predefined by the compiler)
    return _aligned_malloc(size, 16);
#elif defined __linux__     // Linux
    return memalign(16, size);
#elif defined __MACH__      // Mac OS X
    return malloc(size);
#else                       // other (use valloc for page-aligned memory)
    return valloc(size);
#endif
}

Implementation of free_simd is left as an exercise for the reader.
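
As a rough sketch of that exercise, the pair below keeps the malloc_simd name from above and adds a matching free_simd. It assumes only two cases, Windows (_aligned_malloc/_aligned_free) and a POSIX system that provides posix_memalign. The is_aligned_16 helper is a hypothetical addition used only for checking the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free
#if defined(_WIN32)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Returns 16-byte-aligned storage, or nullptr on failure.
void* malloc_simd(std::size_t size)
{
#if defined(_WIN32)
    return _aligned_malloc(size, 16);
#else
    // Assumption: any non-Windows target is POSIX and has posix_memalign.
    void* p = nullptr;
    // posix_memalign returns 0 on success; the alignment must be a power of
    // two and a multiple of sizeof(void*), which 16 satisfies.
    if (posix_memalign(&p, 16, size) != 0)
        return nullptr;
    return p;
#endif
}

// Releases memory obtained from malloc_simd.
void free_simd(void* p)
{
#if defined(_WIN32)
    _aligned_free(p);   // memory from _aligned_malloc must not go to free()
#else
    free(p);            // memory from posix_memalign is released with free()
#endif
}

// Hypothetical helper: true if p is aligned to a 16-byte boundary.
bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

The point of pairing the two functions is that the deallocator has to match the allocator: on Windows, a pointer from _aligned_malloc has to be released with _aligned_free, while on POSIX plain free is correct.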

爷的心禁止访问
#3 · 2019-01-24 16:46

Short answer: use _mm_malloc and _mm_free from xmmintrin.h instead of _aligned_malloc and _aligned_free.

Discussion

You should not use _aligned_malloc, _aligned_free, posix_memalign, memalign, or whatever else when you are writing SSE/AVX code. These are all compiler/platform-specific functions (either MSVC or GCC or POSIX).

Intel introduced the functions _mm_malloc and _mm_free in the Intel compiler specifically for SIMD computations. Other compilers targeting x86 added them too (just as they regularly add Intel intrinsics). In this sense they are the only cross-platform solution: they should be available in every compiler that supports SSE.

These functions are declared in the xmmintrin.h header. Any header for a later SSE/AVX version automatically includes the earlier ones, so it is enough to include only smmintrin.h or emmintrin.h, for example.
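
A minimal sketch of how _mm_malloc and _mm_free might be used, assuming a compiler that exposes them through xmmintrin.h (as GCC does). The function name fill_and_sum is hypothetical and exists only to exercise the allocation with an aligned store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics, plus _mm_malloc / _mm_free

// Hypothetical helper: allocates `count` floats on a 16-byte boundary,
// fills them with `value`, returns their sum, and frees the buffer.
// Returns -1.0f if the allocation fails.
float fill_and_sum(std::size_t count, float value)
{
    float* data = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    if (data == nullptr)
        return -1.0f;

    // _mm_store_ps requires a 16-byte-aligned address.
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);

    const __m128 v = _mm_set1_ps(value);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_store_ps(data + i, v);      // aligned store of 4 floats
    for (; i < count; ++i)
        data[i] = value;                // scalar fill for the last 0-3 elements

    float sum = 0.0f;
    for (std::size_t k = 0; k < count; ++k)
        sum += data[k];

    _mm_free(data);                     // must pair with _mm_malloc, not free()
    return sum;
}
```

In the question's main, the same substitution would apply: each _aligned_malloc(n, 16) call becomes _mm_malloc(n, 16), and each _aligned_free becomes _mm_free.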

来,给爷笑一个
#4 · 2019-01-24 16:53

This doesn't directly answer your question, but I want to point out that your SSE code is incorrectly written; I would be surprised if it works. You need to use load/store operations on non-SSE types, and that includes aligned non-SSE types like your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you're working with SSE, the SSE data types are supposed to represent data in the SSE registers, while everything else is usually in system memory or non-SSE registers, so you need to load from memory into a register and store from a register back to memory. This is how your function should look:

void myssefunction
(
    float* pArray1,                   // [in] first source array (16-byte aligned)
    float* pArray2,                   // [in] second source array (16-byte aligned)
    float* pResult,                   // [out] result array (16-byte aligned)
    int nSize                         // [in] size of all arrays
)
{
    const __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5
    // Process whole groups of 4 floats; a trailing 1-3 elements are not touched.
    for (int index = 0; index + 4 <= nSize; index += 4)
    {
        __m128 src1 = _mm_load_ps(pArray1 + index); // load 4 elements from memory into an SSE register
        __m128 src2 = _mm_load_ps(pArray2 + index); // load 4 elements from memory into an SSE register

        __m128 m1   = _mm_mul_ps(src1, src1);       // m1 = src1 * src1
        __m128 m2   = _mm_mul_ps(src2, src2);       // m2 = src2 * src2
        __m128 m3   = _mm_add_ps(m1, m2);           // m3 = m1 + m2
        __m128 m4   = _mm_sqrt_ps(m3);              // m4 = sqrt(m3)
        __m128 dest = _mm_add_ps(m4, m0_5);         // dest = m4 + 0.5

        _mm_store_ps(pResult + index, dest);        // store 4 elements from the SSE register to memory
    }
}

It is also worth noting that there is a limit on how many registers can be in use at a given time (8 XMM registers in 32-bit mode, 16 in 64-bit mode). You can write code that tries to use more than that, but this will cause register spilling.
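
The load/store function above only covers sizes that are a multiple of 4 and pointers that are 16-byte aligned. The sketch below is one possible variant, not the answer's own code: it uses the unaligned _mm_loadu_ps/_mm_storeu_ps intrinsics so any float pointer is accepted, and finishes the last 0-3 elements with a scalar loop. The name hypot_plus_half is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>

// Computes out[i] = sqrt(a[i]*a[i] + b[i]*b[i]) + 0.5 for i in [0, n).
// Hypothetical variant: unaligned loads/stores plus a scalar tail loop.
void hypot_plus_half(const float* a, const float* b, float* out, std::size_t n)
{
    const __m128 half = _mm_set1_ps(0.5f);
    std::size_t i = 0;

    // Vector part: 4 floats per iteration.
    for (; i + 4 <= n; i += 4)
    {
        __m128 x   = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        __m128 y   = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y));
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    // Scalar tail: the remaining n % 4 elements.
    for (; i < n; ++i)
        out[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
}
```

If the buffers are known to be 16-byte aligned, swapping in _mm_load_ps and _mm_store_ps for the vector part is the only change needed; the tail loop stays the same.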
