Using SSE instructions

Posted 2019-01-31 00:44

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor-specific?
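For reference, the plain loop being described probably looks something like the sketch below (the names and the exact mask are assumptions, not taken from the question):

#include <algorithm>
#include <climits>

// Hypothetical scalar version: mask each element, then track the running min and max.
void masked_min_max(const int* data, int n, int mask, int& out_min, int& out_max)
{
    out_min = INT_MAX;
    out_max = INT_MIN;
    for (int i = 0; i < n; ++i)
    {
        int v = data[i] & mask;            // bitwise AND with the mask
        out_min = std::min(out_min, v);    // the if-else part of the original loop
        out_max = std::max(out_max, v);
    }
}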

15 answers
何必那么认真
#2 · 2019-01-31 01:39

Have a look at inline assembler for C/C++; here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform, you should follow the recommendations many others have given here.

唯我独甜
#3 · 2019-01-31 01:41

SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.

Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.
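To make the trade-off concrete, a hand-written intrinsics version of the masked min/max loop might look like the sketch below. It uses _mm_min_epi32/_mm_max_epi32, which need SSE4.1 (plain SSE2 would have to do compare-and-blend instead), and it assumes n is a multiple of 4 and the data is 16-byte aligned; that alignment requirement is exactly the kind of care this answer is warning about.

#include <smmintrin.h>   // SSE4.1 intrinsics (_mm_min_epi32 / _mm_max_epi32)
#include <algorithm>
#include <climits>

// Hypothetical SSE sketch: process four ints per iteration.
// Assumes n % 4 == 0 and data is 16-byte aligned.
void masked_min_max_sse(const int* data, int n, int mask, int& out_min, int& out_max)
{
    __m128i vmask = _mm_set1_epi32(mask);
    __m128i vmin  = _mm_set1_epi32(INT_MAX);
    __m128i vmax  = _mm_set1_epi32(INT_MIN);

    for (int i = 0; i < n; i += 4)
    {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(data + i));
        v    = _mm_and_si128(v, vmask);   // mask four elements at once
        vmin = _mm_min_epi32(vmin, v);    // per-lane minimum
        vmax = _mm_max_epi32(vmax, v);    // per-lane maximum
    }

    // Reduce the four lanes down to scalar results.
    alignas(16) int mins[4], maxs[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(mins), vmin);
    _mm_store_si128(reinterpret_cast<__m128i*>(maxs), vmax);
    out_min = *std::min_element(mins, mins + 4);
    out_max = *std::max_element(maxs, maxs + 4);
}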

Deceive 欺骗
#4 · 2019-01-31 01:43

Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
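One common shape for that runtime dispatch is a function pointer chosen once at startup. A minimal sketch, assuming GCC or Clang (the function names are made up for illustration):

#include <cstdio>

// Two interchangeable implementations of the same operation.
static int sum_scalar(const int* p, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}

static int sum_sse2(const int* p, int n)
{
    // A real build would use SSE2 intrinsics here (compiled with -msse2);
    // the scalar fallback just keeps this sketch self-contained.
    return sum_scalar(p, n);
}

int main()
{
    // __builtin_cpu_supports is a GCC/Clang builtin that reads the CPUID flags for you.
    int (*sum_impl)(const int*, int) =
        __builtin_cpu_supports("sse2") ? sum_sse2 : sum_scalar;

    const int data[] = { 1, 2, 3, 4 };
    std::printf("sum = %d\n", sum_impl(data, 4));
    return 0;
}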

女痞
#5 · 2019-01-31 01:43

Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
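For example, a small sketch using GCC/Clang's <cpuid.h> helper; the bit positions are the ones documented for CPUID leaf 1:

#include <cpuid.h>   // GCC/Clang wrapper around the CPUID instruction
#include <cstdio>

int main()
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))   // leaf 1: feature flags
    {
        std::printf("SSE:    %s\n", (edx & (1u << 25)) ? "yes" : "no");
        std::printf("SSE2:   %s\n", (edx & (1u << 26)) ? "yes" : "no");
        std::printf("SSE4.1: %s\n", (ecx & (1u << 19)) ? "yes" : "no");
    }
    return 0;
}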

冷血范
#6 · 2019-01-31 01:45

Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:

union Vector4f
{
        // Easy constructor, defaulted to black/0 vector
    Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
        X(a), Y(b), Z(c), W(d) { }

        // Cast operator, for []
    inline operator float* ()
    { 
        return (float*)this;
    }

        // Const cast operator, for const []
    inline operator const float* () const
    { 
        return (const float*)this;
    }

    // ---------------------------------------- //

    inline Vector4f& operator += (const Vector4f &v)
    {
        for(int i=0; i<4; ++i)
            (*this)[i] += v[i];

        return *this;
    }

    inline Vector4f& operator += (float t)
    {
        for(int i=0; i<4; ++i)
            (*this)[i] += t;

        return *this;
    }

        // Vertex / Vector 
        // Lower case xyzw components
    struct {
        float x, y, z;
        float w;
    };

        // Upper case XYZW components
    struct {
        float X, Y, Z;
        float W;
    };
};

Just don't forget to include -msse -msse2 in your build flags!
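For illustration, a call site like this (hypothetical, not part of the original answer) is the kind of loop GCC can then turn into packed addps instructions:

// With e.g. g++ -O2 -msse -msse2, the loop below can compile to SSE adds.
Vector4f accumulate(const Vector4f* vectors, int count)
{
    Vector4f sum(0, 0, 0, 0);
    for (int i = 0; i < count; ++i)
        sum += vectors[i];
    return sum;
}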

走好不送
#7 · 2019-01-31 01:47

I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.

It would probably be much better for you to write very small loops and keep your data very tightly organized and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and will probably do a better job than you. (Just add -ftree-vectorize to your CXXFLAGS.)
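As a concrete (hypothetical) example, a loop as simple as the one below is exactly the shape the auto-vectorizer handles well:

// Compile with something like: g++ -O2 -ftree-vectorize -msse2 mask.cpp
// GCC can turn this loop into 128-bit pand operations on its own.
void mask_all(int* data, int n, int mask)
{
    for (int i = 0; i < n; ++i)
        data[i] &= mask;
}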

Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.
