C++: use SSE instructions for comparing rows of a huge vector

Posted 2019-02-09 01:55

Question:

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:

    int getDiff(int indx1, int indx2) {
        int result = 0;
        int pplus, pminus, tmp;

        for (int k = 0; k < 128; k += 2) {
            pplus = nodeL[indx2][k] - nodeL[indx1][k];
            pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];

            tmp = max(pplus, pminus);
            if (tmp > result) {
                result = tmp;
            }
        }
        return result;
    }

As you can see, the function loops over the two rows, does some subtractions, and returns the maximum at the end. This function will be called millions of times, so I was wondering whether it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.

Of course this is micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance.

Benchmark:

    int nofTestCases = 10000000;

    vector<int> nodeIds(nofTestCases);
    vector<int> goalNodeIds(nofTestCases);
    vector<int> results(nofTestCases);

    for (int l = 0; l < nofTestCases; l++) {
        nodeIds[l] = randomNodeID(18000000);
        goalNodeIds[l] = randomNodeID(18000000);
    }



    double time, result;

    time = timestamp();
    for (int l = 0; l < nofTestCases; l++) {
        results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
    }
    result = timestamp() - time;
    cout << result / nofTestCases << "s" << endl;

    time = timestamp();
    for (int l = 0; l < nofTestCases; l++) {
        results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
    }
    result = timestamp() - time;
    cout << result / nofTestCases << "s" << endl;

where

    int randomNodeID(int n) {
        return (int) (rand() / (RAND_MAX + 1.0) * n);
    }

    /** Returns a timestamp ('now') in seconds (incl. a fractional part). */
    inline double timestamp() {
        struct timeval tp;
        gettimeofday(&tp, NULL);
        return double(tp.tv_sec) + tp.tv_usec / 1000000.;
    }

Answer 1:

FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:

#include <smmintrin.h>

int getDiff_SSE(int indx1, int indx2)
{
    int result[4] __attribute__ ((aligned(16))) = { 0 };

    const int * const p1 = &nodeL[indx1][0];
    const int * const p2 = &nodeL[indx2][0];

    // Masks for conditional negation: a lane holding -1 gets negated
    // via (x ^ m) - m; a lane holding 0 passes through unchanged.
    const __m128i vke = _mm_set_epi32(0, -1, 0, -1);  // negates even-indexed elements
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);  // negates odd-indexed elements

    __m128i vresult = _mm_set1_epi32(0);

    for (int k = 0; k < 128; k += 4)
    {
        __m128i v1, v2, vmax;

        v1 = _mm_loadu_si128((__m128i *)&p1[k]);
        v2 = _mm_loadu_si128((__m128i *)&p2[k]);
        v1 = _mm_xor_si128(v1, vke);             // negate p1[k], p1[k+2]
        v2 = _mm_xor_si128(v2, vko);             // negate p2[k+1], p2[k+3]
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);            // even lanes: p2-p1 (pplus), odd lanes: p1-p2 (pminus)
        vresult = _mm_max_epi32(vresult, vmax);  // _mm_max_epi32 requires SSE4.1
    }
    _mm_store_si128((__m128i *)result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}


Answer 2:

You can probably get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than memory, so a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster just increases the amount of waiting it does.

The declaration of nodeL can have an effect on the performance so it's important to choose an efficient container for your data.

There is a threshold beyond which optimising does have a benefit, and that's when you're doing more computation between memory reads - i.e. when the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.

It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.



Answer 3:

This will be faster. The double dereference of a vector of vectors is expensive; caching one of the dereferences will help. I know it's not answering the posted question, but I think it will be a more helpful answer.

int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;

    const vector<int>& nodetemp1 = nodeL[indx1];
    const vector<int>& nodetemp2 = nodeL[indx2];

    for (int k = 0; k < 128; k += 2) {
        pplus = nodetemp2[k] - nodetemp1[k];
        pminus = nodetemp1[k + 1] - nodetemp2[k + 1];

        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}


Answer 4:

A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.

I've tried to rewrite it using AVX instructions via the vector class library here.

The original code on my system ran in 11.5s. With Neil Kirk's optimisation, it went down to 10.5s.

EDIT: Tested the code with a debugger rather than in my head!

int getDiff(std::vector<std::vector<int>>& nodeL, int row1, int row2) {
    Vec4i result(0);
    const std::vector<int>& nodetemp1 = nodeL[row1];
    const std::vector<int>& nodetemp2 = nodeL[row2];

    // Even lanes want nodeB - nodeA (pplus), odd lanes nodeA - nodeB (pminus).
    Vec8i mask(-1, 0, -1, 0, -1, 0, -1, 0);
    for (int k = 0; k < 128; k += 8) {
        Vec8i nodeA(nodetemp1[k], nodetemp1[k+1], nodetemp1[k+2], nodetemp1[k+3],
                    nodetemp1[k+4], nodetemp1[k+5], nodetemp1[k+6], nodetemp1[k+7]);
        Vec8i nodeB(nodetemp2[k], nodetemp2[k+1], nodetemp2[k+2], nodetemp2[k+3],
                    nodetemp2[k+4], nodetemp2[k+5], nodetemp2[k+6], nodetemp2[k+7]);
        Vec8i tmp = select(mask, nodeB - nodeA, nodeA - nodeB);
        Vec4i tmp_a(tmp[0], tmp[2], tmp[4], tmp[6]);
        Vec4i tmp_b(tmp[1], tmp[3], tmp[5], tmp[7]);
        Vec4i max_tmp = max(tmp_a, tmp_b);
        result = select(max_tmp > result, max_tmp, result);
    }
    // Take the maximum of the four lanes. horizontal_add would sum the four
    // partial maxima, which is not what the scalar version computes.
    return max(max(result[0], result[1]), max(result[2], result[3]));
}

The lack of branching speeds it up to 9.5s but still data is the biggest impact.

If you want to speed it up more, try changing the data structure to a single flat array/vector rather than a 2D one (i.e. a vector of vectors), as that will reduce cache pressure.

EDIT: I thought of something - you could add a custom allocator to ensure the 2*18M vectors are allocated in a contiguous block of memory, which lets you keep the data structure and still go through it quickly. But you'd need to profile it to be sure.

EDIT 2: Tested the code with a debugger rather than in my head! Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.



Tags: c++ vector sse