Fastest method of vectorized integer division by n

Based on the answers/comments of this question i wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of multiple integer division is faster, as following:

#include <emmintrin.h>  // SSE2

static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111};  // Dividend

__attribute__((noinline)) static void test_div_x86(unsigned i){
    for(; i; --i)
        x[0] /= i,
        x[1] /= i,
        x[2] /= i,
        x[3] /= i,
        x[4] /= i,
        x[5] /= i,
        x[6] /= i,
        x[7] /= i;
}

__attribute__((noinline)) static void test_div_sse(unsigned i){
    for(; i; --i){
        __m128i xmm0 = _mm_loadu_si128((const __m128i*)x);
        __m128 xmm1 = _mm_set1_ps(i);
        _mm_storeu_si128(
            (__m128i*)x,
            _mm_packs_epi32(
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpacklo_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                ),
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpackhi_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                )
            )
        );
    }
}

int main(){
    const unsigned runs = 40000000; // Choose a big number, so the compiler doesn't dare to unroll loops and optimize with constants
    test_div_x86(runs),
    test_div_sse(runs);
    return 0;
}

The results by GNU Gprof and tools parameters.

/*
gcc -O? -msse2 -pg -o test.o -c test.c
g++ -o test test.o -pg
test
gprof test.exe gmon.out
-----------------------------------
        test_div_sse(unsigned int)      test_div_x86(unsigned int)
-O0     2.26s                           1.10s
-O1     1.41s                           1.07s
-O2     0.95s                           1.09s
-O3     0.77s                           1.07s
*/

Now i'm confused why the x86 test barely gets optimized and the SSE test becomes faster though the expensive conversion to & from floating point. Furthermore i'd like to know how much results depend on compilers and architectures.

To summarize it: what's faster at the end: dividing one-by-one or the floating-point detour?

Dividing all elements of a vector by the same scalar can be done with integer multiply and shift. libdivide (C/C++, zlib license) provides some inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's more efficient if the same scalar will be applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.

re: your code: You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?

Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)

You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than the throughput of the FP dividers. If the data each iteration was independent (or there were at least several independent values, like how you have 8 array elements for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else if fully pipelined.

This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.

Other stuff in your code:

You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)

Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.

If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.

performance on recent Intel CPUs

divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.

So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.

If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.

Your timing results, showing that the scalar code doesn't improve with compiler optimizations, is because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop on the other hand only has one or two dependency chains, with every iteration dependent on the previous, so extra instructions adding latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.