memcpy moving 128 bit in linux

2020-07-20 03:53发布

问题:

I'm writing a device driver in linux for a PCIe device. This device driver performs several read and write to test the throughput. When I use the memcpy, the maximum payload for a TLP is 8 bytes ( on 64 bits architectures ). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this but the code doesn't compile ( AT&T/Intel syntax issue ).

  • There is a way to use that code inside linux ?
  • Does anyone know where I can found an implementation of a memcpy that moves 128 bits ?

回答1:

First of all you probably use GCC as the compiler and it uses the asm statement for inline assembler. When using that you will have to use a string literal for the assembler code (which will be copied into the assembler code before sending to the assembler - this means that the string should contain newline characters).

Second you will probably have to use AT&T syntax for the assembler.

Third GCC uses extended asm to pass variables between assembler and C.

Fourth you should probably avoid inline assembler when possible anyway as the compiler wont have the possibility to schedule instructions past an asm statement (this was true at least). Instead you could maybe make use of GCC extensions like the vector_size attribute:

typedef float v4sf __attribute__((vector_size(16)));

void fubar( v4sf *p, v4sf* q )
{
  v4sf p0 = *p++;
  v4sf p1 = *p++;
  v4sf p2 = *p++;
  v4sf p3 = *p++;

  *q++ = p0;
  *q++ = p1;
  *q++ = p2;
  *q++ = p3;
}

has the advantage that the compiler will produce code even if you compile for a processor that doesn't have the mmx registers, but perhaps some other 128-bit registers (or doesn't have vector registers at all).

Fifth you should investigate if the provided memcpy isn't fast enough. Often the memcpy is really optimized.

Sixth you should take precaution if you're using special registers in the Linux kernel, there are registers that aren't saved during context switch. The SSE registers are a part of these.

Seventh as you using this to test throughput you should consider if the processor is a significant bottleneck in the equation. Compare the actual execution of the code with the reads from/writes to RAM (do you hit or miss the cache?) or the reads from/write to the peripheral.

Eighth when moving data you should avoid moving big chunks of data from RAM to RAM and if it's to/from a peripheral that has limited bandwidth you should definitely consider using DMA for that. Remember that if it's access time that limits the performance the CPU will still be considered busy (although it can't run at 100% speed).



回答2:

Leaving this answer here for now, even though it's now clear the OP just wants a single 16B transfer. On Linux, his code is causing two 8B transfers over the PCIe bus.

For writing to MMIO space, it's worth trying movnti write-combining-store instructions. The source operand for movnti is a GP register, not a vector reg.

You can probably generate that with intrinsics, if you #include <immintrin.h> in your driver code. That should be fine in the kernel, as long as you're careful about what intrinsics you use. It doesn't define any globals.


So most of this section isn't very relevant.

On most CPUs (where rep movs is good), Linux's memcpy uses it. It only uses a fallback to an explicit loop for CPUs where rep movsq or rep movsb are not good choices.

When the size is a compile-time-constant, memcpy has an inline implementation using rep movsl (AT&T syntax for rep movsd), then for cleanup: non-rep movsw and movsb if needed. (Actually kinda clunky, IMO, since the size is a compile-time constant. Also doesn't take advantage of fast rep movsb on CPUs that have it.)

Intel CPUs since P6 have had at least fairly good rep movs implementations. See Andy Glew's comments on it.

But still, you're wrong about memcpy only moving in 64bit blocks, unless I'm misreading the code or you're on a platform where it decides to use the fallback loop.

Anyway, I don't think you're missing out on much perf by using the normal Linux memcpy, unless you've actually single-stepped your code and seen it doing something silly.

For large copies, you'll want to set up DMA anyway. CPU usage by your driver is important, not just the max throughput you can obtain on an otherwise-idle system. (Be careful of trusting microbenchmarks too much.)


Using SSE in the kernel means saving/restoring the vector registers. It's worth it for the RAID5/RAID6 code. That code may only run from a dedicated thread, rather than from contexts where the vector/FPU registers still have another process's data.

Linux's memcpy can be used from any context, so it avoids using anything but the usual integer registers. I did find an article about an SSE kernel memcpy patch, where Andi Kleen and Ingo Molnar both say it wouldn't be good to always use SSE for memcpy. Maybe there could be a special bulk-memcpy for big copies where it's worth saving the vector regs.

You can use SSE in the kernel, but you have to wrap it in kernel_fpu_begin() and kernel_fpu_end(). On Linux 3.7 and later, kernel_fpu_end() actually does the work of restoring FPU state, so don't use a lot of fpu_begin/fpu_end pairs in a function. Also note that kernel_fpu_begin disables pre-emption, and you must not "do anything that might fault or sleep".

In theory, saving just one vector reg, like xmm0, would be good. You'd have to make sure you used SSE, not AVX instructions, because you need to avoid zeroing the upper part of ymm0 / zmm0. You might cause an AVX+SSE stall when you return to code that was using ymm regs. Unless you want to do a full save of the vector regs, you can't run vzeroupper. And even to do that, you'd need to detect AVX support...

However, doing even this one-reg save/restore would require you to take the same precautions as kernel_fpu_begin, and disable pre-emption. Since you'd be storing to your own private save slot (prob. on the stack), rather than to task_struct.thread.fpu, I'm not sure that even disabling pre-emption is enough to guarantee that user-space FPU state won't be corrupted. Maybe it is, but maybe it isn't, and I'm not a kernel hacker. Disabling interrupts to guard against this, too, is probably worse than just using kernel_fpu_begin()/kernel_fpu_end() to trigger a full FPU state save using XSAVE/XRSTOR.



回答3:

The link you mentioned is using non-temporal stores. I have discussed this several times before, for example here and here. I would suggest your read those before proceeding further.

But if you really want to produce the inline assembly code in the link you mentioned here is how you do it: use intrinsics instead.

The fact that you cannot compile that code with GCC is exactly one of the reasons intrinsics were created. Inline assembly has to be written differently for 32-bit and 64-bit code and typically has different syntax for each compiler. Intrinsics solve all these issues.

The following code should compile with GCC, Clang, ICC, and MSVC in both 32-bit and 64-bit mode.

#include "xmmintrin.h"
void X_aligned_memcpy_sse2(char* dest, const char* src, const unsigned long size)
{
    for(int i=size/128; i>0; i--) {
        __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
        _mm_prefetch(src + 128, _MM_HINT_NTA);
        _mm_prefetch(src + 160, _MM_HINT_NTA);
        _mm_prefetch(src + 194, _MM_HINT_NTA);
        _mm_prefetch(src + 224, _MM_HINT_NTA);

        xmm0 = _mm_load_si128((__m128i*)&src[   0]);
        xmm1 = _mm_load_si128((__m128i*)&src[  16]);
        xmm2 = _mm_load_si128((__m128i*)&src[  32]);
        xmm3 = _mm_load_si128((__m128i*)&src[  48]);
        xmm4 = _mm_load_si128((__m128i*)&src[  64]);
        xmm5 = _mm_load_si128((__m128i*)&src[  80]);
        xmm6 = _mm_load_si128((__m128i*)&src[  96]);
        xmm7 = _mm_load_si128((__m128i*)&src[ 112]);

        _mm_stream_si128((__m128i*)&dest[   0], xmm0);
        _mm_stream_si128((__m128i*)&dest[  16], xmm1);
        _mm_stream_si128((__m128i*)&dest[  32], xmm2);
        _mm_stream_si128((__m128i*)&dest[  48], xmm3);
        _mm_stream_si128((__m128i*)&dest[  64], xmm4);
        _mm_stream_si128((__m128i*)&dest[  80], xmm5);
        _mm_stream_si128((__m128i*)&dest[  96], xmm6);
        _mm_stream_si128((__m128i*)&dest[ 112], xmm7);
        src  += 128;
        dest += 128;
    }
}

Note that src and dest need to be 16 byte aligned and that size needs to be a multiple of 128.

I don't, however, advice to use this code. In the cases when non-temporal stores are useful loop unrolling is useless and explicit pre-fetching is rarely ever useful. You can simply do

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++) {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

more details as to why can be found here.


Here is the assembly from the X_aligned_memcpy_sse2 function using intrinsics with GCC -O3 -S -masm=intel. Notice that it's essentially the same as here.

    shr rdx, 7
    test    edx, edx
    mov eax, edx
    jle .L1
.L5:
    sub rsi, -128
    movdqa  xmm6, XMMWORD PTR [rsi-112]
    prefetchnta [rsi]
    prefetchnta [rsi+32]
    prefetchnta [rsi+66]
    movdqa  xmm5, XMMWORD PTR [rsi-96]
    prefetchnta [rsi+96]
    sub rdi, -128
    movdqa  xmm4, XMMWORD PTR [rsi-80]
    movdqa  xmm3, XMMWORD PTR [rsi-64]
    movdqa  xmm2, XMMWORD PTR [rsi-48]
    movdqa  xmm1, XMMWORD PTR [rsi-32]
    movdqa  xmm0, XMMWORD PTR [rsi-16]
    movdqa  xmm7, XMMWORD PTR [rsi-128]
    movntdq XMMWORD PTR [rdi-112], xmm6
    movntdq XMMWORD PTR [rdi-96], xmm5
    movntdq XMMWORD PTR [rdi-80], xmm4
    movntdq XMMWORD PTR [rdi-64], xmm3
    movntdq XMMWORD PTR [rdi-48], xmm2
    movntdq XMMWORD PTR [rdi-128], xmm7
    movntdq XMMWORD PTR [rdi-32], xmm1
    movntdq XMMWORD PTR [rdi-16], xmm0
    sub eax, 1
    jne .L5
.L1:
    rep ret