I'm writing a device driver in Linux for a PCIe device. The driver performs several reads and writes to test the throughput. When I use memcpy, the maximum payload of a TLP is 8 bytes (on 64-bit architectures). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this, but the code doesn't compile (AT&T/Intel syntax issue).

- Is there a way to use that code inside Linux?
- Does anyone know where I can find an implementation of a memcpy that moves 128 bits?
Leaving this answer here for now, even though it's now clear the OP just wants a single 16B transfer. On Linux, his code is causing two 8B transfers over the PCIe bus.
For writing to MMIO space, it's worth trying `movnti` write-combining stores. The source operand for `movnti` is a GP register, not a vector reg. You can probably generate it with intrinsics, if you `#include <immintrin.h>` in your driver code. That should be fine in the kernel, as long as you're careful about what intrinsics you use. It doesn't define any globals.
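For example, a minimal sketch (assuming x86-64 and a write-combining MMIO mapping, e.g. from `ioremap_wc()`; `mmio_write_16` is a made-up helper name):

```c
#include <immintrin.h>

/* Two adjacent movnti stores; into a WC mapping they may combine
 * into a single 16-byte transaction (not architecturally guaranteed). */
static inline void mmio_write_16(volatile void *dst, long long lo, long long hi)
{
    long long *d = (long long *)dst;   /* cast away volatile for the intrinsic */
    _mm_stream_si64(d, lo);            /* movnti: GP-register source */
    _mm_stream_si64(d + 1, hi);        /* movnti */
    _mm_sfence();                      /* order the NT stores before later accesses */
}
```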
So most of this section isn't very relevant.

On most CPUs (where `rep movs` is good), Linux's memcpy uses it. It only uses a fallback to an explicit loop for CPUs where `rep movsq` or `rep movsb` are not good choices.

When the size is a compile-time constant, memcpy has an inline implementation using `rep movsl` (AT&T syntax for `rep movsd`), then for cleanup: non-rep `movsw` and `movsb` if needed. (Actually kinda clunky, IMO, since the size is a compile-time constant. Also doesn't take advantage of fast `rep movsb` on CPUs that have it.)

Intel CPUs since P6 have had at least fairly good `rep movs` implementations. See Andy Glew's comments on it.

But still, you're wrong about memcpy only moving in 64-bit blocks, unless I'm misreading the code or you're on a platform where it decides to use the fallback loop.
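For illustration (this is a sketch of the idiom, not the kernel's actual implementation), `rep movsq` looks like this in GNU C extended asm, assuming x86-64 and a byte count that's a multiple of 8:

```c
#include <stddef.h>

/* rep movsq copies RCX 8-byte words from (RSI) to (RDI). */
static inline void copy_rep_movsq(void *dst, const void *src, size_t bytes)
{
    size_t qwords = bytes / 8;
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)  /* all three get modified */
                 :
                 : "memory");
}
```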
Anyway, I don't think you're missing out on much perf by using the normal Linux `memcpy`, unless you've actually single-stepped your code and seen it doing something silly.

For large copies, you'll want to set up DMA anyway. CPU usage by your driver is important, not just the max throughput you can obtain on an otherwise-idle system. (Be careful of trusting microbenchmarks too much.)
Using SSE in the kernel means saving/restoring the vector registers. It's worth it for the RAID5/RAID6 code. That code may only run from a dedicated thread, rather than from contexts where the vector/FPU registers still have another process's data.
Linux's memcpy can be used from any context, so it avoids using anything but the usual integer registers. I did find an article about an SSE kernel memcpy patch, where Andi Kleen and Ingo Molnar both say it wouldn't be good to always use SSE for memcpy. Maybe there could be a special bulk-memcpy for big copies where it's worth saving the vector regs.
You can use SSE in the kernel, but you have to wrap it in `kernel_fpu_begin()` and `kernel_fpu_end()`. On Linux 3.7 and later, `kernel_fpu_end()` actually does the work of restoring FPU state, so don't use a lot of fpu_begin/fpu_end pairs in a function. Also note that `kernel_fpu_begin()` disables pre-emption, and you must not "do anything that might fault or sleep".

In theory, saving just one vector reg, like xmm0, would be good. You'd have to make sure you used SSE, not AVX instructions, because you need to avoid zeroing the upper part of ymm0 / zmm0. You might cause an AVX+SSE stall when you return to code that was using ymm regs. Unless you want to do a full save of the vector regs, you can't run `vzeroupper`. And even to do that, you'd need to detect AVX support...
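Stepping back to the plain `kernel_fpu_begin()`/`kernel_fpu_end()` route, a minimal sketch of a single 16B copy (assuming a kernel where the declarations live in `<asm/fpu/api.h>`; `sse_memcpy_16` is a made-up helper, and both pointers must be 16-byte aligned):

```c
#include <asm/fpu/api.h>   /* <asm/i387.h> on older kernels */

static void sse_memcpy_16(void *dst, const void *src)
{
    kernel_fpu_begin();                      /* saves SIMD state, disables pre-emption */
    asm volatile("movdqa (%0), %%xmm0\n\t"   /* 16B aligned load */
                 "movntdq %%xmm0, (%1)"      /* 16B non-temporal store */
                 :
                 : "r"(src), "r"(dst)
                 : "xmm0", "memory");
    kernel_fpu_end();                        /* restores state (the real work on 3.7+) */
}
```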
However, doing even that one-reg save/restore would require you to take the same precautions as `kernel_fpu_begin()`, and disable pre-emption. Since you'd be storing to your own private save slot (prob. on the stack), rather than to `task_struct.thread.fpu`, I'm not sure that even disabling pre-emption is enough to guarantee that user-space FPU state won't be corrupted. Maybe it is, but maybe it isn't, and I'm not a kernel hacker. Disabling interrupts to guard against this, too, is probably worse than just using `kernel_fpu_begin()`/`kernel_fpu_end()` to trigger a full FPU state save using XSAVE/XRSTOR.

The link you mentioned is using non-temporal stores. I have discussed this several times before, for example here and here. I would suggest you read those before proceeding further.
But if you really want to reproduce the inline assembly code from the link you mentioned, here is how to do it: use intrinsics instead.
The fact that you cannot compile that code with GCC is exactly one of the reasons intrinsics were created. Inline assembly has to be written differently for 32-bit and 64-bit code and typically has different syntax for each compiler. Intrinsics solve all these issues.
The following code should compile with GCC, Clang, ICC, and MSVC in both 32-bit and 64-bit mode.
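A sketch of that intrinsics version, reconstructed to match the surrounding description (aligned SSE2 loads, non-temporal stores, 128-byte unrolling, and explicit prefetching; the prefetch distance here is a guess):

```c
#include <emmintrin.h>  /* SSE2 */

void X_aligned_memcpy_sse2(void* dest, const void* src, unsigned long size)
{
    char* d = (char*)dest;
    const char* s = (const char*)src;

    /* 128 bytes (8 x 16B vectors) per iteration. */
    for (unsigned long i = size / 128; i; i--) {
        _mm_prefetch(s + 128, _MM_HINT_NTA);  /* prefetch distance is a tunable guess */
        _mm_prefetch(s + 192, _MM_HINT_NTA);

        __m128i x0 = _mm_load_si128((const __m128i*)(s +   0));
        __m128i x1 = _mm_load_si128((const __m128i*)(s +  16));
        __m128i x2 = _mm_load_si128((const __m128i*)(s +  32));
        __m128i x3 = _mm_load_si128((const __m128i*)(s +  48));
        __m128i x4 = _mm_load_si128((const __m128i*)(s +  64));
        __m128i x5 = _mm_load_si128((const __m128i*)(s +  80));
        __m128i x6 = _mm_load_si128((const __m128i*)(s +  96));
        __m128i x7 = _mm_load_si128((const __m128i*)(s + 112));

        _mm_stream_si128((__m128i*)(d +   0), x0);
        _mm_stream_si128((__m128i*)(d +  16), x1);
        _mm_stream_si128((__m128i*)(d +  32), x2);
        _mm_stream_si128((__m128i*)(d +  48), x3);
        _mm_stream_si128((__m128i*)(d +  64), x4);
        _mm_stream_si128((__m128i*)(d +  80), x5);
        _mm_stream_si128((__m128i*)(d +  96), x6);
        _mm_stream_si128((__m128i*)(d + 112), x7);

        s += 128;
        d += 128;
    }
    _mm_sfence();  /* make the NT stores ordered before returning */
}
```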
Note that `src` and `dest` need to be 16-byte aligned and that `size` needs to be a multiple of 128.

I don't, however, advise using this code. In the cases when non-temporal stores are useful, loop unrolling is useless and explicit pre-fetching is rarely ever useful. You can simply do something like this minimal sketch:
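```c
#include <emmintrin.h>

/* 16B at a time, no unrolling, no explicit prefetch.
 * src and dest must be 16-byte aligned, size a multiple of 16. */
void copy_nt(char* dest, const char* src, unsigned long size)
{
    for (unsigned long i = 0; i < size; i += 16)
        _mm_stream_si128((__m128i*)(dest + i),
                         _mm_load_si128((const __m128i*)(src + i)));
    _mm_sfence();   /* fence after the last non-temporal store */
}
```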
More details as to why can be found here.
The assembly GCC generates for the intrinsics version of `X_aligned_memcpy_sse2` (compiled with `GCC -O3 -S -masm=intel`) is essentially the same as the hand-written assembly here.

First of all, you probably use GCC as the compiler, and it uses the `asm` statement for inline assembler. When using that, you will have to use a string literal for the assembler code (which will be copied into the assembler output before being sent to the assembler; this means that the string should contain newline characters).

Second, you will probably have to use AT&T syntax for the assembler.

Third, GCC uses extended asm to pass variables between assembler and C.
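Putting those three points together, a tiny illustrative example (a plain C assignment would of course generate the same code):

```c
/* Copy one 64-bit word using GCC extended asm, AT&T syntax. */
static inline void copy_one_qword(unsigned long long *dst,
                                  const unsigned long long *src)
{
    unsigned long long tmp;
    asm volatile("movq (%1), %0\n\t"   /* AT&T: source operand comes first */
                 "movq %0, (%2)"
                 : "=&r"(tmp)          /* output: early-clobber temporary */
                 : "r"(src), "r"(dst)  /* inputs passed in from C */
                 : "memory");          /* the store isn't visible to the compiler */
}
```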
Fourth, you should probably avoid inline assembler when possible anyway, as the compiler won't have the possibility to schedule instructions past an `asm` statement (this was true at least). Instead you could maybe make use of GCC extensions like the `vector_size` attribute, as in the sketch below. This has the advantage that the compiler will produce code even if you compile for a processor that doesn't have `xmm` registers, but perhaps some other 128-bit registers (or doesn't have vector registers at all).
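A minimal sketch (the type name `v16qi` and the alignment assumptions are mine):

```c
#include <stddef.h>

/* A 16-byte vector type; GCC maps it onto whatever vector registers
 * the target has, or falls back to scalar code. */
typedef char v16qi __attribute__((vector_size(16)));

/* src and dest assumed 16-byte aligned, size a multiple of 16. */
static void copy_vec(void *dest, const void *src, size_t size)
{
    v16qi *d = dest;
    const v16qi *s = src;
    for (size_t i = 0; i < size / 16; i++)
        d[i] = s[i];   /* one full-width vector load + store per iteration */
}
```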
Fifth, you should investigate whether the provided `memcpy` really isn't fast enough. Often the `memcpy` is really well optimized.

Sixth, you should take precautions if you're using special registers in the Linux kernel; there are registers that aren't saved during a context switch. The SSE registers are a part of these.
Seventh, as you are using this to test throughput, you should consider whether the processor is a significant bottleneck in the equation. Compare the actual execution of the code with the reads from/writes to RAM (do you hit or miss the cache?) and the reads from/writes to the peripheral.

Eighth, when moving data you should avoid moving big chunks of data from RAM to RAM, and if it's to/from a peripheral with limited bandwidth, you should definitely consider using DMA for that. Remember that if it's access time that limits the performance, the CPU will still be considered busy (although it can't run at 100% speed).