I'm trying to create C# app which uses dll library which contains C++ code and inline assembly. In function test_MMX I want to add two arrays of specific length.
extern "C" __declspec(dllexport) void __stdcall test_MMX(int *first_array,int *second_array,int length)
{
__asm
{
mov ecx,length;
mov esi,first_array;
shr ecx,1;
mov edi,second_array;
label:
movq mm0,QWORD PTR[esi];
paddd mm0,QWORD PTR[edi];
add edi,8;
movq QWORD PTR[esi],mm0;
add esi,8;
dec ecx;
jnz label;
}
}
After run app it's showing this warning:
warning C4799: function 'test_MMX' has no EMMS instruction.
When I want to measure time of running this function C# in miliseconds it returns this value: -922337203685477
instead of (for example 0,0141
)...
private Stopwatch time = new Stopwatch();
time.Reset();
time.Start();
test_MMX(first_array, second_array, length);
time.Stop();
TimeSpan interval = time.Elapsed;
return trvanie.TotalMilliseconds;
Any ideas how to fix it please ?
Since MMX aliases over the floating-point registers, any routine that uses MMX instructions must end with the
EMMS
instruction. This instruction "clears" the registers, making them available for use by the x87 FPU once again. (Which any C or C++ calling convention for x86 will assume is safe.)The compiler is warning you that you have written a routine that uses MMX instructions but does not end with the
EMMS
instruction. That's a bug waiting to happen, as soon as some FPU instruction tries to execute.This is a huge disadvantage of MMX, and the reason why you really can't freely intermix MMX and floating-point instructions. Sure, you could just throw
EMMS
instructions around, but it is a slow, high-latency instruction, so this kills performance. SSE had the same limitations as MMX in this regard, at least for integer operations. SSE2 was the first instruction set to address this problem, since it used its own discrete register set. Its registers are also twice as wide as MMX's are, so you can do even more at a time. Since SSE2 does everything that MMX does, but faster, easier, and more efficiently, and is supported by the Pentium 4 and later, it is quite rare that anyone needs to write new code today that uses MMX. If you can use SSE2, you should. It will be faster than MMX. Another reason not to use MMX is that it is not supported in 64-bit mode.Anyway, the correct way to write the MMX code would be:
Note that, in addition to the
EMMS
instruction (which, of course, is placed outside of the loop), I made a few additional changes:ESI
andEDI
registers in inline assembly, since the inline assembler will detect their use and generate additional instructions that push/pop them accordingly. In fact, it will do this with all non-volatile registers. And if you need additional registers, then you need them, and this is a nice feature. But in this code, you're only using three general-purpose registers, and in the__stdcall
calling convention, there are three general-purpose registers that are specifically defined as volatile (i.e., can be freely clobbered by any function):EAX
,EDX
, andECX
. So you should be using those registers for maximum speed. As such, I've changed your use ofESI
toEAX
, and your use ofEDI
toEDX
. This will improve the code that you can't see, the prologue and epilogue automatically generated by the compiler.You have a potential speed trap lurking here, though, and that is alignment. To obtain maximum speed, MMX instructions need to operate on data that is aligned on 8-byte boundaries. In a loop, misaligned data has a compounding negative effect on performance: not only is the data misaligned the first time through the loop, exerting a significant performance penalty, but it is guaranteed to be misaligned each subsequent time through the loop, too. So for this code to have any chance of being fast, the caller needs to guarantee that
first_array
andsecond_array
are aligned on 8-byte boundaries.If you can't guarantee that, then the function should really have extra code added to it to fix up misalignments. Essentially, you want to do a couple of non-vector operations (on individual bytes) at the beginning, before starting the loop, until you've reached a suitable alignment. Then, you can start issuing the vectorized MMX instructions.
(Unaligned loads are no longer penalized on modern processors, but if you were targeting modern processors, you'd be writing SSE2 code. On the older processors where you need to run MMX code, alignment will be a big deal, and misaligned data will kill your performance.)
Now, this inline assembly won't produce particularly efficient code. When you use inline assembly, the compiler always generates prologue and epilogue code for the function. That isn't terrible, since it's outside of the critical inner loop, but still—it's cruft you don't need. Worse, jumps in inline assembly blocks tend to confuse MSVC's inline assembler and cause it to generate sub-optimal code. It is overly cautious, preventing you from doing something that could corrupt the stack or cause other external side effects, which is nice, except that the whole reason you're writing inline assembly is (presumably) because you desire maximum performance.
(It should go without saying, but if you don't need the maximum possible performance, you should just write the code in C (or C++) and let the compiler optimize it. It does a darn good job in the majority of cases.)
If you do need the maximum possible performance, and have decided that the compiler-generated code just won't cut it, then a better alternative to inline assembly is the use of intrinsics. Intrinsics will generally map one-to-one to assembly-language instructions, but the compiler does a lot better job optimizing around them.
Here's my version of your code, using MMX intrinsics:
It does the same thing, but more efficiently by delegating more to the compiler's optimizer. A couple of notes:
length
as an unsigned integer, I assume that your interface requires that it actually be an unsigned integer. (And, if so, I wonder why you don't declare it as such in the function's signature.) To achieve the same effect, I've cast it to anunsigned int
, which is subsequently used as thecounter
. (If I hadn't done that, I'd have to have either done a shift operation on a signed integer, which risks undefined behavior, or a division by two, for which the compiler would have generated slower code to correctly deal with the sign bit.)*reinterpret_cast<__m64*>
business scattered throughout looks scary, but is actually safe—at least, relatively speaking. That's what you're supposed to do with the MMX intrinsics. The MMX data type is__m64
, which you can think of as being roughly equivalent to anmm?
register. It is 64 bits in length, and loads and stores are accomplished by casting. These get translated directly intoMOVQ
instructions.do
…while
loop. This means the test of the loop condition only has to be done at the bottom of the loop, rather than once at the top and once at the bottom._mm_empty()
intrinsic causes anEMMS
instruction to be emitted.Just for grins, let's see what the compiler transformed this into. This is the output from MSVC 16 (VS 2010), targeting x86-32 and optimizing for speed over size (though it makes no difference in this particular case):
It is recognizably similar to your original code, but does a couple of things differently. In particular, it tracks the array pointers differently, in a way that it (and I) believe is slightly more efficient than your original code, since it does less work inside of the loop. It also breaks apart your
PADDD
instruction so that both of its operands are MMX registers, instead of the source being a memory operand. Again, this tends to make the code more efficient at the expense of clobbering an additional MMX register, but we've got plenty of those to spare, so it's certainly worth it.Better yet, as the optimizer improves in newer versions of the compiler, code that is written using intrinsics may get even better!
Of course, rewriting the function to use intrinsics doesn't solve the alignment problem, but I'm assuming you have already dealt with that on the caller side. If not, you'll need to add code to handle it.
If you wanted to use SSE2—perhaps that would be
test_SSE2
and you would dynamically delegate to the appropriate implementation depending on the current processor's feature bits—then you could do it like this:I've written this code not assuming alignment, so it will work when the loads and stores are misaligned. For maximum speed on many older architectures, SSE2 requires 16-byte alignment, and if you can guarantee that the source and destination pointers are thusly aligned, you can use slightly faster instructions (e.g.,
MOVDQA
as opposed toMOVDQU
). As mentioned above, on newer architectures (at least Sandy Bridge and later, perhaps earlier), it doesn't matter.To give you an idea of how SSE2 is basically just a drop-in replacement for MMX on Pentium 4 and later, except that you also get to do operations that are twice as wide, look at the code this compiles to:
As for the final question about getting negative values from the .NET Stopwatch class, I would normally guess that would be due to an overflow. In other words, your code executed too slowly, and the timer wrapped around. Kevin Gosse pointed out, though, that this is apparently a bug in the implementation of the
Stopwatch
class. I don't know much more about it, since I don't really use it. If you want a good microbenchmarking library, I use and recommend Google Benchmark. However, it is for C++, not C#.While you're benchmarking, definitely take the time to time the code generated by the compiler when you write it the naïve way. Say, something like:
You just might be pleasantly surprised at how fast the code is after the compiler gets finished auto-vectorizing the loop. :-) Remember that less code does not necessarily mean faster code. All of that extra code is required to deal with alignment issues, which I've diplomatically skirted throughout this answer. If you scroll down, at
$LL4@Naive_Pack
, you'll find an inner loop very similar to what we've been considering here.