Write x86 asm functions portably (win/linux/osx),

par2 has a small and fairly clean C++ codebase, which I think builds fine on GNU/Linux, OS X, and Windows (with MSVC++).

I'd like to incorporate an x86-64 asm version of the one function that takes nearly all the CPU time. (mailing list posts with more details. My implementation/benchmark here.)

Intrinsics would be the obvious solution, but gcc doesn't generate good enough code for getting one byte at a time from a 64-bit register for use as an index into a LUT. I might also take the time to schedule instructions so each uop cache line holds a multiple of 4 uops, since uop throughput is the bottleneck even when the input/output buffer is a decent size.
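To make the access pattern concrete, here is a minimal C++ sketch of "one byte at a time from a 64-bit register as a LUT index". It is not the actual par2 inner loop: the name lookup_bytes, the 32-bit table entries, and the XOR combine step are all placeholders chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not from par2): combine one table entry per byte of
// `word`. The XOR combine step and the table element type are assumptions.
inline std::uint32_t lookup_bytes(std::uint64_t word,
                                  const std::uint32_t (&lut)[256]) {
    std::uint32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc ^= lut[word & 0xFF];  // low byte of the register indexes the table
        word >>= 8;               // shift the next byte into the low position
    }
    return acc;
}
```

Hand-written asm can use the byte sub-registers directly for some of these extractions, which is the kind of code the compiler is not producing here.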

I'd prefer not to introduce a build-dependency on yasm, since many people have gcc installed, but not yasm.

Is there a way to write a function in asm in a separate file that gcc / clang and MSVC can assemble? The goals are:

no extra software as a build-dep. (no YASM).
only one version of each asm function. (no maintaining MASM & AT&T versions of the same code.)

Par2cmdline's build system is autoconf/automake for Unix, and an MSVC .sln for Windows.

I know the GNU assembler has a .intel_syntax noprefix directive, but that only changes the instruction format, not the other assembler directives, e.g. .align 16 vs. align 16. My code is fairly simple and small, so it would be OK to work around the different directives with C-preprocessor #defines, if that can work.

I'm assuming that doing CPU-detection and setting a function pointer based on the result shouldn't be a problem in C++, even if I have to use some #ifdef conditional compilation for that.
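A minimal sketch of that dispatch, under some assumptions: process_generic, process_fast, and select_process are hypothetical names, the XOR loop is a stand-in for the real work, and SSSE3 is only a placeholder for whatever feature the asm version would actually need. In a real build, process_fast would be the asm routine declared extern "C" rather than a C++ function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>  // __cpuid
#endif

using process_fn = void (*)(std::uint8_t*, const std::uint8_t*, std::size_t);

// Portable fallback; stands in for the existing C++ inner loop.
static void process_generic(std::uint8_t* dst, const std::uint8_t* src,
                            std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] ^= src[i];
}

// Hypothetical placeholder for the asm routine; same behavior as the fallback
// so the sketch links without an assembler.
static void process_fast(std::uint8_t* dst, const std::uint8_t* src,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] ^= src[i];
}

// Placeholder feature test: SSSE3 is an assumption, not a par2 requirement.
static bool cpu_has_feature() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 1);
    return (regs[2] & (1 << 9)) != 0;  // CPUID leaf 1, ECX bit 9 = SSSE3
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;  // non-x86 or unknown compiler: always use the fallback
#endif
}

// Pick the implementation once; callers keep the returned pointer.
static process_fn select_process() {
    process_fn fn = cpu_has_feature() ? process_fast : process_generic;
    assert(fn != nullptr);
    return fn;
}
```

The #ifdef is confined to the one detection function, so the rest of the C++ code only ever sees a plain function pointer.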

If there isn't a solution to what I'm hoping for, I'll probably introduce a build-depend on yasm and have a ./configure --no-asm option to disable asm speedups for people building on x86 without yasm available.

My preferred plan for handling the different calling conventions in the Windows and Linux ABIs was to use __attribute__((sysv_abi)) on the C prototypes for my asm functions. Then I only have to write the function prologue for the SysV ABI. Does MSVC have anything like that, which would put args into regs according to the SysV ABI for certain functions? (BTW, this tickled a compiler bug, so be careful with this idea if you want your code to work with current gcc.)
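For gcc and clang, the prototype side of that plan might look like the sketch below. ASM_ABI and sum_bytes are hypothetical names, and the C++ body is only a stand-in so the example links without an assembler. I'm not aware of an MSVC equivalent of sysv_abi, so under MSVC the macro expands to nothing and the function would use the Microsoft convention.

```cpp
#include <cstddef>
#include <cstdint>

// ASM_ABI is a hypothetical macro name. With gcc/clang on x86-64 it forces
// the SysV calling convention even when targeting Windows (e.g. MinGW).
// Elsewhere it is empty, which is an assumption about the fallback behavior.
#if defined(__GNUC__) && defined(__x86_64__)
#  define ASM_ABI __attribute__((sysv_abi))
#else
#  define ASM_ABI
#endif

// Prototype the C++ callers would see: under SysV, src arrives in rdi and
// len in rsi, and the result is returned in rax.
extern "C" ASM_ABI std::uint64_t sum_bytes(const std::uint8_t* src,
                                           std::size_t len);

// Stand-in body written in C++; the real definition would live in the asm file.
extern "C" ASM_ABI std::uint64_t sum_bytes(const std::uint8_t* src,
                                           std::size_t len) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < len; ++i) total += src[i];
    return total;
}
```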

While I have no good solution for removing the dependency on a particular assembler, I do have a suggestion on how to deal with the two different 64-bit calling conventions: Microsoft x64 versus the SysV ABI.

The lowest common denominator is the Microsoft x64 calling convention, since it passes only the first four arguments in registers. If you limit yourself to four register arguments and use macros to name the registers, you can easily make your code assemble for both Unix (Linux/BSD/OS X) and Windows.

For example, look at the file strcat64.asm in Agner Fog's asmlib:

%IFDEF  WINDOWS
%define Rpar1   rcx                    ; function parameter 1
%define Rpar2   rdx                    ; function parameter 2
%define Rpar3   r8                     ; function parameter 3
%ENDIF
%IFDEF  UNIX
%define Rpar1   rdi                    ; function parameter 1
%define Rpar2   rsi                    ; function parameter 2
%define Rpar3   rdx                    ; function parameter 3
%ENDIF

        push    Rpar1                  ; dest
        push    Rpar2                  ; src
        call    A_strlen               ; length of dest
        push    rax                    ; strlen(dest)
        mov     Rpar1, [rsp+8]         ; src
        call    A_strlen               ; length of src
        pop     Rpar1                  ; strlen(dest)
        pop     Rpar2                  ; src
        add     Rpar1, [rsp]           ; dest + strlen(dest)
        lea     Rpar3, [rax+1]         ; strlen(src)+1
        call    A_memcpy               ; copy
        pop     rax                    ; return dest
        ret

;A_strcat ENDP

I don't think four registers is really a limitation. If you're writing something in assembly, it's because you want the best efficiency, in which case the function-call overhead should be negligible compared to the work done in the function itself. Pushing or popping a few values to or from the stack when calling the function should not make a measurable difference in performance.