I compiled a libsvm benchmarking app which calls svm_predict() 100 times on the same image using the same model. libsvm is compiled statically (MSVC 2017) by directly including svm.cpp and svm.h in my project.
EDIT: adding benchmark details
// needs <chrono> and <iostream>; model/input are the already loaded svm_model and svm_node array
const int counter = 100;
long long total_time = 0; // accumulated microseconds

for (int i = 0; i < counter; i++)
{
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    double label = svm_predict(model, input);
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    total_time += duration;
    std::cout << "\n\n\n" << total_time << " label:" << label << " duration:" << duration << "\n\n\n";
}
This is the loop I benchmark; I haven't made any major modifications to the libsvm code.
After 100 runs the average for one run is 4.7 ms, with no difference whether or not I use AVX instructions. To make sure the compiler generates the correct instructions, I used the Intel Software Development Emulator (SDE) to check the instruction mix.
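(The histograms below are SDE's instruction-mix output; the invocation is along the lines of sde.exe -mix -- benchmark.exe, which writes the counts to sde-mix-out.txt; the executable name here is just a placeholder.)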
with AVX:
*isa-ext-AVX 36578280
*isa-ext-SSE 4
*isa-ext-SSE2 4
*isa-set-SSE 4
*isa-set-SSE2 4
*scalar-simd 36568174
*sse-scalar 4
*sse-packed 4
*avx-scalar 36568170
*avx128 8363
*avx256 1765
without AVX:
*isa-ext-SSE 11781
*isa-ext-SSE2 36574119
*isa-set-SSE 11781
*isa-set-SSE2 36574119
*scalar-simd 36564559
*sse-scalar 36564559
*sse-packed 21341
I would expect to get some performance improvement. I know that avx128/256/512 are not used that much, but still. I have an i7-8550U CPU; do you think that if I ran the same test on a Skylake i9 X-series I would see a bigger difference?
EDIT: I added the instruction mix for each binary.
With AVX:
ADD 16868725
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417120
CMPXCHG_LOCK 1
CPUID 3
CQO 12
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496372
JB 325
JBE 5
JL 7101
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 806
JNLE 61
JNS 1
JNZ 22568320
JS 2
JZ 8465164
LEA 16829868
MOV 42209230
MOVSD_XMM 4
MOVSXD 1141
MOVUPS 4
MOVZX 3684
MUL 12
NEG 72
NOP 4219
NOT 1
OR 14
POP 1869
PUSH 1870
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 519
SUB 6530
TEST 5616533
VADDPD 594
VADDSD 8445597
VCOMISD 3
VCVTSI2SD 3603
VEXTRACTF128 6
VFMADD132SD 12
VFMADD231SD 6
VHADDPD 6
VMOVAPD 12
VMOVAPS 2375
VMOVDQU 1
VMOVSD 11256384
VMOVUPD 582
VMULPD 582
VMULSD 8451540
VPXOR 1
VSUBSD 8407425
VUCOMISD 3600
VXORPD 2362
VXORPS 3603
VZEROUPPER 4
XCHG 8
XGETBV 1
XOR 8414763
*total 213991340
Without AVX:
ADD 16869910
ADDPD 1176
ADDSD 8445609
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417408
CMPXCHG_LOCK 1
COMISD 3
CPUID 3
CQO 12
CVTDQ2PD 3603
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496369
JB 325
JBE 5
JL 7392
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 803
JNLE 61
JNS 1
JNZ 22568317
JS 2
JZ 8465164
LEA 16829548
MOV 42209235
MOVAPS 7073
MOVD 3603
MOVDQU 2
MOVSD_XMM 11256376
MOVSXD 1141
MOVUPS 2344
MOVZX 3684
MUL 12
MULPD 1170
MULSD 8451546
NEG 72
NOP 4159
NOT 1
OR 14
POP 1865
PUSH 1866
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 516
SUB 6515
SUBSD 8407425
TEST 5616533
UCOMISD 3600
UNPCKHPD 6
XCHG 8
XGETBV 1
XOR 8414745
XORPS 2364
*total 214000270
Almost all the arithmetic instructions you are listing work on scalars, e.g. (V)SUBSD means SUBtract Scalar Double. The V in front essentially just means that the AVX (VEX) encoding is used (this also clears the upper half of the register, which the SSE instructions don't do). But given the instructions you listed, there should be barely any runtime difference.

Modern x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of the XMM vector registers. It's somewhat better than x87 (more registers, and a flat register set), but it's still only one result per instruction.
There are a few thousand packed SIMD instructions, vs. ~36 million scalar instructions, so only a relatively unimportant part of the code got auto-vectorized and could benefit from 256-bit vectors.
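To see where those ~36 million scalar operations come from: libsvm stores each feature vector as a sparse array of (index, value) pairs, and its kernel evaluation walks two such arrays while matching indices (see Kernel::dot and k_function in svm.cpp). Here is a simplified sketch of that pattern, not the verbatim library code:

// Simplified sketch of libsvm's sparse dot-product pattern (cf. Kernel::dot in svm.cpp).
struct node { int index; double value; };  // same shape as libsvm's svm_node

double sparse_dot(const node *px, const node *py)
{
    double sum = 0.0;
    while (px->index != -1 && py->index != -1)  // index -1 terminates a vector
    {
        if (px->index == py->index)
        {
            // one scalar multiply + one scalar add per matched pair:
            // MULSD/ADDSD without /arch:AVX, VMULSD/VADDSD with it
            sum += px->value * py->value;
            ++px;
            ++py;
        }
        else if (px->index > py->index)
            ++py;
        else
            ++px;
    }
    return sum;
}

The data-dependent branching on the indices keeps the compiler from combining several of these multiplies and adds into packed 256-bit VMULPD/VADDPD; /arch:AVX only re-encodes the same one-double-at-a-time operations with a VEX prefix. To actually profit from 256-bit vectors, the hot loops would have to work on dense arrays so that four doubles can be handled per instruction.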