Disable AVX-optimized functions in glibc (LD_HWCAP

2019-01-12 00:20发布

Modern x86_64 linux with glibc will detect that CPU has support of AVX extension and will switch many string functions from generic implementation to AVX-optimized version (with help of ifunc dispatchers: 1, 2).

This feature can be good for performance, but it prevents several tool like valgrind (older libVEXs, before valgrind-3.8) and gdb's "target record" (Reverse Execution) from working correctly (Ubuntu "Z" 17.04 beta, gdb 7.12.50.20170207-0ubuntu2, gcc 6.3.0-8ubuntu1 20170221, Ubuntu GLIBC 2.24-7ubuntu2):

$ cat a.c
#include <string.h>
#define N 1000
int main(){
        char src[N], dst[N];
        memcpy(dst, src, N);
        return 0;
}
$ gcc a.c -o a -fno-builtin
$ gdb -q ./a
Reading symbols from ./a...(no debugging symbols found)...done.
(gdb) start
Temporary breakpoint 1 at 0x724
Starting program: /home/user/src/a

Temporary breakpoint 1, 0x0000555555554724 in main ()
(gdb) record
(gdb) c
Continuing.
Process record does not support instruction 0xc5 at address 0x7ffff7b60d31.
Process record: failed to record execution log.

Program stopped.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:416
416             VMOVU   (%rsi), %VEC(4)
(gdb) x/i $pc
=> 0x7ffff7b60d31 <__memmove_avx_unaligned_erms+529>:   vmovdqu (%rsi),%ymm4

There is error message "Process record does not support instruction 0xc5" from gdb's implementation of "target record", because AVX instructions are not supported by the record/replay engine (sometimes the problem is detected on _dl_runtime_resolve_avx function): https://sourceware.org/ml/gdb/2016-08/msg00028.html "some AVX instructions are not supported by process record", https://bugs.launchpad.net/ubuntu/+source/gdb/+bug/1573786, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=836802, https://bugzilla.redhat.com/show_bug.cgi?id=1136403

Solution proposed in https://sourceware.org/ml/gdb/2016-08/msg00028.html "You can recompile libc (thus ld.so), or hack __init_cpu_features and thus __cpu_features at runtime (see e.g. strcmp)." or set LD_BIND_NOW=1, but recompiled glibc still has AVX, and ld bind-now doesn't help.

I heard that there are /etc/ld.so.nohwcap and LD_HWCAP_MASK configurations in glibc. Can they be used to disable ifunc dispatching to AVX-optimized string functions in glibc?

How does glibc (rtld?) detects AVX, using cpuid, with /proc/cpuinfo (probably not), or HWCAP aux (LD_SHOW_AUXV=1 /bin/echo |grep HWCAP command gives AT_HWCAP: bfebfbff)?

3条回答
Root(大扎)
2楼-- · 2019-01-12 00:50

Not the best or complete solution, just a smallest bit-editing kludge to allow valgrind and gdb record for the my task.

Lekensteyn asks:

how to mask out AVX/SSE without recompiling glibc

I did full rebuild of unmodified glibc, which is rather easy in debian and ubuntu: just sudo apt-get source glibc, sudo apt-get build-dep glibc and cd glibc-*/; dpkg-buildpackage -us -uc (manual to get the ld.so without stripped debugging information.

Then I did binary (bit) patching of the output ld.so file, in the function used by __get_cpu_features. Target function was compiled from get_common_indeces of source file sysdeps/x86/cpu-features.c under the name of get_common_indeces.constprop.1 (it is just next after the __get_cpu_features in the binary code). It has several cpuids, first one is cpuid eax=1 "Processor Info and Feature Bits"; and later there is check "jle 0x6" and jump down around the code "cpuid eax=7 ecx=0 Extended Features" just to get AVX2 status. There is the code which was compiled into this logic:

get_common_indeces (struct cpu_features *cpu_features,
            unsigned int *family, unsigned int *model,
            unsigned int *extended_model, unsigned int *stepping)
{ ...
  if (cpu_features->max_cpuid >= 7)
    __cpuid_count (7, 0,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].eax,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ebx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ecx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].edx);

The cpu_features->max_cpuid was filled in init_cpu_features of the same file in __cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx); line. It was easier to disable the if statement by replacing jle after cmp 0x6 with jg (byte 0x7e to 0x7f). (Actually this binary patch was reapplied manually to the __get_cpu_features function of real system ld-linux.so.2 - first jle before mov 7 eax; xor ecx,ecx; cpuid changed into jg.)

Recompiled package and modified ld.so were not installed into the system; I used commandline syntax of ld.so ./my_program (or mv ld.so /some/short/path.so and patchelf --set-interpreter ./my_program).

Other possible solutions:

  • try to use more recent valgrind & gdb record tools
  • try to use older glibc
  • implement missing instruction emulation in gdb record if it is not done
  • do source code patching around if (cpu_features->max_cpuid >= 7) in glibc and recompile
  • do source code patching around avx2-enabled string functions in glibc and recompile
查看更多
Explosion°爆炸
3楼-- · 2019-01-12 00:56

I heard that there are /etc/ld.so.nohwcap and LD_HWCAP_MASK configurations in glibc. Can they be used to disable ifunc dispatching to AVX-optimized string functions in glibc?

Yes: setting LD_HWCAP_MASK=0 will make GLIBC pretend that none of the CPU capabilities are available. Code.

Setting the mask to 0 is likely to trigger an error, you'll likely need to figure out the precise bit that controls AVX, and mask just that bit.

查看更多
【Aperson】
4楼-- · 2019-01-12 01:04

There does not seem a straightforward runtime method to patch feature detection. This detection happens rather early in the dynamic linker (ld.so).

Binary patching the linker seems the easiest method at the moment. @osgx described one method where a jump is overwritten. Another approach is just to fake the cpuid result. Normally cpuid(eax=0) returns the highest supported function in eax while the manufacturer IDs are returned in registers ebx, ecx and edx. We have this snippet in glibc 2.25 sysdeps/x86/cpu-features.c:

__cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);

/* This spells out "GenuineIntel".  */
if (ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69)
  {
      /* feature detection for various Intel CPUs */
  }
/* another case for AMD */
else
  {
    kind = arch_kind_other;
    get_common_indeces (cpu_features, NULL, NULL, NULL, NULL);
  }

The __cpuid line translates to these instructions in /lib/ld-linux-x86-64.so.2 (/lib/ld-2.25.so):

172a8:       31 c0                   xor    eax,eax
172aa:       c7 44 24 38 00 00 00    mov    DWORD PTR [rsp+0x38],0x0
172b1:       00 
172b2:       c7 44 24 3c 00 00 00    mov    DWORD PTR [rsp+0x3c],0x0
172b9:       00 
172ba:       0f a2                   cpuid  

So rather than patching branches, we could as well change the cpuid into a nop instruction which would result in invocation of the last else branch (as the registers will not contain "GenuineIntel"). Since initially eax=0, cpu_features->max_cpuid will also be 0 and the if (cpu_features->max_cpuid >= 7) will also be bypassed.

Binary patching cpuid(eax=0) by nop this can be done with this utility (works for both x86 and x86-64):

#!/usr/bin/env python
import re
import sys

infile, outfile = sys.argv[1:]
d = open(infile, 'rb').read()
# Match CPUID(eax=0), "xor eax,eax" followed closely by "cpuid"
o = re.sub(b'(\x31\xc0.{0,32})\x0f\xa2', b'\\1\x66\x90', d)
assert d != o
open(outfile, 'wb').write(o)

That was the easy part. Now, I did not want to replace the system-wide dynamic linker, but execute only one particular program with this linker. Sure, that can be done with ./ld-linux-x86-64-patched.so.2 ./a, but the naive gdb invocations failed to set breakpoints:

$ gdb -q -ex "set exec-wrapper ./ld-linux-x86-64-patched.so.2" -ex start ./a
Reading symbols from ./a...done.
Temporary breakpoint 1 at 0x400502: file a.c, line 5.
Starting program: /tmp/a 
During startup program exited normally.
(gdb) quit
$ gdb -q -ex start --args ./ld-linux-x86-64-patched.so.2 ./a
Reading symbols from ./ld-linux-x86-64-patched.so.2...(no debugging symbols found)...done.
Function "main" not defined.
Temporary breakpoint 1 (main) pending.
Starting program: /tmp/ld-linux-x86-64-patched.so.2 ./a
[Inferior 1 (process 27418) exited normally]
(gdb) quit                                                                                                                                                                         

A manual workaround is described in How to debug program with custom elf interpreter? It works, but it is unfortunately a manual action using add-symbol-file. It should be possible to automate it a bit using GDB Catchpoints though.

An alternative approach that does not binary linking is LD_PRELOADing a library that defines custom routines for memcpy, memove, etc. This will then take precedence over the glibc routines. The full list of functions is available in sysdeps/x86_64/multiarch/ifunc-impl-list.c. Current HEAD has more symbols compared to the glibc 2.25 release, in total (grep -Po 'IFUNC_IMPL \(i, name, \K[^,]+' sysdeps/x86_64/multiarch/ifunc-impl-list.c):

memchr, memcmp, __memmove_chk, memmove, memrchr, __memset_chk, memset, rawmemchr, strlen, strnlen, stpncpy, stpcpy, strcasecmp, strcasecmp_l, strcat, strchr, strchrnul, strrchr, strcmp, strcpy, strcspn, strncasecmp, strncasecmp_l, strncat, strncpy, strpbrk, strspn, strstr, wcschr, wcsrchr, wcscpy, wcslen, wcsnlen, wmemchr, wmemcmp, wmemset, __memcpy_chk, memcpy, __mempcpy_chk, mempcpy, strncmp, __wmemset_chk,

查看更多
登录 后发表回答