How do I exit with EXIT_SUCCESS after strict-mode seccomp is set? Is it correct practice to call syscall(SYS_exit, EXIT_SUCCESS);
at the end of main?
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
int main(int argc, char **argv) {
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
    //return EXIT_SUCCESS; // does not work
    //_exit(EXIT_SUCCESS); // does not work
    // syscall(__NR_exit, EXIT_SUCCESS); // (EDIT) This works! Is this the ultimate answer and the right way to exit success from seccomp-ed programs?
    syscall(SYS_exit, EXIT_SUCCESS); // (EDIT) works; SYS_exit equals __NR_exit
}
// gcc seccomp.c -o seccomp && ./seccomp; echo "${?}" # I want 0
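The exit paths tried in the program above can also be compared side by side by running each one in a forked child and inspecting its wait status from the parent. This is only a minimal sketch: run_strict_child() is a hypothetical helper name, not something from the original program, and it assumes a kernel built with seccomp support.

```c
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fork a child, switch it to strict mode, then leave
 * either through the raw exit syscall (use_raw_exit != 0) or through the
 * C library's _exit(). Returns the child's wait status, or -1 on error. */
static int run_strict_child(int use_raw_exit)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == -1)
            _exit(EXIT_FAILURE);              /* not restricted yet, so this is safe */
        if (use_raw_exit)
            syscall(SYS_exit, EXIT_SUCCESS);  /* exit(2): permitted in strict mode */
        _exit(EXIT_SUCCESS);                  /* glibc issues exit_group(2): SIGKILL */
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

In the parent, WIFEXITED() and WEXITSTATUS() on the returned status show a normal exit with code 0 for run_strict_child(1), while WIFSIGNALED() and WTERMSIG() report SIGKILL for run_strict_child(0).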
As explained on eigenstate.org and in seccomp(2), strict mode permits only four system calls: read(2), write(2), _exit(2) (but not exit_group(2)) and sigreturn(2).

As a result, one would expect _exit() to work, but it is a wrapper function that invokes exit_group(2), which is not allowed in strict mode ([1], [2]), so the process gets killed. This is even documented in the exit(2) Linux man page: since glibc 2.3, the _exit() wrapper invokes exit_group(2) in order to terminate all threads in the process.

The same happens with the return statement, which ends up killing your process in much the same way as _exit().

Stracing the process provides further confirmation (for this to show up, you must not set PR_SET_SECCOMP; just comment out the prctl() call). I got similar output for both non-working cases: in each, exit_group() is called, which explains everything.

Now, as you correctly stated, "SYS_exit equals __NR_exit"; for example, mit.syscall.h defines SYS_exit as __NR_exit. So the last two calls are equivalent, i.e. you can use whichever you like, and the program should then exit with status 0.
PS

You could of course define a filter yourself and install it via PR_SET_SECCOMP with SECCOMP_MODE_FILTER, as explained in the eigenstate link, to allow _exit() (or, strictly speaking, exit_group(2)), but do that only if you really need to and know what you are doing.

The problem occurs because the GNU C library uses the
exit_group syscall on Linux, if it is available, instead of exit for the _exit() function (see sysdeps/unix/sysv/linux/_exit.c for verification), and as documented in man 2 prctl, the exit_group syscall is not allowed by the strict seccomp filter.

Because the _exit() function call occurs inside the C library, we cannot interpose it with our own version (one that would just do the exit syscall). (The normal process cleanup is done elsewhere; on Linux, the _exit() function only does the final syscall that terminates the process.)

We could ask the GNU C library developers to use the exit_group syscall on Linux only when there is more than one thread in the current process, but unfortunately that would not be easy, and even if it were added right now, it would take quite some time for the feature to become available in most Linux distributions.

Fortunately, we can ditch the default strict filter and define our own instead. There is a small difference in behaviour: the apparent signal that kills the process will change from
SIGKILL to SIGSYS. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)

Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the list of allowed syscalls, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered (here we only add exit_group() compared to the strict filter), so making it a bit difficult is okay.

The code, say example.c, has been verified to work on a 4.4 kernel (it should work on kernels 3.5 or later) on x86-64 (for both x86 and x86-64, i.e. 32-bit and 64-bit binaries). It should work on all Linux architectures, however, and it does not require or use the libseccomp library.

Compile it using e.g. gcc, then run it directly or under strace to see the syscalls and library calls made.

The strict_filter BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes each compare it to an acceptable syscall number and, on a match, jump to the final opcode, which allows the syscall. Otherwise the second-to-last opcode kills the process.

Note that although the documentation refers to
sigreturn being the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn. (sigreturn was deprecated in favour of rt_sigreturn ages ago.)

Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c in the Linux kernel sources), so modifying the data later does not affect the filter in any way. In other words, having the structures static const has zero security impact.

I used static since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const to put the data into the read-only data section of the ELF binary.

The form of a BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs) is simple: the accumulator (the syscall number) is compared to nr. If they are equal, then the next equals opcodes are skipped. Otherwise, the next differs opcodes are skipped.

Since the equals cases jump to the very final opcode, you can add new opcodes at the top (that is, just after the initial opcode), incrementing the equals skip count for each one.
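The filter described above can be sketched roughly as follows. This is a reconstruction from the description, not the original example.c: the names install_strict_filter() and run_filtered_child() are hypothetical, the PR_SET_NO_NEW_PRIVS call is there because an unprivileged process needs it before installing a filter, and, like the description, the sketch only looks at the syscall number and does not check the architecture field.

```c
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* The strict-mode list plus exit_group. Each match skips forward to the
 * final ALLOW opcode; anything else falls through to KILL. */
static const struct sock_filter strict_filter[] = {
    /* Load the syscall number into the accumulator. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group,   1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Hypothetical helper: install the filter. Returns 0 on success, -1 on error. */
static int install_strict_filter(void)
{
    static const struct sock_fprog prog = {
        .len = sizeof strict_filter / sizeof strict_filter[0],
        /* The kernel copies the opcodes, so dropping const here is harmless. */
        .filter = (struct sock_filter *)strict_filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1)
        return -1;
    return 0;
}

/* Hypothetical helper: run the filter in a forked child and return its wait
 * status (-1 on error). If forbidden is nonzero, the child first makes a
 * syscall that is not on the list. */
static int run_filtered_child(int forbidden)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (install_strict_filter() == -1)
            _exit(EXIT_FAILURE);
        if (forbidden)
            syscall(SYS_getppid);   /* not allowed: child dies, reported as SIGSYS */
        _exit(EXIT_SUCCESS);        /* exit_group(2) is on the list now */
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

To allow one more syscall, a new BPF_JUMP line would go right after the load opcode, with an equals skip count of 6, one more than the line that follows it.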
Note that printf() will not work after the seccomp filter is installed, because internally the C library wants to do an fstat syscall (on standard output) and a brk syscall to allocate some memory for a buffer.
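Since write is among the allowed syscalls, one way around this is to bypass stdio and call write() directly. A minimal sketch, assuming a hypothetical helper named write_str() (it is not part of the original code); it needs no buffer allocation and issues only write syscalls:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a NUL-terminated string to fd using only the
 * write syscall, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_str(int fd, const char *s)
{
    const char *p = s;
    const char *end = s + strlen(s);

    while (p < end) {
        ssize_t n = write(fd, p, (size_t)(end - p));
        if (n > 0)
            p += n;             /* short write: continue with the rest */
        else if (n == -1 && errno == EINTR)
            continue;           /* interrupted before writing anything: retry */
        else
            return -1;
    }
    return 0;
}
```

For example, write_str(STDOUT_FILENO, "done\n") could take the place of a printf("done\n") call made after the filter is installed.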