I'd like to implement a sandbox by ptrace()ing a process I start and all the children it creates (including grandchildren etc.). The ptrace() parent process, i.e. the supervisor, would be a simple C or Python program, and conceptually it would limit filesystem access (based on the path name and the access direction (read or write)) and socket access (e.g. disallowing socket creation).
What should I pay attention to so that the ptrace()d process and its children (recursively) won't be able to bypass the sandbox? Is there anything special the supervisor should do at fork() time to avoid race conditions? Is it possible to read the filename arguments of e.g. rename() from the child process without a race condition?
Here is what I've already planned to do:
- use PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK | PTRACE_O_TRACECLONE to avoid (some) race conditions when fork()ing
- disallow all system calls by default, and compose a whitelist of allowed system calls
- make sure that the *at() system call variants (such as openat) are properly protected
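As a rough illustration of the last two points, here is a minimal Python sketch of the decision logic only (no ptrace calls). The syscall whitelist, the path prefixes, the `decide` function and the `fd_paths` table (which the supervisor would have to maintain itself) are all hypothetical names made up for this example. The check is purely lexical and does not resolve symlinks, so it is not a complete policy.

```python
import os

# Hypothetical policy, for illustration only.
ALLOWED_SYSCALLS = {"read", "write", "close", "exit_group", "open", "openat"}
PATH_SYSCALLS = {"open", "openat"}      # calls whose path argument is inspected
READ_PREFIXES = ("/usr/", "/lib/")
WRITE_PREFIXES = ("/tmp/sandbox/",)
AT_FDCWD = -100                         # value of AT_FDCWD in <fcntl.h> on Linux


def decide(name, path=None, dirfd=AT_FDCWD, writing=False, cwd="/", fd_paths=None):
    """Return True if the call may proceed; anything unknown is denied."""
    if name not in ALLOWED_SYSCALLS:
        return False                    # default deny
    if name not in PATH_SYSCALLS:
        return True
    if dirfd == AT_FDCWD:
        base = cwd
    else:
        # fd_paths maps directory fds to paths (assumed to be tracked elsewhere).
        base = (fd_paths or {}).get(dirfd)
        if base is None:
            return False                # unknown dirfd: deny
    # An absolute path ignores the base, matching openat() semantics.
    # normpath() is lexical only: ".." is collapsed, symlinks are NOT resolved.
    full = os.path.normpath(os.path.join(base, path))
    prefixes = WRITE_PREFIXES if writing else READ_PREFIXES + WRITE_PREFIXES
    return any(full.startswith(p) for p in prefixes)
```

For example, `decide("openat", path="../etc/passwd", dirfd=3, fd_paths={3: "/usr"})` is rejected because the path is checked after being joined with the directory that fd 3 refers to.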
What else should I pay attention to?
The major problem is that many syscall arguments, like filenames, are passed to the kernel as userspace pointers. Any task that is allowed to run simultaneously and has write access to the memory that the pointer points to can effectively modify these arguments after they are inspected by your supervisor and before the kernel acts on them. By the time the kernel follows the pointer, the pointed-to contents may have been deliberately changed by another schedulable task (process or thread) with access to that memory. For example:
Thread 1                                     Supervisor            Thread 2
-------------------------------------------------------------------------------------------------------------
strcpy(filename, "/dev/null");
open(filename, O_RDONLY);
                                             Check filename - OK
                                                                   strcpy(filename, "/home/user/.ssh/id_rsa");
(in kernel) opens "/home/user/.ssh/id_rsa"
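The same interleaving can be replayed deterministically as a small Python simulation. Nothing here touches ptrace or real threads: the `bytearray` merely stands in for memory shared between the two threads, and `run_race` is a made-up name for this sketch.

```python
def run_race(attacker_rewrites=True):
    """Replay the timeline step by step; return (check_passed, path_kernel_saw)."""
    # Thread 1: strcpy(filename, "/dev/null"); open(filename, O_RDONLY);
    shared = bytearray(b"/dev/null")

    # Supervisor: reads the argument through the pointer and checks it.
    check_passed = bytes(shared) == b"/dev/null"

    # Thread 2: overwrites the same buffer after the check.
    if attacker_rewrites:
        shared[:] = b"/home/user/.ssh/id_rsa"

    # Kernel: follows the pointer only now, so it sees the current contents.
    path_kernel_saw = bytes(shared)
    return check_passed, path_kernel_saw
```

With the rewrite in place the check still passes, but the path the "kernel" reads is no longer the one that was checked.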
One way to stop this is to disallow calling clone() with the CLONE_VM flag, and in addition to prevent any creation of writeable MAP_SHARED memory mappings (or at least keep track of them so that you deny any syscall that tries to directly reference data from such a mapping). You could also copy any such argument into a non-shared bounce buffer before allowing the syscall to proceed. Disallowing CLONE_VM will effectively prevent any threaded application from running in the sandbox.
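A minimal sketch of those two flag checks, assuming the supervisor has already read the raw flag arguments out of the tracee's registers. The numeric values are the Linux ones from `<linux/sched.h>` and `<sys/mman.h>`; the function names are hypothetical.

```python
CLONE_VM = 0x00000100   # <linux/sched.h>: child shares the caller's address space
PROT_WRITE = 0x2        # <sys/mman.h>
MAP_SHARED = 0x01       # <sys/mman.h>; this bit is also set in MAP_SHARED_VALIDATE


def clone_allowed(flags):
    """Deny any clone() that would share memory with its parent."""
    return not (flags & CLONE_VM)


def mmap_allowed(prot, flags):
    """Deny mappings that are both shared and writeable."""
    return not ((flags & MAP_SHARED) and (prot & PROT_WRITE))
```

This only covers the mapping as it is created; it is an assumption of the sketch that protections are not changed afterwards.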
The alternative is to SIGSTOP every other process in the traced group around every potentially dangerous syscall, wait for them to actually stop, then allow the syscall to proceed. After it returns, you then SIGCONT them (unless they were already stopped). Needless to say, this may have a significant performance impact.
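The stop/continue bracket might look roughly like this in Python. It is only a sketch using plain child processes and `os.waitpid(..., os.WUNTRACED)`; a real supervisor would see these stops through its own ptrace wait loop instead. `stop_others` and `resume` are hypothetical helper names.

```python
import os
import signal


def stop_others(pids, active_pid, already_stopped=frozenset()):
    """SIGSTOP every pid except active_pid; return the set we stopped ourselves."""
    stopped_by_us = set()
    for pid in pids:
        if pid == active_pid or pid in already_stopped:
            continue
        os.kill(pid, signal.SIGSTOP)
        # Block until the kernel reports that the process has actually stopped.
        _, status = os.waitpid(pid, os.WUNTRACED)
        if os.WIFSTOPPED(status):
            stopped_by_us.add(pid)
    return stopped_by_us


def resume(stopped_by_us):
    """SIGCONT only the processes we stopped, leaving earlier stops alone."""
    for pid in stopped_by_us:
        os.kill(pid, signal.SIGCONT)
```

The supervisor would call `stop_others(...)` before letting the dangerous syscall run and `resume(...)` once it has returned.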
(There are also analogous problems with syscall arguments that are passed on the stack, and with shared open file tables.)
Doesn't ptrace only get notifications after-the-fact? I don't think you have a chance to actually stop the syscall from happening, only to kill it as fast as you can once you see something "evil".
It seems like you're looking for something more like SELinux or AppArmor, where you can guarantee that not even one illegal call gets through.