According to
https://github.com/joblib/joblib/issues/180, and Is there a safe way to create a subprocess from a thread in python?
the Python multiprocessing module does not allow use from within threads. Is this true?
My understanding is that its fine to fork from threads, as long as you
aren't holding a threading.Lock when you do so (in the current thread? anywhere in the process?). However, Python's documentation is silent on whether threading.Lock objects are safely shared after a fork.
There's also this: locks shared from the logging module causes issues with fork. https://bugs.python.org/issue6721
I'm not sure how this issue arises. It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock). If so, does using multiprocessing really provide any protection against this (since I'm free to create my multiprocessing.Pool after threading.Lock is created and entered by other threads, and after threads have started that using the not-fork-safe logging module) -- the multiprocessing module docs are also silent about whether multiprocessing.Pools should be allocated before Locks.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock).
It is not a design error, rather, fork()
predates single-process multithreading. The state of all locks is copied into the child process because they're just objects in memory; the entire address-space of the process is copied as is in fork. There are only bad alternatives: either copy all threads over fork, or deny forking in multithreaded application.
Therefore, fork()
ing in a multithreading program was never the safe thing to do, unless then followed by execve()
or exit()
in the child process.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
No. Nothing makes it safe to combine threads and forks, it cannot be done.
The problem is that when you have multiple threads in a process, after fork()
system call you cannot continue safely running the program in POSIX systems.
For example, Linux manuals fork(2)
:
- After a
fork(2)
in a multithreaded program, the child can safely call
only async-signal-safe functions (see signal(7)
) until such time as it
calls execve(2)
.
I.e. it is OK to fork()
in a multithreaded program and then only call async-signal-safe C functions (which is a rather limited subset of C functions), until the child process has been replaced with another executable!
Unsafe C function calls in child processes are then for example
malloc
for dynamic memory allocation
- any
<stdio.h>
functions for formatted input
- most of the
pthread_*
functions required for thread state handling, including creation of new threads...
Thus there is very little what the child process can actually safely do. Unfortunately CPython core developers have been downplaying the problems caused by this. Even now the documentation says:
Note that safely forking a multithreaded process is
problematic.
Quite an euphemism for "impossible".
It is safe to use multiprocessing from a Python process that has multiple threads of control provided that you're not using the fork
start method; in Python 3.4+ it is now possible to change the start method. In previous Python versions including all of Python 2, the POSIX systems always behaved as if fork
was specified as the start method; this would result in undefined behaviour.
The problems are not limited to just threading.Lock
objects but all locks held by the C standard library, the C extensions etc. What is worse that most of the time people would say "it works for me"... until it stops from working.
There were even a cases where a seemingly single-threading Python program is actually multithreading in MacOS X, causing failures and deadlocks upon using multiprocessing.
Another problem is that all opened file handles, their use, shared sockets might behave oddly in programs that forks, but that would be the case even in single-threaded programs.
TL;DR: using multiprocessing
in multithreaded programs, with C extensions, with opened sockets etc:
- fine in 3.4+ & POSIX if you explicitly specify a starting method that is not
fork
,
- fine in Windows because it doesn't support forking;
- in Python 2 - 3.3 on POSIX: you'll mostly shoot yourself in the foot.