I have a short code that uses the multiprocessing
package and works fine on my local machine.
When I uploaded to AWS Lambda
and run there, I got the following error (stacktrace trimmed):
[Errno 38] Function not implemented: OSError
Traceback (most recent call last):
File "/var/task/recorder.py", line 41, in record
pool = multiprocessing.Pool(10)
File "/usr/lib64/python2.7/multiprocessing/__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 138, in __init__
self._setup_queues()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 234, in _setup_queues
self._inqueue = SimpleQueue()
File "/usr/lib64/python2.7/multiprocessing/queues.py", line 354, in __init__
self._rlock = Lock()
File "/usr/lib64/python2.7/multiprocessing/synchronize.py", line 147, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1)
File "/usr/lib64/python2.7/multiprocessing/synchronize.py", line 75, in __init__
sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 38] Function not implemented
Can it be that a part of python's core packages in not implemented? I have no idea what am I running on underneath so I can't login there and debug.
Any ideas how can I run multiprocessing
on Lambda?
As far as I can tell, multiprocessing won't work on AWS Lambda because the execution environment/container is missing /dev/shm
- see https://forums.aws.amazon.com/thread.jspa?threadID=219962 (login may be required).
No word (that I can find) on if/when Amazon will change this, so I'm looking at other libraries e.g. https://pythonhosted.org/joblib/parallel.html will fallback to /tmp
(which we know DOES exist) if it can't find /dev/shm
.
multiprocessing.Pool
does not seem to be supported natively (because of the issue with SemLock
), but multiprocessing.Process
, multiprocessing.Queue
, multiprocessing.Pipe
etc work properly in AWSLambda.
That should allow you to build a workaround solution by manually creating/forking processes and using a multiprocessing.Pipe
for communication between the parent and child processes. Hope that helps
You can run routines in parallel on AWS Lambda using Python's multiprocessing module but you can't use Pools or Queues as noted in other answers. A workable solution is to use Process and Pipe as outlined in this article https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/
While the article definitely helped me get to a solution (shared below) there are a few things to be aware of. First, the Process and Pipe based solution is not as fast as the built in map function in Pool, though I did see nearly linear speed-ups as I increased the available memory/CPU resources in my Lambda function. Second there is a fair bit of management that has to be undertaken when developing multiprocessing functions in this way. I suspect this is at least partly why my solution is slower than the built in methods. If anyone has suggestions to speed it up I'd love to hear them! Finally, while the article notes that multiprocessing is useful for offloading asynchronous processes, there are other reasons to use multiprocessing such as lots of intensive math ops which is what I was trying to do. In the end I was happy enough with the performance improvement as it was much better than sequential execution!
The code:
# Python 3.6
from multiprocessing import Pipe, Process
def myWorkFunc(data, connection):
result = None
# Do some work and store it in result
if result:
connection.send([result])
else:
connection.send([None])
def myPipedMultiProcessFunc():
# Get number of available logical cores
plimit = multiprocessing.cpu_count()
# Setup management variables
results = []
parent_conns = []
processes = []
pcount = 0
pactive = []
i = 0
for data in iterable:
# Create the pipe for parent-child process communication
parent_conn, child_conn = Pipe()
# create the process, pass data to be operated on and connection
process = Process(target=myWorkFunc, args=(data, child_conn,))
parent_conns.append(parent_conn)
process.start()
pcount += 1
if pcount == plimit: # There is not currently room for another process
# Wait until there are results in the Pipes
finishedConns = multiprocessing.connection.wait(parent_conns)
# Collect the results and remove the connection as processing
# the connection again will lead to errors
for conn in finishedConns:
results.append(conn.recv()[0])
parent_conns.remove(conn)
# Decrement pcount so we can add a new process
pcount -= 1
# Ensure all remaining active processes have their results collected
for conn in parent_conns:
results.append(conn.recv()[0])
conn.close()
# Process results as needed