I have a complex Python object, ~36GB in memory, which I would like to share between multiple separate Python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object so that more processes can run in parallel within the available memory.
This object is used, in a sense, as a read-only database. Every process initiates multiple access requests per second, and every request is for only a small portion of the data.
I looked into solutions like Redis, but I saw that eventually the data needs to be serialized into a simple textual form. Also, memory-mapping the pickle file itself should not help, because every process would still need to deserialize it. So I thought about two other possible solutions:
- Using shared memory, where every process can access the address at which the object is stored. The problem here is that each process would only see a bulk of bytes, which cannot be interpreted as a Python object.
- Writing a service that holds this object and manages retrieval of the data through API calls. Here, I wonder how such a solution would perform in terms of speed.
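To illustrate the limitation in the first option: `multiprocessing.shared_memory` (Python 3.8+) does let every process attach to the same buffer, but the buffer is just raw bytes, so a pickled object still has to be deserialized (i.e. copied) by each process. A small sketch, with a tiny dict standing in for the 36GB object:

```python
# Illustrates option 1: shared memory exposes the same raw buffer to all
# processes, but it's only bytes -- a pickled object must still be
# deserialized (copied) per process before it is usable.
import pickle
from multiprocessing import shared_memory

payload = pickle.dumps({'a': 1, 'b': 2})          # stand-in for the 36GB object

# Creator writes the pickled bytes into a named shared-memory block.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Another process attaching by name would see only bytes:
view = shared_memory.SharedMemory(name=shm.name)
raw = bytes(view.buf[:len(payload)])
obj = pickle.loads(raw)   # the full deserialization cost is paid again here

view.close()
shm.close()
shm.unlink()
```

So sharing the pickled bytes avoids re-reading the file from disk, but not the per-process unpickling and memory cost.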
Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?
Many thanks!
For complex objects there isn't a readily available method to share memory directly between processes. If you have simple `ctypes` objects you can use a C-style shared memory, but it won't map directly to Python objects.

There is a simple solution that works well if you only need a portion of your data at any one time, not the entire 36GB: use a `SyncManager` from `multiprocessing.managers`. Using this, you set up a server that serves a proxy class for your data (your data isn't stored in the class; the proxy only provides access to it). Your client then attaches to the server using a `BaseManager` and calls methods on the proxy class to retrieve the data.

Behind the scenes, the `Manager` classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call, this isn't efficient if you need your entire dataset. But in the case where the client only needs a small portion of the data, the method saves a lot of time, since the data only needs to be loaded once, by the server.

Speed-wise, this solution is comparable to a database, but it can save you a lot of complexity and DB learning if you'd prefer to keep to a purely Pythonic solution.
Here's some example code that is meant to work with GloVe word vectors.
Server
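A minimal sketch of what such a server could look like. The class names, the pickle path `glove.pkl`, port `50000`, and the authkey are all illustrative assumptions, and the `psutil` check mentioned in the note below is omitted for brevity:

```python
# Server: loads the large pickled object once, then serves lookups over a
# socket via multiprocessing.managers. Names, port and authkey are
# illustrative assumptions, not the original post's exact code.
import pickle
from multiprocessing.managers import BaseManager

class GloVeProxy:
    """Wraps the big object; clients call methods, the data never moves."""
    def __init__(self, path):
        with open(path, 'rb') as f:
            self._vectors = pickle.load(f)   # the ~36GB object, loaded once

    def get_vector(self, word):
        # Only this small return value is pickled and sent to the client.
        return self._vectors.get(word)

class GloVeManager(BaseManager):
    pass

def serve(path, port=50000, authkey=b'glove'):
    proxy = GloVeProxy(path)
    # Every client request for 'get_proxy' returns the same shared instance.
    GloVeManager.register('get_proxy', callable=lambda: proxy)
    manager = GloVeManager(address=('', port), authkey=authkey)
    server = manager.get_server()
    server.serve_forever()   # blocks; run this once, before any clients

# Run once, e.g. at the bottom of GloVeServer.py:
#   serve('glove.pkl')
```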
Client
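And a matching client sketch. The host, port, authkey, typeid `get_proxy`, and the `get_vector` method are assumptions that must match whatever the server actually registers:

```python
# Client: attaches to the running server and fetches only the small
# pieces of data it needs per request. Names, port and authkey must
# match the server's; they are illustrative assumptions here.
from multiprocessing.managers import BaseManager

class GloVeManager(BaseManager):
    pass

# Register the same typeid the server exposes; no callable is needed
# on the client side.
GloVeManager.register('get_proxy')

def lookup(word, host='localhost', port=50000, authkey=b'glove'):
    """Connect, grab the proxy, and perform one round-trip lookup."""
    manager = GloVeManager(address=(host, port), authkey=authkey)
    manager.connect()
    proxy = manager.get_proxy()      # an AutoProxy; the data stays server-side
    return proxy.get_vector(word)    # only the result is pickled back
```

Each `lookup` call pickles only the returned value, so individual requests stay cheap, while the 36GB structure is deserialized exactly once, in the server process.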
Note that the `psutil` library is only used to check whether the server is already running; it's not required. Be sure to name the server script `GloVeServer.py`, or change the `psutil` check in the code so that it looks for the correct name.