I have a complex python object, of size ~36GB in memory, which I would like to share between multiple separate python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object to enable execution of more processes in parallel, under the amount of memory available.
This object is used, in a sense, as a read-only database. Every process initiates multiple access requests per second, and every request is just for a small portion of the data.
I looked into solutions like Radis, but I saw that eventually, the data needs to be serialized into a simple textual form. Also, mapping the pickle file itself to memory should not help because it will need to be extracted by every process. So I thought about two other possible solutions:
- Using a shared memory, where every process can access the address in which the object is stored. The problem here is that the process will only see a bulk of bytes, which cannot be interpreted
- Writing a code that holds this object and manages retrieval of data, through API calls. Here, I wonder about the performance of such solution in terms of speed.
Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?
Many thanks!
For complex objects there isn't readily available method to directly share memory between processes. If you have simple ctypes
you can do this in a c-style shared memory but it won't map directly to python objects.
There is a simple solution that works well if you only need a portion of your data at any one time, not the entire 36GB. For this you can use a SyncManager
from multiprocessing.managers
. Using this, you setup a server that serves up a proxy class for your data (your data isn't stored in the class, the proxy only provides access to it). Your client then attaches to the server using a BaseManager
and calls methods in the proxy class to retrieve the data.
Behind the scenes the Manager
classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call this isn't efficient if you need your entire dataset. In the case where you only need a small portion of the data in the client, the method saves a lot of time since the data only needs to be loaded once by the server.
The solution is comparable to a database solution speed-wise but it can save you a lot of complexity and DB-learning if you'd prefer to keep to a purely pythonic solution.
Here's some example code that is meant to work with GloVe word vectors.
Server
#!/usr/bin/python
import sys
from multiprocessing.managers import SyncManager
import numpy
# Global for storing the data to be served
gVectors = {}
# Proxy class to be shared with different processes
# Don't but the big vector data in here since that will force it to
# be piped to the other process when instantiated there, instead just
# return the global vector data, from this process, when requested.
class GloVeProxy(object):
def __init__(self):
pass
def getNVectors(self):
global gVectors
return len(gVectors)
def getEmpty(self):
global gVectors
return numpy.zeros_like(gVectors.values()[0])
def getVector(self, word, default=None):
global gVectors
return gVectors.get(word, default)
# Class to encapsulate the server functionality
class GloVeServer(object):
def __init__(self, port, fname):
self.port = port
self.load(fname)
# Load the vectors into gVectors (global)
@staticmethod
def load(filename):
global gVectors
f = open(filename, 'r')
for line in f:
vals = line.rstrip().split(' ')
gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')
# Run the server
def run(self):
class myManager(SyncManager): pass
myManager.register('GloVeProxy', GloVeProxy)
mgr = myManager(address=('', self.port), authkey='GloVeProxy01')
server = mgr.get_server()
server.serve_forever()
if __name__ == '__main__':
port = 5010
fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'
print 'Loading vector data'
gs = GloVeServer(port, fname)
print 'Serving data. Press <ctrl>-c to stop.'
gs.run()
Client
from multiprocessing.managers import BaseManager
import psutil #3rd party module for process info (not strictly required)
# Grab the shared proxy class. All methods in that class will be availble here
class GloVeClient(object):
def __init__(self, port):
assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'
class myManager(BaseManager): pass
myManager.register('GloVeProxy')
self.mgr = myManager(address=('localhost', port), authkey='GloVeProxy01')
self.mgr.connect()
self.glove = self.mgr.GloVeProxy()
# Return the instance of the proxy class
@staticmethod
def getGloVe(port):
return GloVeClient(port).glove
# Verify the server is running
@staticmethod
def _checkForProcess(name):
for proc in psutil.process_iter():
if proc.name() == name:
return True
return False
if __name__ == '__main__':
port = 5010
glove = GloVeClient.getGloVe(port)
for word in ['test', 'cat', '123456']:
print('%s = %s' % (word, glove.getVector(word)))
Note that the psutil
library is just used to check to see if you have the server running, it's not required. Be sure to name the server GloVeServer.py
or change the check by psutil
in the code so it looks for the correct name.