Sharing a complex python object in memory between

2019-05-22 07:48发布

问题:

I have a complex python object, of size ~36GB in memory, which I would like to share between multiple separate python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object to enable execution of more processes in parallel, under the amount of memory available.

This object is used, in a sense, as a read-only database. Every process initiates multiple access requests per second, and every request is just for a small portion of the data.

I looked into solutions like Radis, but I saw that eventually, the data needs to be serialized into a simple textual form. Also, mapping the pickle file itself to memory should not help because it will need to be extracted by every process. So I thought about two other possible solutions:

  1. Using a shared memory, where every process can access the address in which the object is stored. The problem here is that the process will only see a bulk of bytes, which cannot be interpreted
  2. Writing a code that holds this object and manages retrieval of data, through API calls. Here, I wonder about the performance of such solution in terms of speed.

Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?

Many thanks!

回答1:

For complex objects there isn't readily available method to directly share memory between processes. If you have simple ctypes you can do this in a c-style shared memory but it won't map directly to python objects.

There is a simple solution that works well if you only need a portion of your data at any one time, not the entire 36GB. For this you can use a SyncManager from multiprocessing.managers. Using this, you setup a server that serves up a proxy class for your data (your data isn't stored in the class, the proxy only provides access to it). Your client then attaches to the server using a BaseManager and calls methods in the proxy class to retrieve the data.

Behind the scenes the Manager classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call this isn't efficient if you need your entire dataset. In the case where you only need a small portion of the data in the client, the method saves a lot of time since the data only needs to be loaded once by the server.

The solution is comparable to a database solution speed-wise but it can save you a lot of complexity and DB-learning if you'd prefer to keep to a purely pythonic solution.

Here's some example code that is meant to work with GloVe word vectors.

Server

#!/usr/bin/python
import  sys
from    multiprocessing.managers import SyncManager
import  numpy

# Global for storing the data to be served
gVectors = {}

# Proxy class to be shared with different processes
# Don't but the big vector data in here since that will force it to 
# be piped to the other process when instantiated there, instead just
# return the global vector data, from this process, when requested.
class GloVeProxy(object):
    def __init__(self):
        pass

    def getNVectors(self):
        global gVectors
        return len(gVectors)

    def getEmpty(self):
        global gVectors
        return numpy.zeros_like(gVectors.values()[0])

    def getVector(self, word, default=None):
        global gVectors
        return gVectors.get(word, default)

# Class to encapsulate the server functionality
class GloVeServer(object):
    def __init__(self, port, fname):
        self.port = port
        self.load(fname)

    # Load the vectors into gVectors (global)
    @staticmethod
    def load(filename):
        global gVectors
        f = open(filename, 'r')
        for line in f:
            vals = line.rstrip().split(' ')
            gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')

    # Run the server
    def run(self):
        class myManager(SyncManager): pass  
        myManager.register('GloVeProxy', GloVeProxy)
        mgr = myManager(address=('', self.port), authkey='GloVeProxy01')
        server = mgr.get_server()
        server.serve_forever()

if __name__ == '__main__':
    port  = 5010
    fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'

    print 'Loading vector data'
    gs = GloVeServer(port, fname)

    print 'Serving data. Press <ctrl>-c to stop.'
    gs.run()

Client

from   multiprocessing.managers import BaseManager
import psutil   #3rd party module for process info (not strictly required)

# Grab the shared proxy class.  All methods in that class will be availble here
class GloVeClient(object):
    def __init__(self, port):
        assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'
        class myManager(BaseManager): pass
        myManager.register('GloVeProxy')
        self.mgr = myManager(address=('localhost', port), authkey='GloVeProxy01')
        self.mgr.connect()
        self.glove = self.mgr.GloVeProxy()

    # Return the instance of the proxy class
    @staticmethod
    def getGloVe(port):
        return GloVeClient(port).glove

    # Verify the server is running
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            if proc.name() == name:
                return True
        return False

if __name__ == '__main__':
    port = 5010
    glove = GloVeClient.getGloVe(port)

    for word in ['test', 'cat', '123456']:
        print('%s = %s' % (word, glove.getVector(word)))

Note that the psutil library is just used to check to see if you have the server running, it's not required. Be sure to name the server GloVeServer.py or change the check by psutil in the code so it looks for the correct name.