蟒多处理存储数据,直到在每个处理另一个呼叫(Python Multiprocessing stori

2019-07-18 11:58发布

我有不能在进程之间共享的类型的大对象。 它有方法来初始化它,并在它的数据。

我做它目前的方式是我第一次实例化对象在主父进程,然后当一些事件发生围绕它传递给子进程。 问题是,每当子进程运行,他们每天这需要一段时间复制的对象在内存中。 我想将它存储在内存中,这只是提供给他们,使他们不必每次调用该对象的函数时复制它。

我将如何存储对象只是这个过程自己用呢?

编辑:代码

class MultiQ:
    def __init__(self):
        self.pred = instantiate_predict() #here I instantiate the big object

    def enq_essay(self,essay):
        p = Process(target=self.compute_results, args=(essay,))
        p.start()

    def compute_results(self, essay):
        predictions = self.pred.predict_fields(essay) #computation in the large object that doesn't modify the object

这个副本每次大对象在内存中。 我想避免这种情况。

编辑4:短代码示例,在20个新闻组数据运行

import sklearn.feature_extraction.text as ftext
import sklearn.linear_model as lm
import multiprocessing as mp
import logging
import os
import numpy as np
import cPickle as pickle


def get_20newsgroups_fnames():
    all_files = []
    for i, (root, dirs, files) in enumerate(os.walk("/home/roman/Desktop/20_newsgroups/")):
        if i>0:
            all_files.extend([os.path.join(root,file) for file in files])
    return all_files

documents = [unicode(open(f).read(), errors="ignore") for f in get_20newsgroups_fnames()]
logger = mp.get_logger()
formatter = logging.Formatter('%(asctime)s: [%(processName)12s] %(message)s',
                              datefmt = '%H:%M:%S')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
mp._log_to_stderr = True


def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total


def predict(large_object, essay="this essay will be predicted"):
    """this method copies the large object in memory which is what im trying to avoid"""
    vectorized_essay = large_object[0].transform(essay)
    large_object[1].predict(vectorized_essay)
    report_memory("done")


def train_and_model():
    """this is very similar to the instantiate_predict method from my first code sample"""
    tfidf_vect = ftext.TfidfVectorizer()
    X = tfidf_vect.fit_transform(documents)
    y = np.random.random_integers(0,1,19997)
    model = lm.LogisticRegression()
    model.fit(X, y)
    return (tfidf_vect, model)


def report_memory(label):
    f = free_memory()
    logger.warn('{l:<25}: {f}'.format(f=f, l=label))

def dump_large_object(large_object):
    f = open("large_object.obj", "w")
    pickle.dump(large_object, f, protocol=2)
    f.close()

def load_large_object():
    f = open("large_object.obj")
    large_object = pickle.load(f)
    f.close()
    return large_object

if __name__ == '__main__':
    report_memory('Initial')
    tfidf_vect, model = train_and_model()
    report_memory('After train_and_model')
    large_object = (tfidf_vect, model)
    procs = [mp.Process(target=predict, args=(large_object,))
             for i in range(mp.cpu_count())]
    report_memory('After Process')
    for p in procs:
        p.start()
    report_memory('After p.start')
    for p in procs:
        p.join()
    report_memory('After p.join')

输出1:

19:01:39: [ MainProcess] Initial                  : 26585728
19:01:51: [ MainProcess] After train_and_model    : 25958924
19:01:51: [ MainProcess] After Process            : 25958924
19:01:51: [ MainProcess] After p.start            : 25925908
19:01:51: [   Process-1] done                     : 25725524
19:01:51: [   Process-2] done                     : 25781076
19:01:51: [   Process-4] done                     : 25789880
19:01:51: [   Process-3] done                     : 25802032
19:01:51: [ MainProcess] After p.join             : 25958272
roman@ubx64:$ du -h large_object.obj
4.6M    large_object.obj

所以,也许大对象甚至不是很大,我的问题是,从TFIDF矢量化的变换方法的内存使用情况。

现在如果我的主要方法改成这样:

report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
procs = [mp.Process(target=predict, args=(large_object,))
         for i in range(mp.cpu_count())]
report_memory('After Process')
for p in procs:
    p.start()
report_memory('After p.start')
for p in procs:
    p.join()
report_memory('After p.join')

我得到这些结果: 输出2:

20:07:23: [ MainProcess] Initial                  : 26578356
20:07:23: [ MainProcess] After loading the object : 26544380
20:07:23: [ MainProcess] After Process            : 26544380
20:07:23: [ MainProcess] After p.start            : 26523268
20:07:24: [   Process-1] done                     : 26338012
20:07:24: [   Process-4] done                     : 26337268
20:07:24: [   Process-3] done                     : 26439444
20:07:24: [   Process-2] done                     : 26438948
20:07:24: [ MainProcess] After p.join             : 26542860

然后,我改变了主要方法如下:

report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
predict(large_object)
report_memory('After Process')

并得到了以下结果: 输出3:

20:13:34: [ MainProcess] Initial                  : 26572580
20:13:35: [ MainProcess] After loading the object : 26538356
20:13:35: [ MainProcess] done                     : 26513804
20:13:35: [ MainProcess] After Process            : 26513804

在这一点上我不知道发生了什么事情,但肯定多使用更多的内存。

Answer 1:

Linux使用写入时复制 ,这意味着当一个子进程分叉,直到值被修改在每一个全局变量子共享相同的内存地址。 只有当被修改的值是其复制。

所以理论上,如果大对象没有被修改,它可以通过子过程,而无需耗费更多的内存使用。 让我们测试这一理论。

这里是你的代码,焕然一新带着几分内存使用的日志记录:

import sklearn.feature_extraction.text as ftext
import sklearn.linear_model as lm
import multiprocessing as mp
import logging

logger = mp.get_logger()
formatter = logging.Formatter('%(asctime)s: [%(processName)12s] %(message)s',
                              datefmt='%H:%M:%S')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
mp._log_to_stderr = True


def predict(essay="this essay will be predicted"):
    """this method copies the large object in memory which is what im trying to avoid"""
    vectorized_essay = large_object[0].transform(essay)
    large_object[1].predict(vectorized_essay)
    report_memory("done")


def train_and_model():
    """this is very similar to the instantiate_predict method from my first code sample"""
    tfidf_vect = ftext.TfidfVectorizer()
    N = 100000
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?', ] * N
    y = [1, 0, 1, 0] * N
    report_memory('Before fit_transform')
    X = tfidf_vect.fit_transform(corpus)
    model = lm.LogisticRegression()
    model.fit(X, y)
    report_memory('After model.fit')
    return (tfidf_vect, model)


def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total


def gen_change_in_memory():
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2
change_in_memory = gen_change_in_memory().next

def report_memory(label):
    logger.warn('{l:<25}: {d:+d}'.format(d=change_in_memory(), l=label))

if __name__ == '__main__':
    report_memory('Initial')
    tfidf_vect, model = train_and_model()
    report_memory('After train_and_model')
    large_object = (tfidf_vect, model)
    procs = [mp.Process(target=predict) for i in range(mp.cpu_count())]
    report_memory('After Process')
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    report_memory('After p.join')

它产生:

21:45:01: [ MainProcess] Initial                  : +0
21:45:01: [ MainProcess] Before fit_transform     : +3224
21:45:12: [ MainProcess] After model.fit          : +153572
21:45:12: [ MainProcess] After train_and_model    : -3100
21:45:12: [ MainProcess] After Process            : +0
21:45:12: [   Process-1] done                     : +2232
21:45:12: [   Process-2] done                     : +2976
21:45:12: [   Process-3] done                     : +3596
21:45:12: [   Process-4] done                     : +3224
21:45:12: [ MainProcess] After p.join             : -372

所报告的数量是在空闲内存KIB的变化(包括高速缓存和缓冲液)。 因此,例如,在两者之间自由内存的变化“初始”和“train_and_model后”约为150MB。 因此, large_object需要大约150MB。

然后,4子过程完成后,内存少得多 - 大约12MB总 - 已被消耗。 消耗的内存可能是由于创设一个由所使用的子加内存的transformpredict方法。

所以看来, large_object不被复制,因为如果是我们应该看到消耗内存增加了约150MB的。


评论你的20个新闻组运行

下面是自由内存的变化:

5月20个新闻组数据:

| Initial               |       0 |
| After train_and_model |  626804 | <-- Large object requires 627M
| After Process         |       0 |
| After p.start         |   33016 |
| done                  |  200384 | 
| done                  |  -55552 |
| done                  |   -8804 |
| done                  |  -12152 |
| After p.join          | -156240 |

所以它看起来像实例化的大对象需要627MB。 我无能,为什么额外的200 + MB已在第一之后被消耗done为止。

使用load_large_object:

| Initial                  |       0 |
| After loading the object |   33976 |
| After Process            |       0 |
| After p.start            |   21112 |
| done                     |  185256 |
| done                     |     744 |
| done                     | -102176 |
| done                     |     496 |
| After p.join             | -103912 |

显然,large_object本身只需要34MB,内存的其余部分,627-34 = 593MB一定是被消耗fit_transformfit堪称方法train_and_model

使用单一的过程:

| Initial                  |     0 |
| After loading the object | 34224 |
| done                     | 24552 |
| After Process            |     0 |

这是有道理的。

所以,你已经积累的数据似乎支持这种说法,大型对象本身不是由每个子复制。 但是,一个新出现的神秘:为什么会出现一个巨大的“p.start后”和第一个“完成”之间的内存消耗。 我不知道这个问题的答案。


你可以尝试把report_memory身边召唤

vectorized_essay = large_object[0].transform(essay)

large_object[1].predict(vectorized_essay)

看到额外的内存被消耗在哪里。 我的猜测是,这些scikit学习方法之一是选择分配的内存这(相对)量巨大。



Answer 2:

最后我用用兔MQ的RPC服务器。 兔MQ教程RPC / Python的 。 所以,我创建了相当于我的机器上的CPU数量的服务器的数量。 这些服务器启动一次,一次分配的模型内存和向量化,并抓住它,而运营。 另外的优势是

  1. 一些处理的可能,如果一个被淹没轻松发送到不同的机器
  2. 如果计算一台服务器上失败,它可以很容易地发送到不同的服务器
  3. 内存分配过程是不是在原来的代码瞬间,所以总体运行时间在我的数据集从18秒下降到每查询12秒因为内存预分配。

总体来说,我的代码变得更清洁,以及。



文章来源: Python Multiprocessing storing data until further call in each process