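I have a large object of a type that cannot be shared between processes. It has methods to initialize it and to work on its data.
The way I do it currently is to instantiate the object in the main parent process first, then pass it to subprocesses when some event happens. The problem is that every time a subprocess runs, it copies the object in memory, which takes a while. I want to store it in memory that is available only to the subprocesses, so that they don't have to copy it each time they call one of the object's functions.
How would I store an object just for a process's own use?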
Edit: code
from multiprocessing import Process

class MultiQ:
    def __init__(self):
        self.pred = instantiate_predict()  # here I instantiate the big object

    def enq_essay(self, essay):
        p = Process(target=self.compute_results, args=(essay,))
        p.start()

    def compute_results(self, essay):
        # computation in the large object that doesn't modify the object
        predictions = self.pred.predict_fields(essay)
This copies the large object in memory every time. I want to avoid that.
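To make the goal concrete, here is a minimal sketch of the kind of setup being asked about, not a confirmed solution: a long-lived worker process builds the object once for its own use and then receives essays through a queue, so nothing large is passed per call. `FakePredictor` is a hypothetical stand-in for whatever `instantiate_predict()` returns, and the sketch assumes Python 3 on a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp

class FakePredictor(object):
    """Hypothetical stand-in for the large object from instantiate_predict()."""
    def __init__(self):
        self.table = list(range(100000))  # pretend this is expensive to build

    def predict_fields(self, essay):
        # Placeholder computation that does not modify the object.
        return len(essay.split())

def worker(tasks, results):
    pred = FakePredictor()  # built once, inside the worker process
    for essay in iter(tasks.get, None):  # None is the stop signal
        results.put(pred.predict_fields(essay))

def run(essays):
    ctx = mp.get_context("fork")  # assumption: POSIX, fork is available
    tasks, results = ctx.Queue(), ctx.Queue()
    p = ctx.Process(target=worker, args=(tasks, results))
    p.start()
    for essay in essays:
        tasks.put(essay)  # only the small essay string crosses the boundary
    tasks.put(None)
    out = [results.get() for _ in essays]  # drain results before joining
    p.join()
    return out

if __name__ == '__main__':
    print(run(["first essay text", "another one"]))
```

With a single worker the results come back in submission order. With several workers each would build its own copy of the object, which trades memory for not having to ship the object around.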
Edit 4: short code sample that runs on the 20 newsgroups data
import sklearn.feature_extraction.text as ftext
import sklearn.linear_model as lm
import multiprocessing as mp
import logging
import os
import numpy as np
import cPickle as pickle
def get_20newsgroups_fnames():
    all_files = []
    for i, (root, dirs, files) in enumerate(os.walk("/home/roman/Desktop/20_newsgroups/")):
        if i>0:
            all_files.extend([os.path.join(root,file) for file in files])
    return all_files
documents = [unicode(open(f).read(), errors="ignore") for f in get_20newsgroups_fnames()]
logger = mp.get_logger()
formatter = logging.Formatter('%(asctime)s: [%(processName)12s] %(message)s',
                              datefmt = '%H:%M:%S')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
mp._log_to_stderr = True
def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total
def predict(large_object, essay="this essay will be predicted"):
    """this method copies the large object in memory which is what im trying to avoid"""
    # transform expects an iterable of documents, so wrap the single essay in a list
    vectorized_essay = large_object[0].transform([essay])
    large_object[1].predict(vectorized_essay)
    report_memory("done")
def train_and_model():
    """this is very similar to the instantiate_predict method from my first code sample"""
    tfidf_vect = ftext.TfidfVectorizer()
    X = tfidf_vect.fit_transform(documents)
    y = np.random.random_integers(0,1,19997)
    model = lm.LogisticRegression()
    model.fit(X, y)
    return (tfidf_vect, model)
def report_memory(label):
    f = free_memory()
    logger.warn('{l:<25}: {f}'.format(f=f, l=label))
def dump_large_object(large_object):
    f = open("large_object.obj", "wb")  # protocol 2 is a binary format
    pickle.dump(large_object, f, protocol=2)
    f.close()
def load_large_object():
    f = open("large_object.obj", "rb")
    large_object = pickle.load(f)
    f.close()
    return large_object
if __name__ == '__main__':
    report_memory('Initial')
    tfidf_vect, model = train_and_model()
    report_memory('After train_and_model')
    large_object = (tfidf_vect, model)
    procs = [mp.Process(target=predict, args=(large_object,))
             for i in range(mp.cpu_count())]
    report_memory('After Process')
    for p in procs:
        p.start()
    report_memory('After p.start')
    for p in procs:
        p.join()
    report_memory('After p.join')
Output 1:
19:01:39: [ MainProcess] Initial : 26585728
19:01:51: [ MainProcess] After train_and_model : 25958924
19:01:51: [ MainProcess] After Process : 25958924
19:01:51: [ MainProcess] After p.start : 25925908
19:01:51: [ Process-1] done : 25725524
19:01:51: [ Process-2] done : 25781076
19:01:51: [ Process-4] done : 25789880
19:01:51: [ Process-3] done : 25802032
19:01:51: [ MainProcess] After p.join : 25958272
roman@ubx64:$ du -h large_object.obj
4.6M large_object.obj
So maybe the large object isn't even large, and my real problem is the memory usage of the TF-IDF vectorizer's transform method.
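The on-disk size can also be cross-checked in memory without writing a file. The following is a rough sketch under the assumption that the pickled size is a fair proxy for how much data would have to be serialized; `pickled_size` is a hypothetical helper, not part of the code above, and it is written for Python 3 (`pickle` rather than `cPickle`).

```python
import pickle

def pickled_size(obj, protocol=2):
    """Return the number of bytes obj occupies when pickled (hypothetical helper)."""
    return len(pickle.dumps(obj, protocol=protocol))

if __name__ == '__main__':
    # Stand-in data; in the script above this would be (tfidf_vect, model).
    sample = (list(range(1000)), {"a": 1})
    print(pickled_size(sample))
```

Note that the pickled size says nothing about temporary memory allocated while a method such as `transform` is running.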
Now if I change my main method to this:
report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
procs = [mp.Process(target=predict, args=(large_object,))
         for i in range(mp.cpu_count())]
report_memory('After Process')
for p in procs:
    p.start()
report_memory('After p.start')
for p in procs:
    p.join()
report_memory('After p.join')
I get these results. Output 2:
20:07:23: [ MainProcess] Initial : 26578356
20:07:23: [ MainProcess] After loading the object : 26544380
20:07:23: [ MainProcess] After Process : 26544380
20:07:23: [ MainProcess] After p.start : 26523268
20:07:24: [ Process-1] done : 26338012
20:07:24: [ Process-4] done : 26337268
20:07:24: [ Process-3] done : 26439444
20:07:24: [ Process-2] done : 26438948
20:07:24: [ MainProcess] After p.join : 26542860
Then I changed the main method to this:
report_memory('Initial')
large_object = load_large_object()
report_memory('After loading the object')
predict(large_object)
report_memory('After Process')
And got these results. Output 3:
20:13:34: [ MainProcess] Initial : 26572580
20:13:35: [ MainProcess] After loading the object : 26538356
20:13:35: [ MainProcess] done : 26513804
20:13:35: [ MainProcess] After Process : 26513804
At this point I have no idea what is going on, but the multiprocessing version definitely uses more memory.
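Because `/proc/meminfo` reports system-wide figures, the numbers above also include whatever else the machine is doing. A per-process reading might be easier to interpret. This is a minimal sketch using the standard `resource` module, assuming a Unix system; `peak_rss` and `report_peak` are hypothetical names, and `ru_maxrss` is reported in kilobytes on Linux (other platforms may use different units).

```python
import resource

def peak_rss():
    """Peak resident set size of the calling process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def report_peak(label):
    # Same layout as report_memory above, but for this process only.
    print('{l:<25}: {m}'.format(l=label, m=peak_rss()))

if __name__ == '__main__':
    report_peak('Initial')
    data = [0] * 5000000  # allocate something noticeable
    report_peak('After allocation')
```

Calling `report_peak("done")` inside `predict` would show each child's own peak rather than the change in free memory across the whole system.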