ipython notebook: how to parallelize external scripts


Question:

I'm trying to use parallel computing with the IPython parallel library. But I know very little about it, and I find the docs difficult to read for someone who knows nothing about parallel computing.

Funnily enough, all the tutorials I found just re-use the example from the docs, with the same explanation, which, from my point of view, is useless.

Basically, what I'd like to do is run a few scripts in the background so they execute at the same time. In bash it would be something like:

for my_file in $(cat list_file); do
    python pgm.py "$my_file" &
done

But the bash interpreter of the IPython notebook doesn't handle background mode.

It seems the solution is to use the parallel library from IPython.

I tried:

from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines

But then I'm stuck. I don't know how to run the same script or program twice (or more) at the same time.

Thanks.

Answer 1:

One year later, I eventually managed to get what I wanted.

1) Create a function that does what you want to run on each CPU. Here it just calls a script from bash with the ! IPython magic command. It would presumably also work with subprocess's call() function; a sketch of that variant follows the notes below.

def my_func(my_file):
    !python pgm.py {my_file}

Don't forget the {} when using !

Note also that the path to my_file should be absolute, since the engines run in the directory where you started the notebook (with jupyter notebook or ipython notebook), which is not necessarily the directory you are working in.
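
For what it's worth, here is a sketch of the call()-based variant mentioned in step 1; it assumes, as in the question, that pgm.py takes the file path as its only argument:

def my_func(my_file):
    # Sketch of the call()-based variant; the import is inside the function so
    # it is also available on the engines (see the %px note further down).
    import subprocess
    subprocess.call(['python', 'pgm.py', my_file])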

2) Start your IPython notebook cluster with the number of CPUs you want. Wait a couple of seconds, then execute the following cell:

from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
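
A quick sanity check that is not part of the original steps: rc.ids lists one id per running engine, so an empty list means the cluster has not started yet.

print(rc.ids)  # e.g. [0, 1, 2, 3]; an empty list means no engines are running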

3) Get a list of the files you want to process:

files = list_of_files
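
For instance, the list could be built with glob; the pattern below is just an illustration (absolute paths, as recommended above), not something from the original post:

import glob

files = sorted(glob.glob('/home/me/data/*.txt'))  # hypothetical pattern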

4) Asynchronously map your function over all your files on the view of the engines you just created (not sure of the wording).

r = view.map_async(my_func, files)

While it's running you can do something else in the notebook (it runs in the "background"!). You can also call r.wait_interactive(), which interactively reports the number of files processed, the time spent so far, and the number of files left. This will prevent you from running other cells (but you can interrupt it).

And if you have more files than engines, no worries: each remaining file is processed as soon as an engine finishes with its current one.
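
If you do want to block until everything is done and collect the return values, something along these lines works (r.get() returns one entry per file, in input order; here each entry is just None because my_func returns nothing):

r.wait_interactive()   # live progress: files done, time elapsed, files left
results = r.get()      # one entry per file, in input order (all None here)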

Hope this helps others!

This tutorial might be of some help:

http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb

Note also that I still have IPython 2.3.1; I don't know whether things have changed with Jupyter.

Edit: It still works with Jupyter; see here for differences and potential issues you may encounter.


Note that if you use external libraries in your function, you need to import them on the different engines with:

%px import numpy as np

or

%%px
import numpy as np
import pandas as pd

The same goes for variables and other functions: you need to push them to the engines' namespace:

rc[:].push(dict(
                foo=foo,
                bar=bar))
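
As a minimal sketch of why the push matters (the names square and work below are hypothetical, not from the original answer): a mapped function can only use names that already exist on the engines, so helpers have to be pushed first.

def square(x):
    return x * x

def work(x):
    return square(x)   # looked up in the engine's namespace, hence the push

rc[:].push(dict(square=square))
r = view.map_async(work, range(8))
print(r.get())         # [0, 1, 4, 9, 16, 25, 36, 49]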



Answer 2:

If you're trying to execute some external scripts in parallel, you don't need IPython's parallel functionality. Bash's parallel execution can be replicated with the subprocess module as follows:

import subprocess

# Launch all the processes first so they run concurrently...
procs = []
for i in range(10):
    procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'], stdout=subprocess.PIPE))

# ...then collect the output of each one.
results = []
for proc in procs:
    stdout, _ = proc.communicate()
    results.append(stdout)

Be aware that if a subprocess generates a lot of output, it can block once the pipe buffer fills up, until its output is read. If you print the output (results) you get:

print results

['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n']
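
To tie this back to the original question, the same pattern with the asker's pgm.py would look roughly like this (the file list is hypothetical):

import subprocess

files = ['/abs/path/a.txt', '/abs/path/b.txt']   # hypothetical list of files
procs = [subprocess.Popen(['python', 'pgm.py', f]) for f in files]
for proc in procs:
    proc.wait()   # block until every run has finished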