Easy way to use parallel options of scikit-learn functions on HPC

Posted 2019-01-21 02:58

Question:

Many functions in scikit-learn offer user-friendly parallelization. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as I know (from, for example, Python multiprocessing within mpi), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
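For reference, here is a minimal sketch of the single-machine case described above; the SVC model and digits dataset are illustrative choices, not part of the original question:

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()

# n_jobs=-1 uses all local cores via joblib's multiprocessing backend
scores = cross_val_score(SVC(), digits.data, digits.target, cv=10, n_jobs=-1)
print(scores.mean())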

Answer 1:

SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems, like dask.distributed or IPython Parallel. See this issue on the sklearn GitHub page for details.

Example using Joblib with Dask.distributed

Code taken from the issue page linked above.

from distributed.joblib import DistributedBackend

# it is important to import joblib from sklearn if we want the
# distributed features to work with sklearn!
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend

...  # model, param_space and digits are defined in the elided code

search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)

# make the 'distributed' backend available to joblib
register_parallel_backend('distributed', DistributedBackend)

with parallel_backend('distributed', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)

This requires that you set up a dask.distributed scheduler and workers on your cluster. General instructions are available here: http://distributed.readthedocs.io/en/latest/setup.html

Example using Joblib with ipyparallel

Code taken from the same issue page.

from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend

from sklearn.datasets import load_digits
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

digits = load_digits()

# connect to a running ipyparallel cluster and use a load-balanced view
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))

...  # 'search' is defined in the elided code, e.g. a RandomizedSearchCV as above

with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
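The Client(profile='myprofile') call above assumes an ipyparallel cluster is already running under that profile. A minimal way to start one locally (the profile name and worker count here are illustrative):

$ ipython profile create --parallel --profile=myprofile
$ ipcluster start --profile=myprofile -n 4

On a cluster you would instead configure the profile's engines to launch through your batch system; see the ipyparallel documentation for details.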

Note: in both of the examples above, the n_jobs parameter no longer seems to matter.

Set up dask.distributed with SLURM

For SLURM, the easiest way to do this is probably to run a dask-scheduler locally:

$ dask-scheduler
Scheduler running at 192.168.12.201:8786

Then use SLURM to submit many dask-worker jobs pointing to this process:

$ sbatch --array=0-200 dask-worker 192.168.12.201:8786 --nthreads 1

(I don't actually know SLURM well, so the syntax above could be incorrect; hopefully the intention is clear.)
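In practice, sbatch expects a batch script rather than a bare command, so a more realistic version might look like the sketch below; the job name, array size, and scheduler address are placeholders to adapt to your cluster:

#!/bin/bash
#SBATCH --job-name=dask-worker
#SBATCH --array=0-200
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# each array task starts one single-threaded worker
# pointing at the scheduler started above
dask-worker 192.168.12.201:8786 --nthreads 1

You would then submit it with something like $ sbatch dask_worker.sh (the file name is arbitrary).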

Use dask.distributed directly

Alternatively, you can set up a dask.distributed or IPyParallel cluster and then use these interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
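As a rough sketch of what using the interface directly can look like (this is not code from the talk; the scheduler address, model, and fold construction are illustrative assumptions):

from distributed import Client
from sklearn.cross_validation import KFold  # sklearn.model_selection in newer versions
from sklearn.datasets import load_digits
from sklearn.svm import SVC

client = Client('192.168.12.201:8786')  # address of the running dask-scheduler

digits = load_digits()

def fit_and_score(X, y, train, test):
    # train and score one cross-validation fold on a worker
    model = SVC()
    model.fit(X[train], y[train])
    return model.score(X[test], y[test])

# one task per fold; data and index arrays are shipped to the workers
futures = [client.submit(fit_and_score, digits.data, digits.target, train, test)
           for train, test in KFold(len(digits.data), n_folds=10)]
scores = client.gather(futures)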

Try dklearn

You could also try the experimental dklearn package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but implemented computationally on top of Dask:

https://github.com/dask/dask-learn

pip install git+https://github.com/dask/dask-learn
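A usage sketch, assuming the search object lives under dklearn.grid_search (the import path is my assumption; check the repository's README for the actual one), with model, param_space, and digits defined as in the earlier examples:

# hypothetical import path; see the dask-learn README
from dklearn.grid_search import RandomizedSearchCV

search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000)
search.fit(digits.data, digits.target)  # the search is scheduled and executed by Dask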