Many functions in scikit-learn come with user-friendly parallelization. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM used for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as far as I know (from, for example, Python multiprocessing within mpi), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions across several compute nodes just by using mpirun and the n_jobs argument?
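To make the single-machine case concrete, the kind of usage I mean is roughly this (a minimal sketch; in recent scikit-learn versions cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation):

```python
# Minimal sketch of the single-machine case: n_jobs controls how many
# worker processes joblib spawns on the local machine.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
scores = cross_val_score(SVC(), X, y, cv=5, n_jobs=4)  # 4 local worker processes
print(scores)
```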
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems like dask.distributed or IPython Parallel. See this issue on the sklearn github page for details.

Example using Joblib with Dask.distributed
The code below follows the pattern shown on the issue page linked above.
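A minimal sketch of that pattern, assuming a dask.distributed scheduler is already running and reachable at tcp://scheduler-address:8786 (the address is a placeholder) and a joblib version that provides the "dask" backend:

```python
# Sketch: route joblib's parallelism (and therefore sklearn's) to a Dask cluster.
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

X, y = load_digits(return_X_y=True)
with parallel_backend("dask"):
    # The cross-validation fits are dispatched to the dask workers
    # instead of local multiprocessing workers.
    scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
```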
This requires that you set up a dask.distributed scheduler and workers on your cluster. General instructions are available here: http://distributed.readthedocs.io/en/latest/setup.html

Example using Joblib with ipyparallel
ipyparallel
The code below follows the pattern from the same issue page.
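A minimal sketch of the same idea with ipyparallel, assuming an IPython Parallel cluster (ipcontroller plus engines) is already running and that the joblib backend is available as ipyparallel.joblib.IPythonParallelBackend:

```python
# Sketch: register ipyparallel as a joblib backend and run sklearn inside it.
import ipyparallel as ipp
from ipyparallel.joblib import IPythonParallelBackend
from joblib import parallel_backend, register_parallel_backend
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

client = ipp.Client()  # connects to an already-running ipcontroller
register_parallel_backend(
    "ipyparallel", lambda: IPythonParallelBackend(view=client.load_balanced_view())
)

X, y = load_digits(return_X_y=True)
with parallel_backend("ipyparallel"):
    # Fits are farmed out to the IPython Parallel engines.
    scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
```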
Note: in both the above examples, the n_jobs parameter seems to not matter anymore.

Set up dask.distributed with SLURM
For SLURM, the easiest way to do this is probably to run a dask-scheduler locally, and then use SLURM to submit many dask-worker jobs pointing to that process. (I don't actually know SLURM well, so the syntax in the sketch below could be incorrect; hopefully the intention is clear.)
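A rough sketch of those two steps (the scheduler host, port, and job count are placeholders, and sbatch --wrap is just one way to launch the workers):

```bash
# On a node the workers can reach: start the scheduler (port 8786 by default).
dask-scheduler

# Submit an array of SLURM jobs, each running one dask-worker
# pointed at the scheduler's address.
sbatch --array=0-99 --wrap "dask-worker tcp://scheduler-host:8786"
```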
Use dask.distributed directly
Alternatively, you can set up a dask.distributed or IPyParallel cluster and then use these interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
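As an illustration of the "use the interface directly" idea, one common pattern is to submit many independent fits yourself with the dask.distributed Client; the parameter values and scheduler address below are made up for the example:

```python
# Sketch: bypass joblib and farm out independent model fits with dask.distributed.
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

client = Client("tcp://scheduler-address:8786")  # placeholder address

X, y = load_digits(return_X_y=True)

def evaluate(C, X, y):
    """Fit and score one parameter setting; runs on a dask worker."""
    return C, cross_val_score(SVC(C=C), X, y, cv=5).mean()

# One task per parameter setting; X and y are shipped to each task.
futures = client.map(evaluate, [0.1, 1.0, 10.0, 100.0], X=X, y=y)
results = client.gather(futures)
print(max(results, key=lambda r: r[1]))  # best (C, mean score) pair
```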
Try dklearn

You could also try the experimental dklearn package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but computationally implemented on top of Dask: https://github.com/dask/dask-learn
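Since dklearn's RandomizedSearchCV is described as API-compatible with scikit-learn, usage would presumably mirror the scikit-learn version; the import path below is a guess, so check the project's README for the real one:

```python
# Hypothetical sketch: dklearn's drop-in RandomizedSearchCV (import path is a guess).
from scipy.stats import expon
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from dklearn.grid_search import RandomizedSearchCV  # assumed module path

X, y = load_digits(return_X_y=True)
search = RandomizedSearchCV(
    SVC(), {"C": expon(scale=10), "gamma": expon(scale=0.1)}, n_iter=20
)
search.fit(X, y)  # scheduling handled by Dask rather than joblib
print(search.best_params_)
```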