How do we choose `--nthreads` and `--nprocs` per worker in Dask distributed? I have 3 workers: two with 4 cores and one thread per core, and one with 8 cores (according to the output of the `lscpu` Linux command on each worker).
It depends on your workload.

By default Dask creates a single process with as many threads as you have logical cores on your machine (as determined by `multiprocessing.cpu_count()`). Using few processes and many threads per process is good if you are doing mostly numeric workloads, such as are common in NumPy, Pandas, and Scikit-Learn code, which are not affected by Python's Global Interpreter Lock (GIL).
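For a numeric workload on your cluster, a reasonable starting point is one process per worker with one thread per core. A minimal sketch, assuming a hypothetical scheduler address of `tcp://scheduler-host:8786` and the classic `dask-worker` CLI (newer Dask releases rename `--nprocs` to `--nworkers`):

```sh
# On each of the two 4-core workers: one process, four threads
dask-worker tcp://scheduler-host:8786 --nprocs 1 --nthreads 4

# On the 8-core worker: one process, eight threads
dask-worker tcp://scheduler-host:8786 --nprocs 1 --nthreads 8
```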
However, if you are spending most of your compute time manipulating pure Python objects like strings or dictionaries, then you may want to avoid GIL issues by having more processes with fewer threads each.
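For GIL-bound work, the same machines could instead run one single-threaded process per core. Again a sketch against the same assumed scheduler address:

```sh
# On the two 4-core workers: four processes, one thread each
dask-worker tcp://scheduler-host:8786 --nprocs 4 --nthreads 1

# On the 8-core worker: eight processes, one thread each
dask-worker tcp://scheduler-host:8786 --nprocs 8 --nthreads 1
```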
Based on benchmarking you may find that a more balanced split is better.

Using more processes avoids GIL issues, but adds costs due to inter-process communication. You would want to avoid many processes if your computations require a lot of inter-worker communication.
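A balanced middle ground keeps processes × threads equal to the core count on each machine. One possible sketch, not a recommendation for any specific workload:

```sh
# On the two 4-core workers: two processes with two threads each
dask-worker tcp://scheduler-host:8786 --nprocs 2 --nthreads 2

# On the 8-core worker: two processes with four threads each
dask-worker tcp://scheduler-host:8786 --nprocs 2 --nthreads 4
```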