Using Clojure, I have a very large amount of data in a sequence and I want to process it in parallel with a relatively small number of cores (4 to 8).
The easiest thing to do is to use pmap instead of map to apply my processing function across the sequence of data. But in my case the coordination overhead results in a net loss.
I think the reason is that pmap assumes the function mapped across the data is very costly. Looking at pmap's source, it appears to construct a future for each element of the sequence in turn, so each invocation of the function occurs on a separate thread (cycling over the number of available cores).
Here is the relevant piece of pmap's source:
(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ;; multi-collection form of pmap elided
In my case the mapped function is not that expensive, but the sequence is huge (millions of records). I suspect the cost of creating and dereferencing that many futures is where the parallel gain is lost to overhead.
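As a rough illustration (a minimal sketch with a deliberately cheap function, not my real workload), a comparison like this shows the effect I'm describing:

(time (dorun (map inc (range 1000000))))   ; sequential baseline
(time (dorun (pmap inc (range 1000000))))  ; one future per element; often slower here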
Is my understanding of pmap correct?
Is there a better pattern in Clojure than pmap for this sort of lower-cost but massively repeated processing? I am considering chunking the data sequence somehow and then running threads on larger chunks. Is this a reasonable approach, and what Clojure idioms would work?
Sadly not a valid answer yet, but something to watch for in the future is Rich's work with the fork/join library coming in Java 7. If you look at his Par branch on GitHub, he's done some work with it, and the last I saw the early returns were amazing.
Here is an example of Rich trying it out: http://paste.lisp.org/display/84027
You can use some sort of map/reduce implemented by hand. Also take a look at the swarmiji framework:
"A distributed computing system that helps writing and running Clojure code in parallel - across cores and processors"
The fork/join work mentioned in earlier answers on this and similar threads eventually bore fruit as the reducers library, which is probably worth a look.
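For example, a minimal sketch (assuming the data fits in a vector, which is what fold needs in order to partition the work, and where process and data stand in for your real function and collection):

(require '[clojure.core.reducers :as r])

;; r/fold splits the vector into groups (512 elements by default) and
;; reduces them in parallel on a fork/join pool; r/map fuses with the
;; fold, so no intermediate sequence is built.
(r/fold + (r/map process (vec data)))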
This question, how-to-efficiently-apply-a-medium-weight-function-in-parallel, also addresses the same problem in a very similar context.
The current best answer is to use partition to break the data into chunks, then pmap a map of your function onto each chunk, and finally recombine the results, map-reduce style, as in the sketch below.
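A minimal sketch of that idiom, using partition-all (a variant of partition that keeps the final undersized chunk rather than dropping it) and a hypothetical per-record function process:

(defn chunked-pmap [f chunk-size coll]
  (->> coll
       (partition-all chunk-size)                 ; lazily split into chunks
       (pmap (fn [chunk] (doall (map f chunk))))  ; one future per chunk
       (apply concat)))                           ; flatten back into one seq

;; Amortize the coordination cost over, say, 2048 records per future:
;; (chunked-pmap process 2048 data)

The doall is important: it forces the inner map to be realized inside the future's thread rather than lazily on the consuming thread, which would defeat the parallelism.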