Parallel Haskell - GHC GC'ing sparks

Published 2020-07-10 03:36

Question:

I have a program I'm trying to parallelize (full paste with runnable code here).

I've profiled and found that the majority of the time is spent in findNearest, which is essentially a strict fold (foldrWithKey') over a large Data.Map.

findNearest :: RGB -> M.Map k RGB -> (k, Word32)
findNearest rgb m0 =
    M.foldrWithKey' minDistance (k0, distance rgb r0) m0
    where (k0, r0) = M.findMin m0
          minDistance k r x@(_, d1) =
            -- Euclidean distance in RGB-space
            let d0 = distance rgb r
            in if d0 < d1 then (k, d0) else x
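For context, here is a self-contained, runnable sketch of findNearest. The question does not show RGB or distance, so the definitions below (a Word32 triple and squared Euclidean distance) are assumptions for illustration only:

```haskell
import qualified Data.Map as M
import Data.Word (Word32)

-- Assumed stand-in: the question does not show the RGB type.
type RGB = (Word32, Word32, Word32)

-- Assumed stand-in: squared Euclidean distance. Skipping the sqrt
-- preserves the ordering and keeps everything in Word32.
distance :: RGB -> RGB -> Word32
distance (r1, g1, b1) (r2, g2, b2) =
    sq r1 r2 + sq g1 g2 + sq b1 b2
  where sq a b = (a - b) * (a - b)  -- wraparound is harmless for channels < 256

-- Verbatim from the question.
findNearest :: RGB -> M.Map k RGB -> (k, Word32)
findNearest rgb m0 =
    M.foldrWithKey' minDistance (k0, distance rgb r0) m0
    where (k0, r0) = M.findMin m0
          minDistance k r x@(_, d1) =
            let d0 = distance rgb r
            in if d0 < d1 then (k, d0) else x

main :: IO ()
main = do
  let palette = M.fromList
        [ ("blue",  (0, 0, 255))
        , ("green", (0, 255, 0))
        , ("red",   (255, 0, 0))
        ] :: M.Map String RGB
  print (findNearest (200, 30, 30) palette)  -- prints ("red",4825)
```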

parFindNearest is supposed to execute findNearest in parallel over subtrees of the larger Map.

parFindNearest :: NFData k => RGB -> M.Map k RGB -> (k, Word32)
parFindNearest rgb = minimumBy (comparing snd)
                   . parMap rdeepseq (findNearest rgb)
                   . M.splitRoot

Unfortunately, GHC GCs most of my sparks before they are converted into useful parallelism.

Here's the result of compiling with ghc -O2 -threaded and running with +RTS -s -N2

 839,892,616 bytes allocated in the heap
 123,999,464 bytes copied during GC
   5,320,184 bytes maximum residency (19 sample(s))
   3,214,200 bytes maximum slop
          16 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1550 colls,  1550 par    0.23s    0.11s     0.0001s    0.0004s
  Gen  1        19 colls,    18 par    0.11s    0.06s     0.0030s    0.0052s

  Parallel GC work balance: 16.48% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 215623 (1318 converted, 0 overflowed, 0 dud, 198111 GC'd, 16194 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.72s  (  3.66s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.07s  (  3.84s elapsed)

  Alloc rate    225,726,318 bytes per MUT second

  Productivity  91.6% of total user, 97.1% of total elapsed

gc_alloc_block_sync: 9862
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2103

As you can see, the majority of sparks are GC'd or fizzle before being converted. I've tried experimenting with different strictness annotations, having findNearest return a custom strict pair data type instead of a tuple, and using rdeepseq from Control.Parallel.Strategies, but my sparks are still GC'd.
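For reference, the strict-pair idea mentioned above might look like the following sketch (the P type and minIndex are hypothetical names, not from the original program). The bang patterns force both fields on construction, so the fold's accumulator never builds up thunks:

```haskell
import Data.List (foldl')

-- Hypothetical strict pair: both fields are forced on construction.
data P = P !Int !Int
  deriving (Show, Eq)

-- Index and value of the minimum element, with a fully strict accumulator.
minIndex :: [Int] -> P
minIndex (x : xs) = foldl' step (P 0 x) (zip [1 ..] xs)
  where
    step acc@(P _ best) (i, v)
      | v < best  = P i v
      | otherwise = acc
minIndex [] = error "minIndex: empty list"

main :: IO ()
main = print (minIndex [5, 3, 8, 1, 7])  -- prints P 3 1
```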

I'd like to know

  • why are my sparks being GC'd before being converted?
  • how can I change my program to take advantage of parallelism?

Answer 1:

I'm not an expert in parallel strategies, so I may be completely wrong. But:

If you effectively disable GC by setting a large enough allocation area (e.g. with the -A20M runtime option), you'll see that most of the sparks are fizzled, not GC'd. That means they are evaluated by the ordinary program flow before the corresponding spark finishes.

minimumBy forces the parMap results immediately and starts evaluating them. Meanwhile, the sparks are scheduled and executed, but by then it is too late: when a spark finishes, the value has already been evaluated by the main thread. Without -A20M, the sparks are GC'd because the value is evaluated (and then collected) even before the spark is scheduled.

Here is a simplified test case:

import Control.Parallel.Strategies

f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)

main :: IO ()
main = do
  let l = [n..n+10]
      n = 1
      res = parMap rdeepseq f l
  print res

In that case all the sparks are fizzled:

 SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)

(Sometimes they are GC'd.)

But if I yield the main thread before printing the results,

import Control.Parallel.Strategies
import Control.Concurrent

f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)

main :: IO ()
main = do
  let l = [n..n+10]
      n = 1
      res = parMap rdeepseq f l
  res `seq` threadDelay 1
  print res

Then all the sparks are converted:

SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

So it looks like you don't have enough sparks (try l = [n..n+1000] in my example), and they are not heavy enough (try n = 1000 in my example).
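Applying that advice back to the original program, one way to get fewer but heavier sparks than M.splitRoot provides is to chunk the map's elements explicitly and spark one sequential fold per chunk. This is a sketch, not the asker's code: RGB, distance, parFindNearestChunked, and the chunk size of 4096 are all assumptions.

```haskell
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Data.List (minimumBy)
import Data.Ord (comparing)
import qualified Data.Map as M
import Data.Word (Word32)

-- Assumed stand-ins for the question's RGB and distance.
type RGB = (Word32, Word32, Word32)

distance :: RGB -> RGB -> Word32
distance (r1, g1, b1) (r2, g2, b2) = sq r1 r2 + sq g1 g2 + sq b1 b2
  where sq a b = (a - b) * (a - b)

-- One spark per chunk of 4096 entries (the chunk size is a tuning
-- knob): each spark performs a substantial sequential fold, and the
-- final minimumBy only combines one result per chunk.
parFindNearestChunked :: NFData k => RGB -> M.Map k RGB -> (k, Word32)
parFindNearestChunked rgb m =
    minimumBy (comparing snd)
      (map best chunks `using` parList rdeepseq)
  where
    chunks = chunksOf 4096 (M.toList m)
    best = minimumBy (comparing snd)
         . map (\(k, r) -> (k, distance rgb r))
    chunksOf _ [] = []
    chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

main :: IO ()
main = do
  let m = M.fromList
        [ (i, (fromIntegral i `mod` 256, 0, 0)) | i <- [1 .. 10000 :: Int] ]
  print (parFindNearestChunked (10, 0, 0) m)  -- prints (10,0)
```

As with the threadDelay experiment above, compile with -threaded and run with +RTS -N2 -s to see how many sparks convert.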