Controlling memory allocation/GC in a simulation?

2020-07-15 10:19发布

问题:

I'm having a bit of trouble figuring out how to reduce memory usage and GC time in a simulation running in the State monad. Presently I have to run the compiled code with +RTS -K100M to avoid stack space overflow, and the GC stats are pretty hideous (see below).

Here are relevant snippets of the code. Complete, working (GHC 7.4.1) code can be found at http://hpaste.org/68527.

-- Lone algebraic data type holding the simulation configuration.
data SimConfig = SimConfig {
        numDimensions :: !Int            -- strict
    ,   numWalkers    :: !Int            -- strict
    ,   simArray      :: IntMap [Double] -- strict spine
    ,   logP          :: Seq Double      -- strict spine
    ,   logL          :: Seq Double      -- strict spine
    ,   pairStream    :: [(Int, Int)]    -- lazy (infinite) list of random vals
    ,   doubleStream  :: [Double]        -- lazy (infinite) list of random vals
    } deriving Show

-- The transition kernel for the simulation.
simKernel :: State SimConfig ()
simKernel = do
    config <- get
    let arr   = simArray      config
    let n     = numWalkers    config
    let d     = numDimensions config
    let rstm0 = pairStream    config
    let rstm1 = doubleStream  config
    let lp    = logP          config
    let ll    = logL          config

    let (a, b)    = head rstm0                           -- uses random stream    
    let z0 = head . map affineTransform $ take 1 rstm1   -- uses random stream
            where affineTransform a = 0.5 * (a + 1) ^ 2


    let proposal  = zipWith (+) r1 r2
            where r1    = map (*z0)     $ fromJust (IntMap.lookup a arr)
                  r2    = map (*(1-z0)) $ fromJust (IntMap.lookup b arr)

    let logA = if val > 0 then 0 else val
            where val = logP_proposal + logL_proposal - (lp `index` (a - 1)) - (ll `index` (a - 1)) + ((fromIntegral n - 1) * log z0)
                  logP_proposal = logPrior proposal
                  logL_proposal = logLikelihood proposal

    let cVal       = (rstm1 !! 1) <= exp logA            -- uses random stream

    let newConfig = SimConfig { simArray = if   cVal
                                           then IntMap.update (\_ -> Just proposal) a arr
                                           else arr
                              , numWalkers = n
                              , numDimensions = d
                              , pairStream   = drop 1 rstm0
                              , doubleStream = drop 2 rstm1
                              , logP = if   cVal
                                       then Seq.update (a - 1) (logPrior proposal) lp
                                       else lp
                              , logL = if   cVal
                                       then Seq.update (a - 1) (logLikelihood proposal) ll
                                       else ll
                              }

    put newConfig

main = do 
    -- (some stuff omitted)
    let sim = logL $ (`execState` initConfig) . replicateM 100000 $ simKernel
    print sim

In terms of the heap, a profile seems to cue that the System.Random functions, in addition to (,), are memory culprits. I can't include an image directly, but you can see a heap profile here: http://i.imgur.com/5LKxX.png.

I have no idea how to reduce the presence of those things any further. The random variates are generated outside the State monad (to avoid splitting the generator on every iteration), and I believe the only instance of (,) inside simKernel arises when plucking a pair from the lazy list (pairStream) that is included in the simulation configuration.

The stats, including GC, are as follows:

  1,220,911,360 bytes allocated in the heap
     787,192,920 bytes copied during GC
     186,821,752 bytes maximum residency (10 sample(s))
       1,030,400 bytes maximum slop
             449 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2159 colls,     0 par    0.80s    0.81s     0.0004s    0.0283s
  Gen  1        10 colls,     0 par    0.96s    1.09s     0.1094s    0.4354s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.95s  (  0.97s elapsed)
  GC      time    1.76s  (  1.91s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.72s  (  2.88s elapsed)

  %GC     time      64.9%  (66.2% elapsed)

  Alloc rate    1,278,074,521 bytes per MUT second

  Productivity  35.1% of total user, 33.1% of total elapsed

And again, I have to bump up the maximum stack size in order to even run the simulation. I know there must be a big thunk building up somewhere.. but I can't figure out where?

How can I improve the heap/stack allocation and GC in a problem like this? How can I identify where a thunk may be building up? Is the use of the State monad here misguided?

--

UPDATE:

I neglected to look over the output of the profiler when compiling with -fprof-auto. Here is the head of that output:

COST CENTRE                       MODULE                             no.     entries  %time %alloc   %time %alloc

MAIN                              MAIN                                58           0    0.0    0.0   100.0  100.0
 main                             Main                               117           0    0.0    0.0   100.0  100.0
  main.randomList                 Main                               147           1   62.0   55.5    62.0   55.5
  main.arr                        Main                               142           1    0.0    0.0     0.0    0.0
   streamToAssocList              Main                               143           1    0.0    0.0     0.0    0.0
    streamToAssocList.go          Main                               146           5    0.0    0.0     0.0    0.0
  main.pairList                   Main                               137           1    0.0    0.0     9.5   16.5
   consPairStream                 Main                               138           1    0.7    0.9     9.5   16.5
    consPairStream.ys             Main                               140           1    4.3    7.8     4.3    7.8
    consPairStream.xs             Main                               139           1    4.5    7.8     4.5    7.8
  main.initConfig                 Main                               122           1    0.0    0.0     0.0    0.0
   logLikelihood                  Main                               163           0    0.0    0.0     0.0    0.0
   logPrior                       Main                               161           5    0.0    0.0     0.0    0.0
  main.sim                        Main                               118           1    1.0    2.2    28.6   28.1
   simKernel                      Main                               120           0    4.8    5.1    27.6   25.8 

I'm not sure how to interpret this exactly, but the lazy stream of random doubles, randomList, makes me wince. I have no idea how that could be improved.

回答1:

I've updated the hpaste with a working example. It looks like the culprits are:

  • Missing strictness annotations in three SimConfig fields: simArray, logP and logL
    data SimConfig = SimConfig {
            numDimensions :: !Int            -- strict
        ,   numWalkers    :: !Int            -- strict
        ,   simArray      :: !(IntMap [Double]) -- strict spine
        ,   logP          :: !(Seq Double)      -- strict spine
        ,   logL          :: !(Seq Double)      -- strict spine
        ,   pairStream    :: [(Int, Int)]    -- lazy
        ,   doubleStream  :: [Double]        -- lazy 
        } deriving Show
  • newConfig was never evaluated in the simKernel loop due to State being lazy. Another alternative would be to use the strict State monad instead.

    put $! newConfig
    
  • execState ... replicateM also builds thunks. I originally replaced this with a foldl' and moved the execState into the fold, but I would think swapping in replicateM_ is equivalent and easier to read:

    let sim = logL $ execState (replicateM_ epochs simKernel) initConfig
    --  sim = logL $ foldl' (const . execState simKernel) initConfig [1..epochs]
    

And a few calls to mapM .. replicate have been replaced with replicateM. Particularly noteworthy in consPairList where it reduces memory usage quite a bit. There is still room for improvement but the lowest hanging fruit involves unsafeInterleaveST... so I stopped.

I have no idea if the output results are what you want:

fromList [-4.287033457733427,-1.8000404912760795,-5.581988678626085,-0.9362372340483293,-5.267791907985331]

But here are the stats:

     268,004,448 bytes allocated in the heap
      70,753,952 bytes copied during GC
      16,014,224 bytes maximum residency (7 sample(s))
       1,372,456 bytes maximum slop
              40 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       490 colls,     0 par    0.05s    0.05s     0.0001s    0.0012s
  Gen  1         7 colls,     0 par    0.04s    0.05s     0.0076s    0.0209s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.12s  (  0.12s elapsed)
  GC      time    0.09s  (  0.10s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.21s  (  0.22s elapsed)

  %GC     time      42.2%  (45.1% elapsed)

  Alloc rate    2,241,514,569 bytes per MUT second

  Productivity  57.8% of total user, 53.7% of total elapsed