Why does this cause a memory leak in the Haskell C

2020-08-15 01:22发布

问题:

I have a conduit pipeline processing a long file. I want to print a progress report for the user every 1000 records, so I've written this:

-- | Every n records, perform the IO action.
-- Used for progress reports to the user.
progress :: (MonadIO m) => Int -> (Int -> i -> IO ()) -> Conduit i m i
progress n act = skipN n 1
   where
      skipN c t = do
         mv <- await
         case mv of
            Nothing -> return ()
            Just v ->
               if c <= 1
                  then do
                     liftIO $ act t v
                     yield v
                     skipN n (succ t)
                  else do
                     yield v
                     skipN (pred c) (succ t)

No matter what action I call this with, it leaks memory, even if I just tell it to print a full stop.

As far as I can see the function is tail recursive and both counters are regularly forced (I tried putting "seq c" and "seq t" in, to no avail). Any clue?

If I put in an "awaitForever" that prints a report for every record then it works fine.

Update 1: This occurs only when compiled with -O2. Profiling indicates that the leaking memory is allocated in the recursive "skipN" function and being retained by "SYSTEM" (whatever that means).

Update 2: I've managed to cure it, at least in the context of my current program. I've replaced the function above with this. Note that "proc" is of type "Int -> Int -> Maybe i -> m ()": to use it you call "await" and pass it the result. For some reason swapping over the "await" and "yield" solved the problem. So now it awaits the next input before yielding the previous result.

-- | Every n records, perform the monadic action. 
-- Used for progress reports to the user.
progress :: (MonadIO m) => Int -> (Int -> i -> IO ()) -> Conduit i m i
progress n act = await >>= proc 1 n
   where
      proc c t = seq c $ seq t $ maybe (return ()) $ \v ->
         if c <= 1
            then {-# SCC "progress.then" #-} do
               liftIO $ act t v
               v1 <- await
               yield v
               proc n (succ t) v1
            else {-# SCC "progress.else" #-} do
               v1 <- await
               yield v
               proc (pred c) (succ t) v1

So if you have a memory leak in a Conduit, try swapping the yield and await actions.

回答1:

This isn't an anwser but it is some complete code I hacked up for testing. I don't know conduit at all, so it may not be the best conduit code. I've forced everything that seems like it needs to be forced, but it still leaks.

{-# LANGUAGE BangPatterns #-}

import Data.Conduit
import Data.Conduit.List
import Control.Monad.IO.Class

-- | Every n records, perform the IO action.
--   Used for progress reports to the user.
progress :: (MonadIO m) => Int -> (Int -> i -> IO ()) -> Conduit i m i
progress n act = skipN n 1
   where
      skipN !c !t = do
         mv <- await
         case mv of
            Nothing -> return ()
            Just !v ->
               if (c :: Int) <= 1
                  then do
                     liftIO $ act t v
                     yield v
                     skipN n (succ t)
                  else do
                     yield v
                     skipN (pred c) (succ t)

main :: IO ()
main = unfold (\b -> b `seq` Just (b, b+1)) 1
       $= progress 100000 (\_ b -> print b)
       $$ fold (\_ _ -> ()) ()

On the other hand,

main = unfold (\b -> b `seq` Just (b, b+1)) 1 $$ fold (\_ _ -> ()) ()

does not leak, so something in progress does indeed seem to be the problem. I can't see what.

EDIT: The leak only occurs with ghci! If I compile a binary and run it there is no leak (I should have tested this earlier ...)



回答2:

I think Tom's answer is the right one, I'm starting this as a separate answer as it will likely introduce some new discussion (and because it's too long for just a comment). In my testing, replacing the print b in Tom's example with return () gets rid of the memory leak. This made me think that the problem is in fact with print, not conduit. To test this theory, I wrote a simple helper function in C (placed in helper.c):

#include <stdio.h>

void helper(int c)
{
    printf("%d\n", c);
}

Then I foreign imported this function in the Haskell code:

foreign import ccall "helper" helper :: Int -> IO ()

and I replaced the call to print with a call to helper. The output from the program is identical, but I show no leak, and a max residency of 32kb vs 62kb (I also modified the code to stop at 10m records for better comparison).

I see similar behavior when I cut out conduit entirely, e.g.:

main :: IO ()
main = forM_ [1..10000000] $ \i ->
    when (i `mod` 100000 == 0) (helper i)

I'm not convinced, however, that this is really a bug in print or Handle. My testing never showed the leak reaching any substantial memory usage, so it could just be that a buffer is growing towards a limit. I'd have to do more research to understand this better, but I wanted to first see if this analysis meshes with what others are seeing.



回答3:

I know it's two years later, but I suspect what's happening is that full laziness is lifting part of the body the await until before the await, and this is causing a space leak. It looks similar to the case in section "Increasing Sharing" in my blog post on this very topic.