How do I parse a large data block into memory in H

2019-04-06 00:43发布

问题:

On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that

  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend... I could reverse my index lookups if that were the case). I can always write my later algorithms with one or the other in mind.
  • has very little memory overhead

I'm trying to build an image file parser. The file format is your basic 8-bit color ppm file, though I intend to support 16-bit color files and PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files that I work with:

3-10 photographs, the smallest being 45MB and the largest being 110MB.

Now, I can't understand the optimizations put into the Netpbm code, so I decided to have my own try at it. It's a simple file format...

I have started by deciding that no matter what the file format, I'm going to store the final image uncompressed in this format:

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

Then I wrote a parser that works on three vectors like so:

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

And at the end of the parser, I construct the RGB8 like so:

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

Written like this, the program, loading a single one of these 45MB images, will consume 5GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all !(Vector Word8), then the program remains within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.

On the theory that each call to V.// is actually fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but... I can't even get that to typecheck. The documentation is nonexistent and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know if it will solve the problems, so I'm very reluctant to even try.

The fundamental problem is actually pretty straightforward:

How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.

In principal, I want the final representation to be immutable, but I don't too much care if I have to use a mutable structure to get there.


Just for completeness, the complete code (BSD3-licensed) is on bitbucket in https://bitbucket.org/savannidgerinel/photo-tools . The performance branch contains a strict version of the parser, which can be made unstrict with a quick change in the Progress data structure of Codec.Image.Netpbm.

To run the performance test

ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr

回答1:

I first thought that just simply reading the whole chunk of bytestring and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy the entirety of all three vectors on every single byte of the input! Talk about quadratic complexity.

So I wrote the following:

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)

And then tested it with a 45 Mb file of random bytes. I admit I was surprised that this code resulted in gigabytes of RAM usage. I'm curious as to where exactly the space leaks.

Mutable vectors work well though. The following code uses 133 Mb RAM and Criterion benchmarks it to 60 ms file reading included. I included some explanations in the comments. There is also ample material on the ST monad and mutable vectors on SO and elsewhere (I concur though that the library documentations are unfriendly to beginners).

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]


回答2:

Here's a version which parses the file straight from the disk without loading any intermediate into memory:

import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

I tested it without the header logic on a 50 MB file and it runs in roughly the same amount of space.