On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that
- looks like a list
- has O(1) lookup
- has either O(1) element replacement or O(1) element append (or prepend... I could reverse my index lookups if that were the case). I can always write my later algorithms with one or the other in mind.
- has very little memory overhead
I'm trying to build an image file parser. The file format is your basic 8-bit color PPM file, though I intend to support 16-bit color files as well as PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files that I work with: 3-10 photographs, the smallest being 45MB and the largest 110MB.

Now, I can't understand the optimizations put into the Netpbm code, so I decided to have a go at it myself. It's a simple file format...
I started by deciding that, no matter the input format, I'm going to store the final image uncompressed in this form:

import Data.Word (Word8)
import Data.Vector.Unboxed (Vector)

data PixelMap = RGB8
    { width        :: Int
    , height       :: Int
    , redChannel   :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel  :: Vector Word8
    }
Then I wrote a parser that works on three vectors like so:
{-# LANGUAGE BangPatterns, RecordWildCards #-}

import Data.Attoparsec.ByteString
import Data.Word (Word8)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as V

data Progress = Progress
    { addr   :: Int
    , size   :: Int
    , redC   :: Vector Word8
    , greenC :: Vector Word8
    , blueC  :: Vector Word8
    }
parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV   <- anyWord8
        !greenV <- anyWord8
        !blueV  <- anyWord8
        parseColorBinary progress
            { addr   = addr + 1
            , redC   = redC   V.// [(addr, redV)]
            , greenC = greenC V.// [(addr, greenV)]
            , blueC  = blueC  V.// [(addr, blueV)]
            }
And at the end of the parser, I construct the RGB8 like so:
Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC
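For illustration, a minimal driver for this parser might look as follows. The loadPPM wrapper and the zero-initialized starting vectors are stand-ins rather than my actual code, and the header parsing is elided:

import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import Data.Attoparsec.ByteString (parseOnly)

-- Hypothetical driver: allocate zeroed channels, run the parser over
-- the raw pixel bytes, and build the final PixelMap.
loadPPM :: FilePath -> Int -> Int -> IO (Either String PixelMap)
loadPPM path w h = do
    bytes <- B.readFile path
    let n     = w * h
        start = Progress 0 n (V.replicate n 0) (V.replicate n 0) (V.replicate n 0)
    return $ fmap (\p -> RGB8 w h (redC p) (greenC p) (blueC p))
                  (parseOnly (parseColorBinary start) bytes)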
Written like this, the program consumes 5GB or more of memory to load a single one of these 45MB images. If I change the definition of Progress so that redC, greenC, and blueC are all strict fields of type !(Vector Word8), then the program stays within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.
On the theory that each call to V.// is fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but I can't even get that to typecheck. The documentation is nearly nonexistent, and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know whether it will solve the problem, so I'm very reluctant to even try.
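For concreteness, here is the basic pattern I gather the mutable API wants, pieced together from the haddocks: allocate with new, update in place with write, then freeze into an immutable vector. This is only a minimal sketch (fillZeros is an illustrative name, not code from my parser):

import Control.Monad.ST (runST)
import Data.Word (Word8)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Allocate once, write in place, freeze at the end. Each write is an
-- O(1) in-place update rather than a copy of the whole vector.
fillZeros :: Int -> V.Vector Word8
fillZeros n = runST $ do
    mv <- MV.new n                              -- uninitialized buffer
    mapM_ (\i -> MV.write mv i 0) [0 .. n - 1]  -- in-place writes
    V.unsafeFreeze mv                           -- O(1); mv must not be reused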
The fundamental problem is actually pretty straightforward:
How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.
In principle, I want the final representation to be immutable, but I don't much care if I have to use a mutable structure to get there.
Just for completeness, the full code (BSD3-licensed) is on Bitbucket at https://bitbucket.org/savannidgerinel/photo-tools. The performance branch contains a strict version of the parser, which can be made lazy with a quick change to the Progress data structure in Codec.Image.Netpbm.
To run the performance test:

ulimit -Sv 6000000   # set a 6GB virtual memory limit, or whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr
I first thought that simply reading the whole chunk of bytestring and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy all three vectors in their entirety on every single pixel of the input! Talk about quadratic complexity.
So I wrote the following:
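In sketch form, it was the approach just described: split the interleaved bytes into per-channel contents and build unboxed vectors from them. (What follows is a reconstruction rather than the original snippet; parseRGB is an illustrative name, and the header logic is skipped.)

import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V

-- Read the interleaved RGB bytes into triples via intermediate
-- lists, unzip them, and convert each channel to an unboxed vector.
parseRGB :: B.ByteString -> (V.Vector Word8, V.Vector Word8, V.Vector Word8)
parseRGB input = (V.fromList rs, V.fromList gs, V.fromList bs)
  where
    n = B.length input `div` 3
    (rs, gs, bs) = unzip3
        [ ( B.index input (3*i)
          , B.index input (3*i + 1)
          , B.index input (3*i + 2) )
        | i <- [0 .. n - 1] ]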
I then tested it with a 45 MB file of random bytes. I admit I was surprised that this code resulted in gigabytes of RAM usage; I'm curious where exactly the space leak comes from.
Mutable vectors work well, though. The following code uses 133 MB of RAM, and Criterion benchmarks it at 60 ms, file reading included. I included some explanations in the comments. There is also ample material on the ST monad and mutable vectors on SO and elsewhere (though I concur that the library documentation is unfriendly to beginners).
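The sketch below shows the shape of that code (a reconstruction rather than the benchmarked original; parseImage is an illustrative name, and the PPM header is assumed to be handled elsewhere):

import Control.Monad (forM_)
import Control.Monad.ST (runST)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

parseImage :: B.ByteString -> (V.Vector Word8, V.Vector Word8, V.Vector Word8)
parseImage input = runST $ do
    let n = B.length input `div` 3
    -- Allocate each channel exactly once up front, instead of
    -- copying the whole vector on every pixel as V.// does.
    r <- MV.new n
    g <- MV.new n
    b <- MV.new n
    -- Each MV.write is a true O(1) in-place update.
    forM_ [0 .. n - 1] $ \i -> do
        MV.write r i (B.index input (3*i))
        MV.write g i (B.index input (3*i + 1))
        MV.write b i (B.index input (3*i + 2))
    -- unsafeFreeze is O(1): it reinterprets the mutable buffers as
    -- immutable vectors, so the mutable ones must not be used again.
    (,,) <$> V.unsafeFreeze r <*> V.unsafeFreeze g <*> V.unsafeFreeze b

Reading the file with B.readFile and handing the result to parseImage keeps one copy of the raw bytes plus the three channels in memory, so peak usage is roughly the raw input plus the finished image.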
Here's a version that parses the file straight from disk, without loading the whole input into memory first:
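Again as a sketch under the same caveats (hParseImage and the chunk size are assumptions, and the header is presumed already consumed so the pixel count is known):

import Control.Monad (forM_)
import Data.Word (Word8)
import System.IO (Handle)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Stream the pixel data in fixed-size chunks, writing into mutable
-- IO vectors, so only one small chunk of raw input is alive at a time.
hParseImage :: Handle -> Int -> IO (V.Vector Word8, V.Vector Word8, V.Vector Word8)
hParseImage h npix = do
    r <- MV.new npix
    g <- MV.new npix
    b <- MV.new npix
    let chunkPix = 4096                     -- pixels per read; tune freely
        go i
            | i >= npix = return ()
            | otherwise = do
                let m = min chunkPix (npix - i)
                -- For a file handle, hGet returns the full 3*m bytes
                -- unless the file is truncated.
                chunk <- B.hGet h (3 * m)
                forM_ [0 .. m - 1] $ \j -> do
                    MV.write r (i + j) (B.index chunk (3*j))
                    MV.write g (i + j) (B.index chunk (3*j + 1))
                    MV.write b (i + j) (B.index chunk (3*j + 2))
                go (i + m)
    go 0
    (,,) <$> V.unsafeFreeze r <*> V.unsafeFreeze g <*> V.unsafeFreeze b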
I tested it without the header logic on a 50 MB file and it runs in roughly the same amount of space.