I would like a complete, working example that does the following in Haskell:
Read a very large sequence (more than 1 billion elements) of 32-bit
int values from a binary file into an appropriate container (certainly
not a list, for performance reasons), double each number that is less
than 1000 (decimal), and write the resulting 32-bit int values to
another binary file. I do not want to read the entire contents of the
binary file into memory at once; I want to read it one chunk after
another.
I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and so on just add to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls, and be done with it. In Haskell it doesn't seem so easy, at least to me.
I'd appreciate it if your solution used the best standard packages available with mainstream Haskell (> GHC 7.10) rather than obscure or obsolete ones.
I read from these pages
https://wiki.haskell.org/Binary_IO
https://wiki.haskell.org/Dealing_with_binary_data
If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)
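As a minimal sketch of the input/output part only, the following copies a file chunk by chunk with the strict `Data.ByteString` versions of `hGet` and `hPut`. The names `copyChunks` and `copyFileChunked` and the 4096-byte chunk size are made up for illustration. It relies on `hGet` returning an empty ByteString once the end of the file has been reached.

```haskell
import qualified Data.ByteString as BS
import System.IO

-- Copy everything from one handle to another in chunks of at most
-- chunkBytes bytes (assumed to be positive). BS.hGet returns a short
-- chunk at end of file and an empty one when nothing is left.
copyChunks :: Int -> Handle -> Handle -> IO ()
copyChunks chunkBytes hIn hOut = go
  where
    go = do
      chunk <- BS.hGet hIn chunkBytes
      if BS.null chunk
        then return ()
        else do
          BS.hPut hOut chunk
          go

-- Hypothetical convenience wrapper: open both files in binary mode
-- and copy with an arbitrary 4096-byte chunk size.
copyFileChunked :: FilePath -> FilePath -> IO ()
copyFileChunked src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      copyChunks 4096 hIn hOut
```

Only one chunk is held in memory at a time, so the file size does not matter. It could be called from `main` as `copyFileChunked "Foo.bin" "Bar.bin"`.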
Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as characters / integers / doubles / whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.
The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
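To make the "individual items" idea concrete, here is a small sketch that decodes a buffer into `Word32` values and encodes them back using only `Data.Binary.Get` and `Data.Binary.Put`. The helper names `decodeWord32s` and `encodeWord32s` are hypothetical, and little-endian byte order is an assumption about the file; swap in `getWord32be` and `putWord32be` if the data is big-endian.

```haskell
import qualified Data.ByteString.Lazy as BL
import Control.Monad (replicateM)
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import Data.Word (Word32)

-- Decode as many whole little-endian 32-bit values as the buffer holds.
-- Up to three trailing bytes that do not form a whole value are ignored.
decodeWord32s :: BL.ByteString -> [Word32]
decodeWord32s bytes = runGet (replicateM count getWord32le) bytes
  where
    count = fromIntegral (BL.length bytes `div` 4)

-- Encode a list of values as little-endian 32-bit words.
encodeWord32s :: [Word32] -> BL.ByteString
encodeWord32s ws = runPut (mapM_ putWord32le ws)
```

For example, `encodeWord32s [1, 1000]` yields the eight bytes `1 0 0 0 232 3 0 0`, and `decodeWord32s` turns them back into `[1, 1000]`.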
I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.
Now with working code:
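module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO

main :: IO ()
main = do
  h_in  <- openBinaryFile "Foo.bin" ReadMode
  h_out <- openBinaryFile "Bar.bin" WriteMode
  -- Process a fixed number of chunks; there is no EOF check here.
  replicateM_ 1000 (process_chunk h_in h_out)
  hClose h_in
  hClose h_out

-- Chunk size in bytes; must be a multiple of int_size.
chunk_size :: Int
chunk_size = 1000

-- Size of one 32-bit value in bytes.
int_size :: Int
int_size = 4

-- Read one chunk, decode it, double the small values, write it back out.
process_chunk :: Handle -> Handle -> IO ()
process_chunk h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
  let ints2 = map (\x -> if x < 1000 then 2*x else x) ints1
  let bin2  = runPut (mapM_ putWord32le ints2)
  BIN.hPut h_out bin2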
This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.
Obviously if you did this "for real" you'd want EOF checking and such.
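One possible way to add the EOF checking is sketched below. It is not the only way to do it. Instead of processing a fixed number of chunks, it loops until `hGet` returns an empty chunk and decodes however many whole values were actually read. The names `doubleSmall`, `transformChunk` and `processFile` are invented for this sketch, and little-endian data is assumed as in the code above.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Control.Monad (replicateM)
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import Data.Word (Word32)
import System.IO

-- The transformation from the question: double values below 1000.
doubleSmall :: Word32 -> Word32
doubleSmall x = if x < 1000 then 2 * x else x

-- Decode one strict chunk, transform it, and encode it again.
-- Trailing bytes that do not form a whole 32-bit value are dropped.
transformChunk :: BS.ByteString -> BL.ByteString
transformChunk chunk = runPut (mapM_ (putWord32le . doubleSmall) ints)
  where
    count = BS.length chunk `div` 4
    ints  = runGet (replicateM count getWord32le) (BL.fromStrict chunk)

-- Process the whole file, stopping when hGet returns an empty chunk.
-- chunkBytes should be a positive multiple of 4.
processFile :: Int -> FilePath -> FilePath -> IO ()
processFile chunkBytes src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      loop hIn hOut
  where
    loop hIn hOut = do
      chunk <- BS.hGet hIn chunkBytes
      if BS.null chunk
        then return ()
        else do
          BL.hPut hOut (transformChunk chunk)
          loop hIn hOut
```

A call such as `processFile 4000 "Foo.bin" "Bar.bin"` from `main` would then handle a file of any length, including a final chunk shorter than the others.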
The best way to work with binary I/O in Haskell is to use bytestrings. Lazy bytestrings read the file in chunks on demand, so you don't need to manage the buffering yourself.
The code below assumes that the length of every chunk is a multiple of 4 bytes, i.e. one 32-bit value (which is the case for the default chunk size).
module Main where

import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict

-- Convert one bytestring chunk to a list of integers
-- and append the result of converting the later chunks.
-- It actually appends only a closure, which evaluates the next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
  where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers  = runGet getNumberList (BS.fromStrict chunk)

main :: IO ()
main = do
  -- every operation below is done lazily, consuming input as necessary
  input <- BS.readFile "in.dat"
  let inNumbers  = BS.foldrChunks toNumbers [] input
  let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
  let output     = runPut (mapM_ putWord32le outNumbers)
  -- Here the lazy bytestring output is evaluated and saved chunk
  -- by chunk, pulling data from the input file, decoding, processing
  -- and encoding it back one chunk at a time
  BS.writeFile "out.dat" output
Here is a loop to process one line at a time from stdin:
import System.IO

loop = do b <- hIsEOF stdin
          if b then return ()
               else do str <- hGetLine stdin
                       let str' = ...process str...
                       hPutStrLn stdout str'
                       loop  -- go round again for the next line
Now just replace hGetLine with something that reads 4 bytes, etc.
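A rough sketch of what that replacement might look like: read exactly four bytes with `Data.ByteString.hGet`, assemble them into a `Word32` by hand, and stop when fewer than four bytes come back. The names `fromLE` and `forEachWord32` are hypothetical, and little-endian byte order is assumed. Reading one value per call is likely to be slower than reading large chunks, so treat this as an illustration of the loop shape rather than a tuned solution.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)
import System.IO

-- Assemble a little-endian Word32 from a four-byte ByteString:
-- the first byte ends up in the lowest eight bits.
fromLE :: BS.ByteString -> Word32
fromLE = BS.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Read 32-bit values one at a time and run an action on each.
-- Stops at end of file, or at a truncated final value.
forEachWord32 :: Handle -> (Word32 -> IO ()) -> IO ()
forEachWord32 h step = loop
  where
    loop = do
      bytes <- BS.hGet h 4
      if BS.length bytes < 4
        then return ()
        else do
          step (fromLE bytes)
          loop
```

For instance, `withBinaryFile "in.dat" ReadMode (\h -> forEachWord32 h print)` would print every value in the file, one per line.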
Here is the I/O section for Data.ByteString:
https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29