Haskell read/write binary files complete working e

Posted 2019-02-18 22:08

I would appreciate a complete working example that does the following in Haskell:

Read a very large sequence (more than 1 billion elements) of 32-bit int values from a binary file into an appropriate container (certainly not a list, for performance reasons), double each number that is less than 1000 (decimal), and then write the resulting 32-bit int values to another binary file. I may not want to read the entire contents of the binary file into memory at once; I want to read one chunk after another.

I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and so on only add to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls, and be done with it. In Haskell it didn't seem so easy, at least to me.

I'd appreciate it if your solution used the best standard packages available with mainstream Haskell (> GHC 7.10) rather than obscure or obsolete ones.

I have read these pages:

https://wiki.haskell.org/Binary_IO

https://wiki.haskell.org/Dealing_with_binary_data

3 answers
欢心
#2 · 2019-02-18 22:50

If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)

Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as characters, integers, doubles, or whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.

The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
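
As a minimal sketch of the individual-item functions mentioned above, the following decodes and encodes a single Word32. The names `sample`, `asLittleEndian`, `asBigEndian` and `encoded` are made up for illustration; only `runGet`, `runPut`, `getWord32le`, `getWord32be` and `putWord32le` come from the binary package.

```haskell
import Data.Word (Word32)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le, getWord32be)
import Data.Binary.Put (runPut, putWord32le)

-- Four bytes, least significant first: 1000 = 0x000003E8.
-- (Illustrative value, not taken from the question.)
sample :: BL.ByteString
sample = BL.pack [0xE8, 0x03, 0x00, 0x00]

-- The same four bytes decoded under each byte order.
asLittleEndian, asBigEndian :: Word32
asLittleEndian = runGet getWord32le sample   -- 1000
asBigEndian    = runGet getWord32be sample   -- 0xE8030000

-- Encoding goes the other way: one Word32 becomes four bytes.
encoded :: BL.ByteString
encoded = runPut (putWord32le 1000)
```

Which byte order is correct depends on how the file was written; a file dumped from a C array on x86 would typically be little-endian.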

I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.

Now with working code:

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO

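main :: IO ()
main = do
  -- Binary mode makes it explicit that no text translation is wanted.
  h_in  <- openBinaryFile "Foo.bin" ReadMode
  h_out <- openBinaryFile "Bar.bin" WriteMode
  -- replicateM_ discards the results instead of collecting 1000 units.
  replicateM_ 1000 (process_chunk h_in h_out)
  hClose h_in
  hClose h_out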

chunk_size = 1000
int_size = 4

process_chunk h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
  let ints2 = map (\ x -> if x < 1000 then 2*x else x) ints1
  let bin2 = runPut (mapM_ putWord32le ints2)
  BIN.hPut h_out bin2

This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.

Obviously if you did this "for real" you'd want EOF checking and such.
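
One possible shape for that EOF checking is sketched below. It is an assumption-laden variant rather than a drop-in replacement: it reads strict chunks with `Data.ByteString.hGet` and stops when an empty chunk comes back. The names `chunkBytes`, `transform`, `processChunk`, `copyLoop` and `processFile` are hypothetical, and the chunk size of 4096 is arbitrary.

```haskell
import Data.Word (Word32)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import Control.Monad (replicateM)
import System.IO

-- Bytes requested per read; assumed to be a multiple of 4.
chunkBytes :: Int
chunkBytes = 4096

-- Double every value below 1000, as the question asks.
transform :: Word32 -> Word32
transform x = if x < 1000 then 2 * x else x

-- Decode a strict chunk into Word32 values, transform, and re-encode.
-- A trailing 1-3 bytes that do not form a whole integer are dropped.
processChunk :: B.ByteString -> BL.ByteString
processChunk chunk = runPut (mapM_ (putWord32le . transform) ints)
  where
    n    = B.length chunk `div` 4
    ints = runGet (replicateM n getWord32le) (BL.fromStrict chunk)

-- Keep reading until hGet returns an empty chunk, which signals EOF.
copyLoop :: Handle -> Handle -> IO ()
copyLoop hIn hOut = do
  chunk <- B.hGet hIn chunkBytes
  if B.null chunk
    then return ()
    else BL.hPut hOut (processChunk chunk) >> copyLoop hIn hOut

-- Open both files in binary mode and close them even on exceptions.
processFile :: FilePath -> FilePath -> IO ()
processFile src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      copyLoop hIn hOut
```

Calling `processFile "Foo.bin" "Bar.bin"` would then handle a file of any length, with only one chunk decoded at a time.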

Evening l夕情丶
#3 · 2019-02-18 22:58

The best way to work with binary I/O in Haskell is to use bytestrings. Lazy bytestrings provide buffered I/O, so you don't need to manage buffering yourself.

The code below assumes that the size of each chunk is a multiple of 4 bytes (32 bits), which it is.

module Main where

import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict

-- Convert one bytestring chunk to the list of integers
-- and append the result of conversion of the later chunks.
-- It actually appends only closure which will evaluate next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
    where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers = runGet getNumberList (BS.fromStrict chunk)

main :: IO ()
main = do
    -- every operation below is done lazily, consuming input as necessary
    input <- BS.readFile "in.dat"
    let inNumbers = BS.foldrChunks toNumbers [] input
    let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
    let output = runPut (mapM_ putWord32le outNumbers)
    -- Here the lazy bytestring output is evaluated and saved chunk
    -- by chunk, pulling data from the input file, decoding, processing
    -- and encoding it back one chunk at a time.
    BS.writeFile "out.dat" output
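
If the chunk-alignment assumption is a concern, a decoder that ignores chunk boundaries is one way around it. The sketch below is an alternative technique, not the code above: it peels four bytes at a time off the lazy bytestring with `splitAt`, so a value that straddles two chunks is still decoded. `decodeAll` and `misaligned` are hypothetical names, and this version is likely slower than decoding whole chunks.

```haskell
import Data.Word (Word32)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)

-- Lazily decode little-endian Word32 values, four bytes at a time.
-- Fewer than four trailing bytes are ignored.
decodeAll :: BL.ByteString -> [Word32]
decodeAll bs
  | BL.length front < 4 = []
  | otherwise           = runGet getWord32le front : decodeAll rest
  where
    -- front holds at most 4 bytes, so taking its length is cheap
    -- and does not force the rest of the input.
    (front, rest) = BL.splitAt 4 bs

-- Two chunks of 3 and 6 bytes: the first value spans both chunks,
-- and the final byte is left over.
misaligned :: BL.ByteString
misaligned = BL.fromChunks [B.pack [1, 0, 0], B.pack [0, 2, 0, 0, 0, 9]]
```

Here `decodeAll misaligned` yields the values 1 and 2 and drops the stray last byte.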
时光不老,我们不散
#4 · 2019-02-18 23:00

Here is a loop to process one line at a time from stdin:

import System.IO

loop = do b <- hIsEOF stdin
          if b then return ()
               else do str <- hGetLine stdin
                       let str' = ...process str...
                       hPutStrLn stdout str'
                       loop  -- recurse to handle the next line

Now just replace hGetLine with something that reads 4 bytes, etc.
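
A sketch of that substitution, under the assumption that the values are little-endian: `hGetLine` becomes a 4-byte `Data.ByteString.hGet`, and the end-of-input test becomes a check for a short read. The name `loopWords` is hypothetical, and the handles are passed as parameters so the loop is not tied to stdin.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import System.IO

-- Read one 4-byte value per iteration until fewer than 4 bytes remain.
loopWords :: Handle -> Handle -> IO ()
loopWords hIn hOut = do
  bytes <- B.hGet hIn 4
  if B.length bytes < 4
    then return ()   -- end of input, or a truncated trailing value
    else do
      let x  = runGet getWord32le (BL.fromStrict bytes)
          x' = if x < 1000 then 2 * x else x
      BL.hPut hOut (runPut (putWord32le x'))
      loopWords hIn hOut
```

To run it on the standard streams, one would probably call `hSetBinaryMode stdin True` and `hSetBinaryMode stdout True` first and then `loopWords stdin stdout`. Reading one value per call is simple but likely too slow for a billion elements; reading larger chunks amortizes the per-call overhead.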

Here is the I/O section for Data.ByteString:

https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29
