Haskell Read Last Line with a Lazy mmap

Posted 2019-07-15 04:49

I want to read the last line of my file and make sure it has the same number of fields as the first; I don't care about anything in the middle. I'm using mmap because it's fast for random access on large files, but I'm running into problems because I don't fully understand Haskell or laziness.

λ> import qualified Data.ByteString.Lazy.Char8 as LB
λ> import System.IO.MMap
λ> outh <- mmapFileByteStringLazy fname Nothing 
λ> LB.length outh
87094896
λ> LB.takeWhile (`notElem` "\n") outh
"\"Field1\",\"Field2\",

Great.

From here, I know that

takeWhileR p xs is equivalent to reverse (takeWhileL p (reverse xs)).

So let's build this: get the last line by reversing my lazy bytestring, taking characters while they are not "\n" just as before, then reversing the result back. Laziness makes me think the compiler will let me do this cheaply.

So trying this out:

LB.reverse (LB.takeWhile (`notElem` "\n") (LB.reverse outh))

What I expect to see is:

"\"val1\",\"val2\",

Instead, this crashes my session.

Segmentation fault (core dumped)

Questions:

  1. What am I doing wrong with laziness, or bytestrings, or the mmap library, or Haskell?
  2. How can I get this line correctly and with memory efficiency? (The answer possibly using foreign pointers instead of lazy bytestrings?)

For other readers, if you're looking to get the last line, you may find a very fast and suitable method described in the answer here: hSeek and SeekFromEnd in Haskell

In this thread, I'm looking specifically for a solution using mmap.

1 answer
Root(大扎)
#2 · 2019-07-15 05:33

I would prefer bytestring-mmap, which is by the same author as bytestring. In either case, all you need is:

import System.IO.Posix.MMap (unsafeMMapFile)
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  -- can be swapped out for `mmapFileByteString` from `mmap`
  bs <- unsafeMMapFile "file.txt"

  let (firstLine, _) = BS.break (== '\n') bs
      (_, lastLine) = BS.breakEnd (== '\n') bs

  putStrLn $ "First line: " ++ BS.unpack firstLine
  putStrLn $ "Last line: " ++ BS.unpack lastLine

This runs instantly too, with no extra allocations. As before, there is the caveat that many files end in a newline, so one may want to use BS.breakEnd (== '\n') (BS.init bs) to ignore the final \n character.
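To make the trailing-newline caveat and the field comparison from the question concrete, here is a minimal sketch that works on any strict ByteString, including one obtained from mmap. The helper names (dropTrailingNewline, firstLine, lastLine, fieldCount, sameFieldCount) are hypothetical, and the field count is deliberately naive: it assumes fields are comma-separated and that no quoted field contains a comma.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Hypothetical helper: drop a single trailing '\n' if there is one.
-- BS.init is O(1) on a strict ByteString, so nothing is copied.
dropTrailingNewline :: BS.ByteString -> BS.ByteString
dropTrailingNewline bs
  | not (BS.null bs) && BS.last bs == '\n' = BS.init bs
  | otherwise = bs

-- Everything before the first '\n'.
firstLine :: BS.ByteString -> BS.ByteString
firstLine = fst . BS.break (== '\n')

-- Everything after the last '\n', ignoring one trailing newline.
lastLine :: BS.ByteString -> BS.ByteString
lastLine = snd . BS.breakEnd (== '\n') . dropTrailingNewline

-- Naive field count: number of commas plus one.
-- Assumption: no quoted field contains a comma.
fieldCount :: BS.ByteString -> Int
fieldCount = (+ 1) . BS.count ','

-- True when the first and last lines have the same number of fields.
sameFieldCount :: BS.ByteString -> Bool
sameFieldCount bs = fieldCount (firstLine bs) == fieldCount (lastLine bs)
```

With the bs value from the snippet above, sameFieldCount bs would perform the check the question asks for. Real CSV data with quoted commas would need a proper parser for the two lines instead of fieldCount.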

Also, I would not recommend reversing the bytestring - that will require at least some allocations, which are in this case completely avoidable. Even if you use a lazy bytestring, you still pay the cost of going through all the chunks of the bytestring (which hopefully shouldn't even have been constructed at this point). That said, your reversing code should work. I reckon something is off with mmap (probably in the package itself, as doing the same thing with a strict bytestring works just fine).

Previous answer, from before OP's edit

I'm not sure what the problem is with the functions in System.IO. The following runs instantly on my laptop, with file.txt being almost 4 GB. It isn't elegant, but it is certainly efficient.

import System.IO

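-- Walk backwards from the end of the file, one character at a time,
-- accumulating characters until a '\n' is read.
-- Note: if the file ends in '\n', the first character read is that
-- newline and the result is ""; if the file has no '\n' at all,
-- hSeek eventually seeks before the start and throws.
hGetLastLine :: Handle -> IO String
hGetLastLine hdl = go "" (negate 1)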
  where
  go s i = do
    hSeek hdl SeekFromEnd i
    c <- hGetChar hdl
    if c == '\n'
      then pure s
      else go (c:s) (i-1)


main = do
  handle <- openFile "file.txt" ReadMode

  firstLine <- hGetLine handle
  putStrLn $ "First line: " ++ firstLine

  lastLine <- hGetLastLine handle
  putStrLn $ "Last line: " ++ lastLine
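
The backward loop above returns an empty string when the file ends in a newline, and throws when the file contains no newline at all. A minimal sketch of a variant that handles both cases follows. The name hGetLastLineSafe is hypothetical, and the sketch assumes the handle is in binary mode (or the file is plain ASCII), since seeking to an arbitrary byte offset and decoding a multi-byte character from there is not reliable.

```haskell
import System.IO

-- Hypothetical variant of hGetLastLine: skips one trailing '\n'
-- and stops cleanly when it reaches the start of the file.
-- Assumption: binary-mode handle or single-byte (ASCII) content.
hGetLastLineSafe :: Handle -> IO String
hGetLastLineSafe hdl = do
  size <- hFileSize hdl
  go "" (size - 1) True
  where
    -- acc:   characters collected so far, already in file order
    -- pos:   absolute offset of the next character to read
    -- atEnd: True only while looking at the very last character
    go acc pos atEnd
      | pos < 0 = pure acc            -- reached start of file
      | otherwise = do
          hSeek hdl AbsoluteSeek pos
          c <- hGetChar hdl
          case c of
            '\n' | atEnd     -> go acc (pos - 1) False  -- ignore trailing newline
                 | otherwise -> pure acc                -- end of previous line
            _ -> go (c : acc) (pos - 1) False
```

It can be used in place of hGetLastLine in main, for example with a handle from openBinaryFile "file.txt" ReadMode. Like the original, it performs one seek per character, so it is only a good fit when the last line is short.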