What are pipes/conduit trying to solve?

Posted 2020-05-12 21:17

Question:

I have seen people recommending pipes/conduit library for various lazy IO related tasks. What problem do these libraries solve exactly?

Also, when I look for a library on Hackage, there are often three different versions of it. Example:

  • attoparsec
  • pipes-attoparsec
  • attoparsec-conduit

This confuses me. For my parsing tasks, should I use attoparsec or pipes-attoparsec/attoparsec-conduit? What benefit do the pipes/conduit versions give me compared to plain vanilla attoparsec?

Answer 1:

Lazy IO

Lazy IO works like this

readFile :: FilePath -> IO ByteString

where the resulting ByteString is guaranteed to only be read chunk-by-chunk, as it is consumed. To do so we could (almost) write

-- given `readChunk` which reads a chunk beginning at n
readChunk :: FilePath -> Int -> IO (Int, ByteString)

readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    chunks      <- readChunks n'
    return (chunk <> chunks)

but here we note that the IO action readChunks n' is performed prior to returning even the partial result available as chunk. This means we're not lazy at all. To combat this we use unsafeInterleaveIO

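import System.IO.Unsafe (unsafeInterleaveIO)

readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    -- defer the remaining reads until `chunks` is actually demanded
    chunks      <- unsafeInterleaveIO (readChunks n')
    return (chunk <> chunks)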

which causes readChunks n' to return immediately, thunking an IO action to be performed only when that thunk is forced.

That's the dangerous part: by using unsafeInterleaveIO we've delayed a bunch of IO actions to non-deterministic points in the future that depend upon how we consume our chunks of ByteString.
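
To make the deferral visible, here is a minimal sketch that uses only base. The names lazyChunks and demo are hypothetical, and an IORef counter stands in for the real file reads, so this illustrates the mechanism rather than what readFile actually does internally.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- | Build a lazy list of "chunks" 1..limit. Producing each chunk bumps a
-- counter, standing in for a real IO effect such as reading from a handle.
lazyChunks :: IORef Int -> Int -> IO [Int]
lazyChunks counter limit = go 1
  where
    go n
      | n > limit = return []
      | otherwise = do
          modifyIORef' counter (+ 1)               -- the observable effect
          rest <- unsafeInterleaveIO (go (n + 1))  -- deferred until forced
          return (n : rest)

-- | Report how many "reads" have happened (a) right after building the
-- list and (b) after forcing its first three elements.
demo :: IO (Int, Int)
demo = do
  counter <- newIORef 0
  chunks  <- lazyChunks counter 10
  before  <- readIORef counter
  let firstThree = take 3 chunks
  after   <- length firstThree `seq` readIORef counter
  return (before, after)
```

Running demo in GHCi should give (1,3): one read happens when the list is built, and the other two happen only because pure code later forced the list. That is the sense in which the effects now depend on how the chunks are consumed.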

Fixing the problem with coroutines

What we'd like to do is slide a chunk processing step in between the call to readChunk and the recursion on readChunks.

readFileCo :: Monoid a => FilePath -> (ByteString -> IO a) -> IO a
readFileCo fp action = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    a           <- action chunk
    as          <- readChunks n'
    return (a <> as)

Now we've got the chance to perform arbitrary IO actions after each small chunk is loaded. This lets us do much more work incrementally without completely loading the ByteString into memory. Unfortunately, it's not terrifically compositional: we need to build our consumption action and pass it to our ByteString producer in order for it to run.
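
A runnable variant of this callback style might look like the sketch below. It is not the code above made literal. It assumes the bytestring package, replaces the hypothetical readChunk with Data.ByteString.hGetSome on an open handle, and threads an accumulator instead of recursing before combining. The names foldChunks and countBytes and the 4096-byte chunk size are illustrative choices.

```haskell
import qualified Data.ByteString as BS
import Data.Monoid (Sum (..))
import System.IO (Handle, IOMode (ReadMode), withFile)

-- | Read a handle in chunks of at most 'chunkSize' bytes, run 'action' on
-- each chunk as soon as it has been read, and combine the results.
foldChunks :: Monoid a => Int -> Handle -> (BS.ByteString -> IO a) -> IO a
foldChunks chunkSize h action = go mempty
  where
    go acc = do
      chunk <- BS.hGetSome h chunkSize
      if BS.null chunk              -- hGetSome returns "" at end of file
        then return acc
        else do
          a <- action chunk
          let acc' = acc <> a
          acc' `seq` go acc'        -- keep the accumulator evaluated

-- | Count the bytes in a file while holding only one chunk at a time.
countBytes :: FilePath -> IO Int
countBytes fp =
  withFile fp ReadMode $ \h ->
    getSum <$> foldChunks 4096 h (return . Sum . BS.length)
```

The composition problem is visible in countBytes: the consumer has to be packaged as a function and handed to the producer, so adding a second processing stage means rewriting that callback.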

Pipes-based IO

This is essentially what pipes solves: it allows us to compose effectful coroutines with ease. For instance, we now write our file reader as a Producer, which can be thought of as "streaming" the chunks of the file when its effect is finally run.

produceFile :: FilePath -> Producer ByteString IO ()
produceFile fp = produce 0 where
  produce n = do
    (n', chunk) <- liftIO (readChunk fp n)
    yield chunk
    produce n'

Note the similarities between this code and readFileCo above: we simply replace the call to the coroutine action with yielding the chunk we've produced so far. This call to yield builds a Producer instead of a raw IO action, which we can compose with other pipes types in order to build a nice consumption pipeline of type Effect IO ().

All of this pipe building gets done statically without actually invoking any of the IO actions. This is how pipes lets you write your coroutines more easily. All of the effects get triggered at once when we call runEffect in our main IO action.

runEffect :: Effect IO () -> IO ()
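
As a rough end-to-end illustration, the sketch below builds a chunk Producer and composes it with other stages. It assumes the pipes and bytestring packages. The names fromHandleChunks, fileSize and printChunkSizes are hypothetical, and Data.ByteString.hGetSome stands in for the readChunk helper used above.

```haskell
import Control.Monad (unless)
import qualified Data.ByteString as BS
import Pipes
import qualified Pipes.Prelude as P
import System.IO (Handle, IOMode (ReadMode), withFile)

-- | Stream a handle's contents as chunks of at most 'chunkSize' bytes.
fromHandleChunks :: Int -> Handle -> Producer BS.ByteString IO ()
fromHandleChunks chunkSize h = go
  where
    go = do
      chunk <- liftIO (BS.hGetSome h chunkSize)
      unless (BS.null chunk) $ do   -- an empty chunk means end of file
        yield chunk
        go

-- | Compose the producer with a pipe using (>->) and fold the result.
-- No reads happen until P.sum drives the pipeline.
fileSize :: FilePath -> IO Int
fileSize fp =
  withFile fp ReadMode $ \h ->
    P.sum (fromHandleChunks 4096 h >-> P.map BS.length)

-- | The same producer wired to a consumer and run with runEffect.
printChunkSizes :: FilePath -> IO ()
printChunkSizes fp =
  withFile fp ReadMode $ \h ->
    runEffect (fromHandleChunks 4096 h >-> P.map BS.length >-> P.print)
```

Unlike the callback version, the producer here knows nothing about its consumers, so the same fromHandleChunks can be reused in both pipelines.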

Attoparsec

So why would you want to plug attoparsec into pipes? Well, attoparsec is optimized for lazy (incremental) parsing: it can consume its input chunk by chunk. If you are producing the chunks fed to an attoparsec parser in an effectful way then you'll be at an impasse. You could

  1. Use strict IO and load the entire string into memory, only to consume it lazily with your parser. This is simple and predictable, but inefficient.
  2. Use lazy IO and lose the ability to reason about when your production IO effects will actually run. Depending on the consumption schedule of your parsed items, this can cause resource leaks or closed-handle exceptions. This is more efficient than (1) but can easily become unpredictable.
  3. Use pipes (or conduit) to build up a system of coroutines that includes your lazy attoparsec parser, allowing it to operate on as little input as it needs while producing parsed values as lazily as possible across the entire stream.
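
The incremental interface that makes option 3 possible can be tried without any streaming library. The sketch below assumes the attoparsec and bytestring packages and uses only attoparsec's own parse and feed functions. A plain list of chunks stands in for the stream, and pairP and parseChunks are hypothetical names. It makes no claim about how pipes-attoparsec or attoparsec-conduit are written.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- | A small example parser: two decimal numbers separated by a comma.
pairP :: A.Parser (Int, Int)
pairP = (,) <$> A.decimal <* A.char ',' <*> A.decimal

-- | Feed a parser one chunk at a time. Empty chunks are dropped first,
-- because feeding an empty chunk tells attoparsec that the input has ended.
parseChunks :: A.Parser a -> [B.ByteString] -> Either String a
parseChunks p chunks =
  case filter (not . B.null) chunks of
    []       -> A.parseOnly p B.empty
    (c : cs) -> finish (foldl A.feed (A.parse p c) cs)
  where
    -- Signal end of input, then convert the final result.
    finish r = A.eitherResult (A.feed r B.empty)
```

For example, parseChunks pairP (map B.pack ["1", "2,3", "4"]) should yield Right (12,34) even though both numbers are split across chunk boundaries. A streaming library's job is then to supply those chunks from an effectful source as the parser asks for them.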


Answer 2:

If you want to use attoparsec, use attoparsec

For my parsing tasks should I use attoparsec or pipes-attoparsec/attoparsec-conduit?

Both pipes-attoparsec and attoparsec-conduit transform a given attoparsec Parser into a sink/conduit or pipe. Therefore you have to use attoparsec either way.

What benefit do the pipes/conduit versions give me compared to plain vanilla attoparsec?

They work with pipes and conduit, where the vanilla one won't (at least not out-of-the-box).

If you don't use conduit or pipes, and you're satisfied with the current performance of your lazy IO, there's no need to change your current flow, especially if you're not writing a big application or processing large files. You can simply use attoparsec.

However, that assumes that you know the drawbacks of lazy IO.

What's the matter with lazy IO? (Problem study withFile)

Let's not forget your first question:

What problem do these libraries solve exactly?

They solve the streaming data problem (see references 1 and 3) that arises in functional languages with lazy IO. Lazy IO sometimes doesn't give you what you want (see the example below), and sometimes it's hard to determine the actual system resources needed by a specific lazy operation (is the data read/written in chunks/bytes/buffered/onclose/onopen…?).

Example for over-laziness

import System.IO
main = withFile "myfile" ReadMode hGetContents
       >>= return . (take 5)
       >>= putStrLn

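This won't print the file's contents, since the data is only evaluated inside putStrLn, and by that point withFile has already closed the handle. Depending on the GHC version, you get either empty output or a "delayed read on closed handle" error.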

Fixing fire with poisonous acid

While the following snippet fixes this, it has another nasty feature:

main = withFile "myfile" ReadMode $ \handle -> 
           hGetContents handle
       >>= return . (take 5)
       >>= putStrLn

In this case hGetContents will read all of the file, something you didn't expect at first. If you just want to check the magic bytes of a file which could be several GB in size, this is not the way to go.
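
For the magic-bytes case in particular, one option that avoids lazy IO altogether is a strict, bounded read. The following is a minimal sketch assuming the bytestring package. The names readPrefix and hasMagic are hypothetical.

```haskell
import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode), withBinaryFile)

-- | Strictly read at most 'n' bytes from the start of a file. The read is
-- finished before withBinaryFile closes the handle.
readPrefix :: Int -> FilePath -> IO BS.ByteString
readPrefix n fp = withBinaryFile fp ReadMode $ \h -> BS.hGet h n

-- | Check whether a file starts with the given magic bytes.
hasMagic :: BS.ByteString -> FilePath -> IO Bool
hasMagic magic fp = (== magic) <$> readPrefix (BS.length magic) fp
```

Because BS.hGet returns a strict ByteString, nothing is left to evaluate after the handle is closed, and the amount of data read is bounded by n regardless of the file's size.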

Using withFile correctly

The solution is, obviously, to do the processing inside the withFile context:

main = withFile "myfile" ReadMode $ \handle -> 
           fmap (take 5) (hGetContents handle)
       >>= putStrLn

This is, by the way, also the solution mentioned by the author of pipes:

This [..] answers a question people sometimes ask me about pipes, which I will paraphrase here:

If resource management is not a core focus of pipes, why should I use pipes instead of lazy IO?

Many people who ask this question discovered stream programming through Oleg, who framed the lazy IO problem in terms of resource management. However, I never found this argument compelling in isolation; you can solve most resource management issues simply by separating resource acquisition from the lazy IO, like this: [see last example above]

Which brings us back to my previous statement:

You can simply use attoparsec [...][with lazy IO, assuming] that you know the drawbacks of lazy IO.

References

  • Iteratee I/O, which explains the example better and provides a better overview
  • Gabriel Gonzalez (maintainer/author of pipes): Reasoning about stream programming
  • Michael Snoyman (maintainer/author of conduit): Conduit versus Enumerator


Answer 3:

Here's a great podcast with the authors of both libraries:

http://www.haskellcast.com/episode/006-gabriel-gonzalez-and-michael-snoyman-on-pipes-and-conduit/

It'll answer most of your questions.


In short, both of those libraries approach the problem of streaming, which is very important when dealing with IO. In essence, they manage the transfer of data in chunks, allowing you to, e.g., transfer a 1 GB file while consuming just 64 KB of RAM on both the server and the client. Without streaming you would have to allocate the full 1 GB on both ends.
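
The constant-memory idea can be sketched with a hand-written loop, before any streaming library enters the picture. This assumes the bytestring package. copyChunks and copyFileStreaming are hypothetical names, and the 64 KiB chunk size simply mirrors the figure above.

```haskell
import Control.Monad (unless)
import qualified Data.ByteString as BS
import System.IO (Handle, IOMode (ReadMode, WriteMode), withBinaryFile)

-- | Copy everything from one handle to another, holding at most one chunk
-- of 'chunkSize' bytes in memory at a time.
copyChunks :: Int -> Handle -> Handle -> IO ()
copyChunks chunkSize src dst = loop
  where
    loop = do
      chunk <- BS.hGetSome src chunkSize
      unless (BS.null chunk) $ do   -- an empty chunk means end of input
        BS.hPut dst chunk
        loop

-- | Copy a file of any size using 64 KiB chunks.
copyFileStreaming :: FilePath -> FilePath -> IO ()
copyFileStreaming from to =
  withBinaryFile from ReadMode $ \src ->
    withBinaryFile to WriteMode $ \dst ->
      copyChunks (64 * 1024) src dst
```

Loops like this work, but each one hard-codes its source, its sink and any transformation in between. Letting those pieces be written separately and then composed is the part the streaming libraries aim to provide.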

An older alternative to those libraries is lazy IO, but it is filled with issues and makes applications error-prone. Those issues are discussed in the podcast.

As for which of those libraries to use, it's more a matter of taste. I prefer "pipes". The detailed differences are discussed in the podcast too.