I want to use xml-conduit
, specifically Text.XML.Stream.Parse
in order to lazily extract a list of objects from a large XML file.
As a test case, I use the recently re-released StackOverflow data dumps. To keep it simple, I intend to extract all usernames from stackoverflow.com-Users.7z
. Even if the file is a .7z
, file
says it is just bzip2-compressed data (there might be some 7zip stuff at the end of the file, but right now I don't care).
A simplified version of the XML would be
<users>
<row id="1" DisplayName="StackOverflow"/>
...
<row id="2597135" DisplayName="Uli Köhler"/>
...
</users>
Based on this previous Q&A and the example on Hackage stream-reading the example XML in bz2-ed form works perfectly for me
However, when using runghc
to run the following program, it runs without printing any output:
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit (runResourceT, ($$), ($=))
import qualified Data.Conduit.Binary as CB
import Data.Conduit.BZlib
import Data.Conduit
import Data.Text (Text)
import System.IO
import Text.XML.Stream.Parse
import Control.Applicative ((<*))
data User = User {name :: Text} deriving (Show)
parseUserRow = tagName "row" (requireAttr "DisplayName" <* ignoreAttrs) $ \displayName -> do
return $ User displayName
parseUsers = tagNoAttr "users" $ many parseUserRow
main = do
users <- runResourceT $ CB.sourceFile "stackoverflow.com-Users.7z" $= bunzip2 $= parseBytes def $$ force "users required" parseUsers
putStrLn $ unlines $ map show users
I assume this issue occurs because Haskell tries to deeply evaluate the users
list before starting to print it. This theory is supported by the memory usage of the program continually growing about 2 percent per second (source: htop).
How can I "live-stream" the results to stdout? I assume this is possible by adding another conduit statement like $$ CB.sinkFile "output.txt"
at the end. This specific version however expects a Conduit
output of ByteString
. Could you point me in the right direction where to go from here?
Any help will be appreciated!
Made an edit to bring the insightful example from M. Snoyman up to date, but it was tossed by mediocre power-trippers. Therefore, this.
The original won't compile anymore and produces many deprecated warnings (legacy syntax).
ghc 8.6.5, xml-conduit 1.9.0.0
Based on Michael Snoyman's excellent answer here is a modified version that reads the data from
stackoverflow.com-Users.7z
instead of sourcing it from an artificially generated IO stream.For a reference on how to use
xml-conduit
directly, please see Michael's answer. This answer is only provided as an example on how to use the method described there on optionally compressed files.The main change here is that you need to use
runResourceT
to read the file, and the finalprint
needs to belift
ed fromIO ()
toResourceT IO ()
Let me start by saying that the streaming helper API in xml-conduit has not be worked on in years, and could probably benefit from a reimagining given changes that have happened to conduit in the interim. I think there are likely much better ways to accomplish things.
That said, let me explain the problem you're seeing. The
many
function creates a list of results, and will not produce any values until it has finished processing. In your case, there are so many values that this appears to never happen. Ultimately, when the entire file has been read, the entire list of users will be displayed at once. But that's clearly not the behavior you're looking for.Instead, what you want to do is create a stream of
User
values which are produced as soon as they're ready. What you want to do is basically replace themany
function call with a new function which willyield
a result each time it's parsed. A simple implementation of this could be:Also, instead of using
putStrLn $ unlines $ map show
, you want to attach the entire pipeline to a consumer which will print each individually yieldedUser
value. This can be implemented easily withData.Conduit.List.mapM_
, e.g.:CL.mapM_ (liftIO . print)
.I've put together a full example based on your code. The input is an artificially generated infinite XML file, just to prove the point that it really is yielding output immediately.