I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document-order by a particular user all ran into nasty thrashing.
TagSoup
import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
    posts <- liftM parseTags (readFile "posts.xml")
    print $ head $ map (fromAttrib "Id") $
            filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts
hxt
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
    runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"
xml
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
    [posts, votes] <- forM ["posts", "votes"] $
        liftM parseXML . readFile . (++ ".xml")
    let ps = elemNamed "posts" posts
    putStrLn $ maybe "<not present>" show
             $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((== name) . qName . elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (== id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, since String == [Char] is an inappropriate representation for very large flat files.
That in turn implies you'll need a Haskell XML library that supports bytestrings. The couple of dozen XML libraries are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which of them support bytestrings, but that's the property you're looking for.
Below is an example that uses hexpat:
{-# LANGUAGE PatternGuards #-}

module Main where

import Text.XML.Expat.SAX
import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts     = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
    | Just ouid <- lookup "OwnerUserId" as = ouid == uid
    | otherwise                            = False
ownedBy _ _ = False
The definition of ownedBy is a little clunky. Maybe a view pattern instead:
{-# LANGUAGE ViewPatterns #-}

module Main where

import Text.XML.Expat.SAX
import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts     = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _                            = False

ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _                       = Nothing
You could try my fast-tagsoup library. It's a simple replacement for tagsoup and parses at speeds of 20-200 MB/sec.
The problem with the tagsoup package is that it works with String internally even if you use the Text or ByteString interface. fast-tagsoup works with strict ByteStrings, using high-performance low-level parsing, while still returning a lazy list of tags as output.
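As a concrete sketch (not taken from the package's documentation): I'm assuming fast-tagsoup exposes a drop-in parseTags over strict ByteStrings in a Text.HTML.TagSoup.Fast module, with the usual tagsoup Tag type, so check those names against the version on Hackage.

module Main where

-- Sketch only: parseTags is assumed to come from fast-tagsoup's
-- Text.HTML.TagSoup.Fast and to take a strict ByteString; Tag, fromAttrib
-- and (~==) are the ordinary tagsoup ones.
import Text.HTML.TagSoup (Tag, fromAttrib, (~==))
import Text.HTML.TagSoup.Fast (parseTags)       -- assumed module name
import qualified Data.ByteString.Char8 as B

userid :: String
userid = "83805"

main :: IO ()
main = do
    posts <- B.readFile "posts.xml"              -- strict ByteString input
    print $ head
          $ map (fromAttrib (B.pack "Id"))
          $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
          $ parseTags posts

Because the tag list is produced lazily, the head/filter pipeline should still stop at the first matching row.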
TagSoup supports ByteString via its Text.StringLike class. The only changes needed to your example were to call ByteString.Lazy's readFile and to add a fromString around the fromAttrib argument:
import Control.Monad
import Text.HTML.TagSoup
import Text.StringLike
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Char8 as BSC

userid = "83805"
file = "blah//posts.xml"

main = do
    posts <- liftM parseTags (BSL.readFile file)
    print $ head $ map (fromAttrib (fromString "Id")) $
            filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts
Your example ran for me (4 gig RAM), taking 6 minutes; the ByteString version took 10 minutes.
I had a similar problem with HXT and avoided the memory issue by using the Expat parser with HXT. On a 5 MB XML file, just reading the document and printing it, peak memory consumption went from 2 GB to about 180 MB, and the execution time was much shorter (I didn't measure it).
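The setup was roughly along these lines (a sketch of the idea, not the original code): it assumes HXT 9's SysConfig-style readDocument together with a withExpat option from the hxt-expat package's Text.XML.HXT.Expat module, so verify those names, and the Bool flag, against the versions you have installed.

module Main where

-- Sketch: swap HXT's native parser for expat via hxt-expat.
-- withExpat (and its Bool argument) is an assumption about hxt-expat's API;
-- everything else is the standard Text.XML.HXT.Core interface.
import Text.XML.HXT.Core
import Text.XML.HXT.Expat (withExpat)

userid :: String
userid = "83805"

main :: IO ()
main = do
    ids <- runX $ readDocument [ withExpat True      -- assumed: parse with expat instead of the native parser
                               , withRemoveWS yes
                               , withWarnings no
                               ] "posts.xml"
              >>> deep (isElem >>> hasName "row"
                               >>> hasAttrValue "OwnerUserId" (== userid))
              >>> getAttrValue "Id"
    print (take 1 ids)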
Perhaps you need a lazy XML parser: your usage looks like a pretty straightforward scan through the input. HaXml has a lazy parser, although you must ask for it explicitly by importing the correct module.
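A minimal sketch of that approach, assuming HaXml's Text.XML.HaXml.ParseLazy module (the lazily-yielding counterpart of Text.XML.HaXml.Parse) and the standard content-filter combinators; iffind and verbatim in particular are assumptions about the API, so check them against your HaXml version.

module Main where

-- Hypothetical sketch: lazily parse posts.xml and scan for the first <row>
-- owned by a given user. ParseLazy, iffind and verbatim are assumptions
-- about HaXml's API; verify them against the installed version.
import Text.XML.HaXml.Types
import Text.XML.HaXml.Combinators
import Text.XML.HaXml.ParseLazy (xmlParse)   -- lazy counterpart of Text.XML.HaXml.Parse
import Text.XML.HaXml.Posn (noPos)
import Text.XML.HaXml.Verbatim (verbatim)

userid :: String
userid = "83805"

main :: IO ()
main = do
    s <- readFile "posts.xml"
    -- xmlParse takes a name (used in error messages) and the document text;
    -- the ParseLazy variant builds the Document incrementally.
    let Document _ _ root _ = xmlParse "posts.xml" s
        -- keep a <row> only when its OwnerUserId attribute equals userid
        byUser = iffind "OwnerUserId" (\v -> if v == userid then keep else none) none
        rows   = deep (byUser `o` tag "row") (CElem root noPos)
    putStrLn $ case rows of
        []      -> "<not present>"
        (r : _) -> verbatim r

Because both the parse and the scan are lazy, only as much of the file as is needed to reach the first match should be forced.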