I've been exploring the Stack Overflow data dumps, and so far I've been taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing.
TagSoup
import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
    posts <- liftM parseTags (readFile "posts.xml")
    print $ head $ map (fromAttrib "Id") $
            filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts
hxt
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
    ids <- runX $ readDoc "posts.xml" >>> posts
    print $ head ids
  where
    readDoc = readDocument [ (a_tagsoup,           v_1)
                           , (a_parse_xml,         v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings,    v_0)
                           , (a_trace,             v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row[@OwnerUserId='" ++ userid ++ "']"
xml
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
    [posts,votes] <- forM ["posts", "votes"] $
        liftM parseXML . readFile . (++ ".xml")
    let ps = elemNamed "posts" posts
    putStrLn $ maybe "<not present>" show
             $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((== name) . qName . elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (== id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
Below is an example that uses hexpat:
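A rough, untested sketch of what such a scan can look like, assuming hexpat's lazy Text.XML.Expat.Tree API (parse, defaultParseOptions, getAttribute); the ownedBy helper name is only illustrative:

import qualified Data.ByteString.Lazy as L
import           Data.Text            (Text, pack)
import           Text.XML.Expat.Tree

userid :: Text
userid = pack "83805"

-- Does this <row> element belong to the user we are looking for?
ownedBy :: Text -> Node Text Text -> Bool
ownedBy uid e = getAttribute e (pack "OwnerUserId") == Just uid

main :: IO ()
main = do
    raw <- L.readFile "posts.xml"
    -- Lazy parse: the tree is produced incrementally as it is consumed,
    -- so the whole dump never has to sit in memory at once.
    let (posts, _err) = parse defaultParseOptions raw
        rows          = [ e | e@(Element _ _ _) <- eChildren posts ]
    print $ getAttribute (head (filter (ownedBy userid) rows)) (pack "Id")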
The definition of ownedBy is a little clunky. Maybe a view pattern instead:
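Again only a sketch (same assumed hexpat API as above), but roughly:

{-# LANGUAGE ViewPatterns #-}

import Data.Text (Text, pack)
import Text.XML.Expat.Tree

-- The attribute lookup moves into the pattern, so the Maybe plumbing disappears.
ownedBy :: Text -> Node Text Text -> Bool
ownedBy uid (flip getAttribute (pack "OwnerUserId") -> Just owner) = owner == uid
ownedBy _   _                                                      = False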
Perhaps you need a lazy XML parser: your usage looks like a pretty straightforward scan through the input. HaXml has a lazy parser, although you must ask for it explicitly by importing the correct module.
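For example (untested; the point is only the import: Text.XML.HaXml.ParseLazy exports the same xmlParse as Text.XML.HaXml.Parse but builds the Document lazily):

import Text.XML.HaXml.Types     (Document(..))
import Text.XML.HaXml.ParseLazy (xmlParse)   -- swap in Text.XML.HaXml.Parse for the strict version

main :: IO ()
main = do
    s <- readFile "posts.xml"
    let Document _ _ root _ = xmlParse "posts.xml" s
    -- hand `root` to whatever query code you already have
    -- (Text.XML.HaXml.Combinators); with the lazy parser, only the part
    -- of the document you actually demand gets built
    root `seq` putStrLn "got the root element"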
I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, as String == [Char], which is an inappropriate representation for very large flat files (see the small sketch below).
That then implies you'll need to use a Haskell XML library that supports bytestrings. The couple-of-dozen xml libraries are here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which support bytestrings, but that's the condition you're looking for.
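The readFile side of that change is trivial, for instance:

import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
    raw <- L.readFile "posts.xml"   -- lazy, chunked I/O instead of a list of Char
    print (L.count '\n' raw)        -- streams through the file chunk by chunk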
You could try my fast-tagsoup library. It's a simple replacement for tagsoup and parses at speeds of 20-200 MB/sec.
The problem with the tagsoup package is that it works with String internally even if you use the Text or ByteString interface. fast-tagsoup works with strict ByteStrings, using high-performance low-level parsing while still returning a lazy list of tags as output.
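The swap against your TagSoup version would look roughly like this (untested, and it assumes fast-tagsoup exposes parseTags from Text.HTML.TagSoup.Fast taking a strict ByteString):

import qualified Data.ByteString.Char8  as B
import           Text.HTML.TagSoup      (fromAttrib, (~==))
import           Text.HTML.TagSoup.Fast (parseTags)   -- strict-ByteString parser

userid :: String
userid = "83805"

main :: IO ()
main = do
    -- read strictly; fast-tagsoup parses strict ByteStrings,
    -- but still returns the tag list lazily
    posts <- parseTags `fmap` B.readFile "posts.xml"
    print $ head
          $ map (fromAttrib (B.pack "Id"))
          $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts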
I had a similar problem (using HXT), and I avoided the memory issue by using the Expat parser with HXT. On a 5 MB XML file, just reading the document and printing it, peak memory consumption went from 2 GB to about 180 MB, and the execution time was much shorter (I didn't measure it).
TagSoup supports ByteString via its Text.StringLike class. The only changes needed to your example were to call ByteString.Lazy's readFile and to add a fromString to the fromAttrib argument:
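Roughly like this (a sketch of the change described, not necessarily the exact code that was timed):

import           Control.Monad              (liftM)
import qualified Data.ByteString.Lazy.Char8 as L
import           Text.HTML.TagSoup          (parseTags, fromAttrib, (~==))
import           Text.StringLike            (fromString)

userid :: String
userid = "83805"

main :: IO ()
main = do
    posts <- liftM parseTags (L.readFile "posts.xml")
    -- fromString turns the attribute name into a lazy ByteString,
    -- matching the Tag type parseTags produced
    print $ head
          $ map (fromAttrib (fromString "Id"))
          $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts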
Your example ran for me (4 GB of RAM) in 6 minutes; the ByteString version took 10 minutes.