I've been exploring the Stack Overflow data dumps, and so far I've been taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing.
TagSoup
import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
    posts <- liftM parseTags (readFile "posts.xml")
    print $ head $ map (fromAttrib "Id") $
            filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts
hxt
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
    ids <- runX $ readDoc "posts.xml" >>> posts
    print $ head ids
  where
    readDoc = readDocument [ (a_tagsoup,           v_1)
                           , (a_parse_xml,         v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings,    v_0)
                           , (a_trace,             v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row[@OwnerUserId='" ++ userid ++ "']"
xml
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
    [posts,votes] <- forM ["posts", "votes"] $
        liftM parseXML . readFile . (++ ".xml")
    let ps = elemNamed "posts" posts
    putStrLn $ maybe "<not present>" show
             $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((== name) . qName . elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (== id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
Below is an example that uses hexpat:
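A rough, untested sketch of what such a scan can look like, assuming hexpat's lazy Text.XML.Expat.Tree API (parse, defaultParseOptions, getAttribute); the ownedBy helper name is only illustrative:

import qualified Data.ByteString.Lazy as L
import           Data.Text            (Text, pack)
import           Text.XML.Expat.Tree

userid :: Text
userid = pack "83805"

-- Does this <row> element belong to the user we are looking for?
ownedBy :: Text -> Node Text Text -> Bool
ownedBy uid e = getAttribute e (pack "OwnerUserId") == Just uid

main :: IO ()
main = do
    raw <- L.readFile "posts.xml"
    -- Lazy parse: the tree is produced incrementally as it is consumed,
    -- so the whole dump never has to sit in memory at once.
    let (posts, _err) = parse defaultParseOptions raw
        rows          = [ e | e@(Element _ _ _) <- eChildren posts ]
    print $ getAttribute (head (filter (ownedBy userid) rows)) (pack "Id")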
The definition of ownedBy is a little clunky. Maybe a view pattern instead:
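Again only a sketch (same assumed hexpat API as above), but roughly:

{-# LANGUAGE ViewPatterns #-}

import Data.Text (Text, pack)
import Text.XML.Expat.Tree

-- The attribute lookup moves into the pattern, so the Maybe plumbing disappears.
ownedBy :: Text -> Node Text Text -> Bool
ownedBy uid (flip getAttribute (pack "OwnerUserId") -> Just owner) = owner == uid
ownedBy _   _                                                      = False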
Perhaps you need a lazy XML parser: your usage looks like a pretty straightforward scan through the input. HaXml has a lazy parser, although you must ask for it explicitly by importing the correct module.
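For example (untested; the point is only the import: Text.XML.HaXml.ParseLazy exports the same xmlParse as Text.XML.HaXml.Parse but builds the Document lazily):

import Text.XML.HaXml.Types     (Document(..))
import Text.XML.HaXml.ParseLazy (xmlParse)   -- swap in Text.XML.HaXml.Parse for the strict version

main :: IO ()
main = do
    s <- readFile "posts.xml"
    let Document _ _ root _ = xmlParse "posts.xml" s
    -- hand `root` to whatever query code you already have
    -- (Text.XML.HaXml.Combinators); with the lazy parser, only the part
    -- of the document you actually demand gets built
    root `seq` putStrLn "got the root element"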
I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, as String == [Char], which is an inappropriate representation for very large flat files (see the small sketch below).
That then implies you'll need to use a Haskell XML library that supports bytestrings. The couple-of-dozen xml libraries are here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which support bytestrings, but that's the condition you're looking for.
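The readFile side of that change is trivial, for instance:

import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
    raw <- L.readFile "posts.xml"   -- lazy, chunked I/O instead of a list of Char
    print (L.count '\n' raw)        -- streams through the file chunk by chunk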
You could try my fast-tagsoup library. It's a simple replacement for tagsoup and parses at speeds of 20-200 MB/sec.
The problem with the tagsoup package is that it works with String internally even if you use the Text or ByteString interface. fast-tagsoup works with strict ByteStrings, using high-performance low-level parsing while still returning a lazy list of tags as output.
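The swap against your TagSoup version would look roughly like this (untested, and it assumes fast-tagsoup exposes parseTags from Text.HTML.TagSoup.Fast taking a strict ByteString):

import qualified Data.ByteString.Char8  as B
import           Text.HTML.TagSoup      (fromAttrib, (~==))
import           Text.HTML.TagSoup.Fast (parseTags)   -- strict-ByteString parser

userid :: String
userid = "83805"

main :: IO ()
main = do
    -- read strictly; fast-tagsoup parses strict ByteStrings,
    -- but still returns the tag list lazily
    posts <- parseTags `fmap` B.readFile "posts.xml"
    print $ head
          $ map (fromAttrib (B.pack "Id"))
          $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts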
I had a similar problem (using HXT), and I avoided the memory issue by using the Expat parser with HXT. On a 5 MB XML file, just reading the document and printing it, peak memory consumption went from 2 GB to about 180 MB, and the execution time was much shorter (I didn't measure it).
TagSoup supports ByteString via its Text.StringLike class. The only changes needed to your example were to call ByteString.Lazy's readFile and to add a fromString to the fromAttrib argument:
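Roughly like this (a sketch of the change described, not necessarily the exact code that was timed):

import           Control.Monad              (liftM)
import qualified Data.ByteString.Lazy.Char8 as L
import           Text.HTML.TagSoup          (parseTags, fromAttrib, (~==))
import           Text.StringLike            (fromString)

userid :: String
userid = "83805"

main :: IO ()
main = do
    posts <- liftM parseTags (L.readFile "posts.xml")
    -- fromString turns the attribute name into a lazy ByteString,
    -- matching the Tag type parseTags produced
    print $ head
          $ map (fromAttrib (fromString "Id"))
          $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
            posts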
Your example ran for me (4 GB of RAM) in 6 minutes; the ByteString version took 10 minutes.