I'm trying to figure my way through HXT with XPath and arrows at the same time and I'm completely stuck on how to think through this problem. I've got the following HTML:
<div>
<div class="c1">a</div>
<div class="c2">b</div>
<div class="c3">123</div>
<div class="c4">234</div>
</div>
which I've extracted into an HXT XmlTree. What I'd like to do is define a function (I think?):
getValues :: [String] -> IOSArrow Xmltree [(String, String)]
Which, if used as getValues ["c1", "c2", "c3", "c4"]
, will get me:
[("c1", "a"), ("c2", "b"), ("c3", "123"), ("c4", "234")]
Help please?
Here's one approach (my types are a bit more general and I'm not using XPath):
{-# LANGUAGE Arrows #-}
module Main where
import qualified Data.Map as M
import Text.XML.HXT.Arrow
classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> divs >>> pairs) >>> arr M.fromList
where
divs = getChildren >>> hasName "div"
pairs = proc div -> do
cls <- getAttrValue "class" -< div
val <- deep getText -< div
returnA -< (cls, val)
getValues :: (ArrowXml a) => [String] -> a XmlTree [(String, Maybe String)]
getValues cs = classes >>> arr (zip cs . lookupValues cs)
where lookupValues cs m = map (flip M.lookup m) cs
main = do
let xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
\<div class='c3'>123</div><div class='c4'>234</div></div>"
print =<< runX (readString [] xml >>> getValues ["c1", "c2", "c3", "c4"])
I would probably run an arrow to get the map and then do the lookups, but this way works as well.
To answer your question about listA
: divs >>> divs >>> pairs
is a list arrow with type a XmlTree (String, String)
—i.e., it's a non-deterministic computation that takes an XML tree and returns string pairs.
arr M.fromList
has type a [(String, String)] (M.Map String String)
. This means we can't just compose it with divs >>> divs >>> pairs
, since the types don't match up.
listA
solves this problem: it collapses divs >>> divs >>> pairs
into a deterministic version with type a XmlTree [(String, String)]
, which is exactly what we need.
Here is a way to do it using HandsomeSoup:
-- For the join function.
import Data.String.Utils
import Text.HandsomeSoup
import Text.XML.HXT.Core
-- Of each element, get class attribute and text.
getItem = (this ! "class" &&& (this /> getText))
getItems selectors = css (join "," selectors) >>> getItem
main = do
let selectors = [".c1", ".c2", ".c3", ".c4"]
items <- runX (readDocument [] "data.html" >>> getItems selectors)
print items
data.html is the HTML file.