I’m trying to parse some XML files with Haskell. I’m using HXT for this job to learn how arrows are used in real-world applications, so I’m quite new to the topic.
In XPath (and HaXml) it’s possible to select a node by position, let’s say: /root/a[2]/b
I can’t figure out how to do something like that with HXT, even after reading the documentation again and again.
Here is some sample code I’m working with:
module Main where

import Text.XML.HXT.Core

testXml :: String
testXml = unlines
    [ "<?xml version=\"1.0\"?>"
    , "<root>"
    , " <a>"
    , " <b>first element</b>"
    , " <b>second element</b>"
    , " </a>"
    , " <a>"
    , " <b>third element</b>"
    , " </a>"
    , " <a>"
    , " <b>fourth element</b>"
    , " <b>enough...</b>"
    , " </a>"
    , "</root>"
    ]

selector :: ArrowXml a => a XmlTree String
selector = getChildren /> isElem >>> hasName "a" -- how to select second <a>?
                       /> isElem >>> hasName "b"
                       /> getText

main :: IO ()
main = do
    let doc = readString [] testXml
    nodes <- runX $ doc >>> selector
    mapM_ putStrLn nodes
The desired output would be:
third element
Thanks in advance!
A solution that I believe selects "/root/a[2]/b" (all <b> elements inside the second <a> element):
selector :: ArrowXml a => Int -> a XmlTree String
selector nth =
    (getChildren /> isElem >>> hasName "a") -- the parentheses are required!
    >. (!! nth)
    /> isElem >>> hasName "b" /> getText
(With nth = 1, because (!!) is zero-based, the result is ["third element"].)
Explanation: ArrowXml is declared as class (..., ArrowList a, ...) => ArrowXml a, so ArrowXml is a subclass of ArrowList. Looking through the ArrowList interface:
(>>.) :: a b c -> ([c] -> [d]) -> a b d
(>.) :: a b c -> ([c] -> d) -> a b d
So >>. can select a subset of the results using a lifted function of type [c] -> [d], and >. can select a single item from them using a lifted function of type [c] -> d. So, after the children are selected and filtered down to the <a> elements, we can use (!! nth) :: [a] -> a.
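To make the difference between >>. and >. concrete, here is a minimal sketch that models a list arrow as a plain function returning a list. The ListArrow type, the toy children arrow, and the local operator definitions are assumptions for illustration only, not HXT's actual types; they only mirror the two signatures quoted above.

```haskell
module Main where

-- Hypothetical stand-in for a list arrow: a function that returns
-- all of its results as a list. This is NOT HXT's real arrow type.
type ListArrow b c = b -> [c]

infixl 8 >>., >.

-- Post-process the whole result list; the result is again a list.
(>>.) :: ListArrow b c -> ([c] -> [d]) -> ListArrow b d
f >>. g = g . f

-- Collapse the whole result list into exactly one result.
(>.) :: ListArrow b c -> ([c] -> d) -> ListArrow b d
f >. g = \x -> [g (f x)]

-- Toy arrow: the "children" of a string are its words.
children :: ListArrow String String
children = words

main :: IO ()
main = do
  print ((children >>. take 2) "a b c")  -- keeps a sub-list of the results
  print ((children >. (!! 1)) "a b c")   -- picks one result (zero-based index)
  print ((children >. length) "a b c")   -- any [c] -> d function works
```

In this model, >>. keeps working on a list of results, while >. always produces exactly one result, which is why a partial function such as (!! 1) becomes a problem when there are too few results.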
There's an important thing to note about operator precedence:
infixr 1 >>>
infixl 5 />
infixl 8 >.
Since >. binds tighter than both /> and >>>, without parentheses it applies only to hasName "a" (I had a hard time figuring out why it did not work as expected). Thus, getChildren /> isElem >>> hasName "a" must be wrapped in parentheses.
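The grouping can be checked without HXT by using toy operators that only render how an expression is parenthesized. This is a sketch: the three operators below work on strings, and their fixities (infixr 1, infixl 5, infixl 8) are what I believe Control.Arrow and HXT declare, so treat them as an assumption.

```haskell
module Main where

-- Toy operators that render the grouping of an expression as text.
-- The fixities are assumed to match Control.Arrow (>>>) and HXT (/>, >.).
infixr 1 >>>
infixl 5 />
infixl 8 >.

(>>>), (/>), (>.) :: String -> String -> String
a >>> b = paren a " >>> " b
a />  b = paren a " /> " b
a >.  b = paren a " >. " b

-- Wrap one operator application in parentheses.
paren :: String -> String -> String -> String
paren a op b = "(" ++ a ++ op ++ b ++ ")"

main :: IO ()
main = do
  -- Without parentheses, >. only grabs its nearest neighbour.
  putStrLn ("getChildren" /> "isElem" >>> "hasName a" >. "(!! nth)")
  -- With parentheses, >. is applied to the whole filtered arrow.
  putStrLn (("getChildren" /> "isElem" >>> "hasName a") >. "(!! nth)")
```

The first line shows >. attached to hasName a alone, the second shows it attached to the complete selection.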
This is just an extension of the answer by EarlGray; see there for the explanation of >>. and >.! After asking the question, I realized that I need to walk through the tree in a specific, deterministic way, so this is the solution I’m using for my particular problem. In case someone else tries to accomplish the same thing, I wanted to share the example code.
Let’s say we want to extract the text of the second <b> in the first <a>. Not all <a> elements have at least two <b>s, so the (!!) approach from EarlGray’s answer would bail out: (!!) throws an exception when the index is past the end of the list (for example, on an empty list).
Have a look at the function single in Control.Arrow.ArrowList, which keeps only the first result of a list arrow:
single :: ArrowList a => a b c -> a b c
single f = f >>. take 1
We want to extract the n-th element instead (counting from 1):
junction :: ArrowList a => a b c -> Int -> a b c
junction a nth = a >>. (take 1 . drop (nth - 1))
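The reason this is safer than (!!) can be seen on plain lists. The sketch below isolates the list function that junction hands to >>.; the name nthOnly is hypothetical and only used for this illustration.

```haskell
module Main where

-- The list function junction passes to >>. (1-based position).
-- nthOnly is a hypothetical name used only in this sketch.
nthOnly :: Int -> [a] -> [a]
nthOnly n = take 1 . drop (n - 1)

main :: IO ()
main = do
  let bs = ["first element", "second element"]
  print (nthOnly 2 bs)                -- the second item, still wrapped in a list
  print (nthOnly 3 bs)                -- no third item: empty list, no exception
  print (nthOnly 1 ([] :: [String]))  -- empty input: empty list as well
```

Because take and drop are total, a missing position simply yields no result, which in arrow terms means the selector produces nothing for that branch instead of crashing.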
Now we can use this new function to build up the selector. It’s necessary to put parentheses around the arrow we’re going to filter with junction, because junction modifies an existing arrow.
selector :: ArrowXml a => a XmlTree String
selector = getChildren -- There is only one root element.
    -- For each selected element: Get a list of all children and filter them out.
    -- The junction function now selects at most one element.
    >>> (getChildren >>> isElem >>> hasName "a") `junction` 1 -- selects first <a>
    -- The same thing to select the second <b> for all the <a>s
    -- (But we had selected only one <a> in this case!
    -- Imagine commenting out the `junction` 1 above.)
    >>> (getChildren >>> isElem >>> hasName "b") `junction` 2 -- selects second <b>
    -- Now get the text of the element.
    >>> getChildren >>> getText
To extract the value and return a Maybe value (listToMaybe comes from Data.Maybe, which must be imported):
main :: IO ()
main = do
    let doc = readString [] testXml
    text <- listToMaybe <$> (runX $ doc >>> selector)
    print text
This outputs Just "second element" with the example XML file.
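For experimenting with positions, a pure variant may be handy. The following is a sketch that assumes the hxt package is installed and that runLA and xread are exported by Text.XML.HXT.Core, as I believe they are; bOfNthA and sampleXml are hypothetical names, and the XML is a shortened version of the sample without whitespace between tags.

```haskell
module Main where

import Text.XML.HXT.Core

-- 1-based positional filter, as defined above.
junction :: ArrowList a => a b c -> Int -> a b c
junction a nth = a >>. (take 1 . drop (nth - 1))

-- Rough equivalent of /root/a[n]/b when applied to the <root> element.
-- bOfNthA is a hypothetical helper for this sketch.
bOfNthA :: ArrowXml a => Int -> a XmlTree String
bOfNthA n =
    ((getChildren >>> isElem >>> hasName "a") `junction` n)
    >>> getChildren >>> isElem >>> hasName "b"
    >>> getChildren >>> getText

-- Shortened sample document (no XML declaration, no whitespace nodes).
sampleXml :: String
sampleXml =
    "<root><a><b>first element</b><b>second element</b></a>"
    ++ "<a><b>third element</b></a></root>"

main :: IO ()
main = do
  -- xread parses the string into its top-level nodes, here just <root>.
  print (runLA (xread >>> bOfNthA 2) sampleXml)
  -- A position that does not exist gives no results instead of an error.
  print (runLA (xread >>> bOfNthA 9) sampleXml)
```

If the arrows behave as described in the answers, the first call yields the text of the <b> inside the second <a>, and the second call yields an empty list.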