HXT: Select a node by position with HXT in Haskell

Published 2019-06-24 14:21

Question:

I’m trying to parse some XML files with Haskell. I’m using HXT for this, partly to learn how arrows are used in real-world applications, so I’m still quite new to arrows.

In XPath (and HaXml) it’s possible to select a node by position, let’s say: /root/a[2]/b

I can’t figure out how to do something like that with HXT, even after reading the documentation again and again.

Here is some sample code I’m working with:

module Main where

import Text.XML.HXT.Core

testXml :: String
testXml = unlines
    [ "<?xml version=\"1.0\"?>"
    , "<root>"
    , "    <a>"
    , "        <b>first element</b>"
    , "        <b>second element</b>"
    , "    </a>"
    , "    <a>"
    , "        <b>third element</b>"
    , "    </a>"
    , "    <a>"
    , "        <b>fourth element</b>"
    , "        <b>enough...</b>"
    , "    </a>"
    , "</root>"
    ]

selector :: ArrowXml a => a XmlTree String
selector = getChildren /> isElem >>> hasName "a" -- how to select second <a>?
                       /> isElem >>> hasName "b"
                       /> getText

main :: IO ()
main = do
    let doc = readString [] testXml
    nodes <- runX $ doc >>> selector
    mapM_ putStrLn nodes

The desired output would be:

third element

Thanks in advance!

Answer 1:

I believe the following selects /root/a[2]/b (all <b> elements inside the second <a> element):

selector :: ArrowXml a => Int -> a XmlTree String
selector nth =
    (getChildren /> isElem >>> hasName "a")   -- the parentheses required!
    >. (!! nth)                               -- (!!) is zero-based: nth = 1 is the second <a>
    /> isElem >>> hasName "b" /> getText

(With nth = 1, because (!!) counts from zero, the result is ["third element"].)

Explanation: ArrowXml is declared as class (..., ArrowList a, ...) => ArrowXml a, so ArrowXml is a subclass of ArrowList. The ArrowList interface includes:

(>>.) :: a b c -> ([c] -> [d]) -> a b d
(>.) :: a b c -> ([c] -> d) -> a b d

So >>. transforms the whole result list of an arrow with a lifted function of type [c] -> [d] (for example, to select a subset), while >. collapses the result list to a single value with a lifted function of type [c] -> d. After the children are selected and filtered down to the <a> elements, we can therefore pick one of them with (!! nth) :: [a] -> a.
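
As a small illustration of the two combinators, here is a sketch that runs them on a pure list arrow instead of an XML document. It assumes that Text.XML.HXT.Core re-exports LA, runLA and constL (as far as I know it does in the hxt 9 releases); the names three, lastTwo and total are made up for this example.

```haskell
module Main where

import Text.XML.HXT.Core

-- A list arrow that ignores its input and yields three results.
three :: LA () Int
three = constL [10, 20, 30]

-- (>>.) maps the whole result list to another result list.
lastTwo :: [Int]
lastTwo = runLA (three >>. drop 1) ()

-- (>.) collapses the whole result list into exactly one result.
total :: [Int]
total = runLA (three >. sum) ()

main :: IO ()
main = do
    print lastTwo   -- [20,30]
    print total     -- [60]
```

Note that >. always produces exactly one result, even when the function it lifts is partial, which is why (!! nth) can fail at run time.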

There's an important thing to note:

infixr 1 >>>
infixl 5 />
infixl 8 >.

(I had a hard time figuring out why >. did not work as expected without the parentheses.) Because >. binds more tightly than /> and >>>, an unparenthesized >. (!! nth) would apply only to hasName "a". Thus getChildren /> isElem >>> hasName "a" must be wrapped in parentheses.
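
To make the effect of the parentheses visible, the following sketch runs both spellings with nth = 0 on a trimmed-down copy of the question's document. The names sampleXml, withParens, withoutParens and runOnSample are invented for this example, and readString is called with withValidate no on the assumption that the document has no DTD.

```haskell
module Main where

import Text.XML.HXT.Core

-- Trimmed-down version of the question's document (no whitespace nodes).
sampleXml :: String
sampleXml = concat
    [ "<root>"
    , "<a><b>first element</b><b>second element</b></a>"
    , "<a><b>third element</b></a>"
    , "<a><b>fourth element</b><b>enough...</b></a>"
    , "</root>"
    ]

-- Parenthesized: (!! nth) sees the list of all <a> elements.
withParens :: ArrowXml a => Int -> a XmlTree String
withParens nth =
    (getChildren /> isElem >>> hasName "a") >. (!! nth)
    /> isElem >>> hasName "b" /> getText

-- Unparenthesized: (>.) attaches to hasName "a" only, so (!! nth)
-- sees a separate one-element list for each <a>.
withoutParens :: ArrowXml a => Int -> a XmlTree String
withoutParens nth =
    getChildren /> isElem >>> hasName "a" >. (!! nth)
    /> isElem >>> hasName "b" /> getText

-- Parses sampleXml and runs the given selector on it.
runOnSample :: IOSArrow XmlTree String -> IO [String]
runOnSample sel = runX (readString [withValidate no] sampleXml >>> sel)

main :: IO ()
main = do
    runOnSample (withParens 0)    >>= print  -- texts of the first <a> only
    runOnSample (withoutParens 0) >>= print  -- texts of every <a>
```

With nth = 0 the unparenthesized version happens to run, but it returns the <b> texts of all three <a> elements. With nth = 1 it would index past the end of each one-element list and raise an error.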



Answer 2:

This is just an extension of EarlGray's answer; see the explanation of >>. and >. there. After asking the question I realized that I need to walk the tree in a specific, deterministic way, so this is the solution I’m using for my particular problem. I’m sharing the example code in case someone else is trying to accomplish the same thing.

Let’s say we want to extract the text of the second <b> inside the first <a>. Not every <a> element has at least two <b> children, so EarlGray's code would fail: (!!) throws an exception when the index is out of range (for example, on an empty list).

Have a look at the function single in Control.Arrow.ArrowList, which keeps only the first result of a list arrow:

single :: ArrowList a => a b c -> a b c
single f = f >>. take 1

We want the n-th result instead (counting from 1), so we generalize it:

junction :: ArrowList a => a b c -> Int -> a b c
junction a nth = a >>. (take 1 . drop (nth - 1))
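
The list function that junction lifts can be tried on its own. This is a minimal sketch using only the Prelude; nthOrNone is a hypothetical name for the take 1 . drop (nth - 1) part.

```haskell
module Main where

-- The list function that junction lifts with (>>.):
-- the nth element (counting from 1) as a singleton list, or [] if there is none.
nthOrNone :: Int -> [c] -> [c]
nthOrNone nth = take 1 . drop (nth - 1)

main :: IO ()
main = do
    print (nthOrNone 2 "abc")   -- "b"
    print (nthOrNone 5 "abc")   -- ""  (index out of range)
    print (nthOrNone 1 "")      -- ""  (empty input)
```

Where (!!) would raise an exception in the last two cases, this version returns an empty list, which a list arrow treats as "no result". One quirk: drop treats non-positive counts as zero, so nth values of 0 or below behave like 1.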

Now we can use this new arrow to build the selector. The arrow that junction should filter has to be wrapped in parentheses, because junction modifies an existing arrow: it must receive the whole child-selecting arrow as its argument, not just its last step.

selector :: ArrowXml a => a XmlTree String
selector = getChildren -- There is only one root element.
         -- For each selected element: get all its children and filter them.
         -- junction then keeps at most one of the results.
         >>> (getChildren >>> isElem >>> hasName "a") `junction` 1 -- selects first <a>
         -- The same again to select the second <b> of each selected <a>.
         -- (Only one <a> is selected here; without the `junction` 1 above,
         -- this step would yield the second <b> of every <a> that has one.)
         >>> (getChildren >>> isElem >>> hasName "b") `junction` 2 -- selects second <b>
         -- Now get the text of the element.
         >>> getChildren >>> getText

To extract the value as a Maybe (this needs import Data.Maybe (listToMaybe)):

main :: IO ()
main = do
    let doc = readString [] testXml
    text <- listToMaybe <$> (runX $ doc >>> selector)
    print text

With the example XML above, this prints Just "second element".
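
The comment in the selector suggests imagining the first junction removed. Here is a sketch of that variant as a complete program. The names sampleXml and secondBOfEachA are invented for this example, the document is a whitespace-free copy of the question's XML, and withValidate no is passed on the assumption that no DTD is available.

```haskell
module Main where

import Text.XML.HXT.Core

-- junction as defined in this answer: keep only the nth result (1-based).
junction :: ArrowList a => a b c -> Int -> a b c
junction a nth = a >>. (take 1 . drop (nth - 1))

sampleXml :: String
sampleXml = concat
    [ "<root>"
    , "<a><b>first element</b><b>second element</b></a>"
    , "<a><b>third element</b></a>"
    , "<a><b>fourth element</b><b>enough...</b></a>"
    , "</root>"
    ]

-- Second <b> of every <a>: the `junction` 1 on the <a> step is left out,
-- so the <b> step runs once per <a> and skips those with fewer than two <b>s.
secondBOfEachA :: ArrowXml a => a XmlTree String
secondBOfEachA = getChildren
         >>> (getChildren >>> isElem >>> hasName "a")
         >>> (getChildren >>> isElem >>> hasName "b") `junction` 2
         >>> getChildren >>> getText

main :: IO ()
main = do
    texts <- runX (readString [withValidate no] sampleXml >>> secondBOfEachA)
    print texts   -- ["second element","enough..."]
```

The middle <a> has only one <b>, so it contributes nothing instead of causing an error.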