I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:
<html>
<body>
<table>
<tr>
<td>
<font >
<a href="a">ABC</a></font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-1-2013</font>
<b><font> </font></b>
<font>Where:</font><font>Here</font>
<font>Who:</font><font>Me</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="b">EFG</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-2-2013</font>
<b><font> </font></b>
<font>Where:</font><font>There</font>
<font>Who:</font><font>You</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="c">HIJ</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-3-2013</font><b>
<font> </font></b>
<font>Where:</font><font>Far away</font>
<font>Who:</font><font>Him</font>
</td>
</tr>
</table>
</body>
</html>
To this, after several iterations, I arrived at this code to achieve what I want:
import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup
group2 [] = []
group2 (x0:x1:xs) = [x0,x1]:(group2 xs)
countRows html = html >>> deep (hasName "tr") >. length
parsePage sz html = let
n x = deep (hasName "tr") >. (( -> a !! x) . group2 ) >>> unlistA
m = deep (hasName "td") >>> css "a" /> getText
o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
p x = (((n x) >>> m) &&& ((n x) >>> o))
in html >>> catA [p x | x <- [0..sz]]
main = do
dt <- readFile "test.html"
let html = parseHtml dt
count <- (runX . countRows) html
let cnt = ((head count) `div` 2) - 1
prcssd <- (runX . (parsePage cnt)) html
print prcssd
And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
However, I don't think this is a very good aproach, having to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.
The question at extract multiples html tables with hxt, while useful, presents a simpler situation, I believe.
I did some html parsing with hxt a few weeks ago and thought, that xpath comes in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.
The result looks like this:
Like I said, not quite perfect, but hopefully a start.
EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first choose all elements that fit and filter the results. I think it's quite fascinating that there's no generic approach for such a problem. Because, somehow, the font[4]-selection does not work with my other approach - but maybe I'm just not a good xpath user.
This way, you get the result you showed in your question:
Edit2:
Fun fact: it seems like the hxt-xpath library does not work quite right for such an index-selection. An online XPath-evaluator shows the right behaviour for
//td/font[4]/text()
.Here's a somewhat simpler implementation.
The
>>.
operator can be used instead of>.
so that you don't need to callunlistA
afterwards.I changed the
group2
function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.The type of
trPairs
isi.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up
<tr>
nodes). Now we can use the***
operator fromControl.Arrow
to apply a transformation to either element of the pair,insideLink
for the first one andinsideFont
for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.