I have a web page that I've loaded with load/markup. I need to parse a bunch of stuff out of it, but some of the data is in the tags. Any ideas of how I can parse it? Here's a sample of what I've got (and tried) so far:
REBOL []
mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bsockaway@example.com>bsockaway@example.com</a></td>
<td>9999</td>
}
rules: [
    some [
        ; The expression below /will/ work, but is useless because of specificity.
        ; <td> <a href=mailto:bsockaway@example.com> s: string! </a> (print s/1) </td> |
        ; The expression below will not work, because <a> doesn't match <a href=mailto:...>
        ; <td> <a> s: string! </a> (print s/1) </td> |
        <td> s: string! (print s/1) </td> |
        tag! | string! ; Catch any leftovers.
    ]
]
parse load/markup mess rules
This produces:
Bob Sockaway
9999
I would like to see something more like:
Bob Sockaway
bsockaway@example.com
9999
Any thoughts? Thanks!
Note: for what it's worth, I came up with a good simple ruleset that will get the desired results:
rules: [
    some [
        <td> any [tag!] s: string! (print s/1) any [tag!] </td> |
        tag! | string! ; Catch any leftovers.
    ]
]
When mess is processed with LOAD/MARKUP, you get a block of tag! and string! values. Your pattern matches series of the form
[<td> string! </td>]
but not things of the form
[<td> tag! string! tag! </td>]
Sidestepping the question posed in your title, you could solve this particular dilemma several ways. One might be to keep a count of how many TD tags you are inside and print any strings encountered while the count is non-zero. That produces the output you asked for.
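The counting idea can be sketched roughly like this. It is only a sketch, assuming Rebol 2 (where LOAD/MARKUP is available); the names td-depth and found are made up for illustration and are not from the original answer. Strings are collected into a block first and printed afterwards:

```rebol
REBOL []

mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bsockaway@example.com>bsockaway@example.com</a></td>
<td>9999</td>
}

td-depth: 0     ; hypothetical counter: how many <td> elements we are inside
found: copy []  ; hypothetical accumulator for strings seen inside a <td>

rules: [
    some [
        <td> (td-depth: td-depth + 1)
        | </td> (td-depth: td-depth - 1)
        | tag!  ; any other tag, such as <a ...> or </a>, is skipped
        | set s string! (if td-depth > 0 [append found s])
    ]
]

parse load/markup mess rules
foreach item found [print item]
```

Because every other tag is simply skipped, the rule does not care how many tags sit between the opening TD and the text. The newline strings between rows are ignored because the count is zero there.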
But you also wanted to know, essentially, whether you can transition into string parsing from block parsing in the same set of rules (without jumping into open code). I looked into it, and "mixed parsing" looks like it may be a feature addressed in Rebol 3. Still, I couldn't get it to work in practice, so I asked a question of my own:
How to mix together string parsing and block parsing in the same rule?
I think I found a pretty good solution. It may have to be generalized if you had lots of different tags whose attributes you need.
I was looking for the id attribute of the query tag!:
In the parse rule for tag!, I did this:
If there were more tags to look at, I'd use case. And maybe it would be better to set _qid.
I ended up needing to parse another tag, and this is a nice general pattern.
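The code from this answer is not shown here, so the following is only a guess at the general shape of the pattern, assuming Rebol 2: match a tag! in the block rule, then run a second, string-level parse on the tag itself inside a paren. The input, the query-attrs rule and the ids block are hypothetical; only _qid is a name taken from the text above.

```rebol
REBOL []

; Hypothetical input: the original poster's document is not shown.
doc: load/markup {<query id="q42">select something</query>}

_qid: none
ids: copy []    ; hypothetical accumulator for every id found

; String rule applied to the text inside a tag! value.
; It only succeeds for tags that start with "query" and carry id="...".
query-attrs: [
    "query" thru {id="} copy _qid to {"} to end
]

rules: [
    some [
        set t tag! (if parse/all t query-attrs [append ids _qid])
        | string!
    ]
]

parse doc rules
print _qid
```

This still drops into open code for the inner parse, so it is a workaround rather than true mixed parsing. To handle more tags, the paren could dispatch to a different string rule per tag name.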