I have a web page that I've loaded with load/markup. I need to parse a bunch of stuff out of it, but some of the data is in the tags. Any ideas of how I can parse it? Here's a sample of what I've got (and tried) so far:
REBOL []
mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bsockaway@example.com>bsockaway@example.com</a></td>
<td>9999</td>
}
rules: [
some [
; The expression below /will/ work, but is useless because of specificity.
; <td> <a href=mailto:bsockaway@example.com> s: string! </a> (print s/1) </td> |
; The expression below will not work, because <a> doesn't match <a href=mailto:...>
; <td> <a> s: string! </a> (print s/1) </td> |
<td> s: string! (print s/1) </td> |
tag! | string! ; Catch any leftovers.
]
]
parse load/markup mess rules
This produces:
Bob Sockaway
9999
I would like to see something more like:
Bob Sockaway
bsockaway@example.com
9999
Any thoughts? Thanks!
Note! For what it's worth, I came up with a good simple ruleset that will get the desired results:
rules: [
some [
<td> any [tag!] s: string! (print s/1) any [tag!] </td> |
tag! | string! ; Catch any leftovers.
]
]
When mess is processed with LOAD/MARKUP, you get this (I've formatted it and commented it with the types):
[
; string!
"^/"
; tag! string! tag!
<td> "Bob Sockaway" </td>
; string!
"^/"
; tag! tag!
; string!
; tag! tag!
<td> <a href=mailto:bsockaway@example.com>
"bsockaway@example.com"
</a> </td>
; (Note: you didn't put the anchor's href in quotes above...)
; string!
"^/"
; tag! string! tag!
<td> "9999" </td>
; string!
"^/"
]
Your printing rule matches series of the form [<td> string! </td>] but not series of the form [<td> tag! string! tag! </td>]. Sidestepping the question posed in your title, you could solve this particular dilemma several ways. One would be to keep a count of how many TD tags you are currently inside, and print any string found while that count is non-zero:
rules: [
(td-count: 0)
some [
; if we see an open TD tag, increment a counter
<td> (++ td-count)
|
; if we see a close TD tag, decrement a counter
</td> (-- td-count)
|
; capture parse position in s if we find a string
; and if counter is > 0 then print the first element at
; the parse position (e.g. the string we just found)
s: string! (if td-count > 0 [print s/1])
|
; if we find any non-TD tags, match them so the
; parser will continue along but don't run any code
tag!
]
]
This produces the output you asked for:
Bob Sockaway
bsockaway@example.com
9999
But you also wanted to know, essentially, whether you can transition from block parsing into string parsing within the same set of rules (without jumping into open code). I looked into it, and "mixed parsing" looks like it may be a feature addressed in Rebol 3. Still, I couldn't get it to work in practice, so I asked a question of my own:
How to mix together string parsing and block parsing in the same rule?
I think I found a pretty good solution. It may have to be generalized if you had lots of different tags whose attributes you need.
I was looking for the id attribute of the query tag!:
<query id="5">
In the parse rule for tag!, I did this:
| set t tag! (
p: make block! t
if p/1 = 'query [_qid: to-integer p/3]
)
With more tags to look at, I'd use case. It might also be better to set _qid with:
to-integer select p 'id=
I ended up needing to parse another tag, and this is a nice general pattern:
switch p/1 [
field [_fid: to-integer p/id= _field_type: p/field_type=]
query [_qid: to-integer p/id=]
]
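To show how the pieces above might fit together, here is a minimal, untested sketch. It assumes Rebol 2 behaviour, where make block! on a tag loads its contents into a block such as [query id= "5"], as described earlier. The sample input in doc and the printed labels are made up for illustration, and select is used instead of path access so that a missing attribute simply yields none.

```rebol
REBOL []

; Hypothetical sample input; tag and attribute names follow the snippets above.
doc: load/markup {<query id="5"><field id="7" field_type="text">}

foreach item doc [
    if tag? item [
        ; Assumed Rebol 2 behaviour: the tag contents are loaded as a block,
        ; e.g. <query id="5"> becomes [query id= "5"]
        p: make block! item
        switch p/1 [
            query [print ["query id:" to-integer select p 'id=]]
            field [
                print [
                    "field id:" to-integer select p 'id=
                    "type:" select p 'field_type=
                ]
            ]
        ]
    ]
]
```

The same switch body could sit inside the parenthesized code of a `set t tag! (...)` parse rule instead of a foreach loop.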