I am trying to get the first row under the column titled "Name". For example, for https://en.wikipedia.org/wiki/List_of_the_heaviest_people I want to return the name "Jon Brower Minnoch". My code so far is as follows, but I think there must be a more general way of getting the name:
(defun find-tag (tag doc)
  (when (listp doc)
    (when (string= (xmls:node-name doc) tag)
      (return-from find-tag doc))
    (loop for child in (xmls:node-children doc)
          for find = (find-tag tag child)
          when find do (return-from find-tag find)))
  nil)

(defun parse-list-website (url)
  (second
   (second
    (second
     (third
      (find-tag "td" (html5-parser:parse-html5 (drakma:http-request url) :dom :xmls)))))))
and then to call the function:
(parse-list-website "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
I am not very good with xmls and don't know how to get a td under a certain column header.
The elements in the document returned by html5-parser:parse-html5 are lists with three parts: the tag name, the attributes, and the children. You could access the parts with the standard list manipulation functions, but xmls also provides the functions node-name, node-attrs and node-children to access the three parts. It's a little bit clearer to use those. Edit: there are also the functions xmlrep-attrib-value, to get the value of an attribute, and xmlrep-tagmatch, to match the tag name. The children are either plain strings or elements in the same format.

So for example, an HTML document with a 2x2 table would look like this:
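As a sketch of that shape, assuming each element is a list (name attributes child...) with the attributes stored as (name value) pairs. The variable *doc*, the cell strings and the "odd"/"even" row classes are made up for illustration, and the three accessors are plain-list stand-ins for the xmls ones so that the snippet runs without loading any library:

```lisp
;; Stand-ins for xmls:node-name, xmls:node-attrs and xmls:node-children,
;; assuming each element is a list (name attributes child...).
(defun node-name (node) (first node))
(defun node-attrs (node) (second node))
(defun node-children (node) (cddr node))

;; Hypothetical document: a 2x2 table whose rows carry the classes
;; "odd" and "even". Children are either strings or nested elements.
(defparameter *doc*
  '("html" ()
    ("head" ()
     ("title" () "Some title"))
    ("body" ()
     ("table" (("class" "some-class"))
      ("tr" (("class" "odd"))
       ("td" () "Some string")
       ("td" () "Another string"))
      ("tr" (("class" "even"))
       ("td" () "Third string")
       ("td" () "Fourth string"))))))

(node-name *doc*)                          ; the tag name of the root
(mapcar #'node-name (node-children *doc*)) ; tag names of its children
```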
In order to traverse the DOM tree, let's define a recursive depth-first search like this (note that the if-let depends on the alexandria library; either import it, or change it to alexandria:if-let):

It's called with a predicate function and a document. The predicate function gets called with two arguments: the element being matched and a list of its ancestors. In order to find the first <td>, you could do this:

Or to find the first <td> in the even row:

Getting the second <td> on the even row would require something like this:

You could define a helper function to find the nth tag:
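A minimal sketch of the search, the three lookups and the nth-tag helper described above. It assumes list-shaped elements (name attributes child...) and that "the even row" is a <tr> with class "even". The names tag-p, attr, even-row-td-p, find-nth-tag and *table* are hypothetical, the accessors are plain-list stand-ins for the xmls ones, and a plain let/when pair is used where if-let would go so that nothing needs to be loaded:

```lisp
;; Stand-ins for the xmls accessors on list-shaped elements.
(defun node-name (node) (first node))
(defun node-attrs (node) (second node))
(defun node-children (node) (cddr node))

(defun find-tag (predicate doc &optional path)
  "Depth-first search. Return the first node for which PREDICATE,
called with the node and the list of its ancestors (nearest first),
returns true."
  (when (funcall predicate doc path)
    (return-from find-tag doc))
  (when (listp doc)
    (let ((path (cons doc path)))
      (dolist (child (node-children doc))
        ;; With alexandria loaded, this LET/WHEN pair could be an IF-LET.
        (let ((found (find-tag predicate child path)))
          (when found
            (return-from find-tag found)))))))

(defun tag-p (name node)
  "True when NODE is an element (not a string) named NAME."
  (and (consp node) (equal (node-name node) name)))

(defun attr (name node)
  "Value of attribute NAME on NODE, or NIL if it is absent."
  (second (assoc name (node-attrs node) :test #'equal)))

;; Hypothetical sample table with an odd and an even row.
(defparameter *table*
  '("table" ()
    ("tr" (("class" "odd"))  ("td" () "a") ("td" () "b"))
    ("tr" (("class" "even")) ("td" () "c") ("td" () "d"))))

;; First <td> anywhere in the tree: the cell containing "a".
(find-tag (lambda (node path)
            (declare (ignore path))
            (tag-p "td" node))
          *table*)

(defun even-row-td-p (node path)
  "True for a <td> whose parent is a <tr> with class \"even\"."
  (and (tag-p "td" node)
       (tag-p "tr" (first path))
       (equal (attr "class" (first path)) "even")))

;; First <td> in the even row: the cell containing "c".
(find-tag #'even-row-td-p *table*)

(defun find-nth-tag (n predicate doc)
  "Return the Nth (zero-based) node matching PREDICATE, in depth-first order."
  (let ((seen 0))
    (find-tag (lambda (node path)
                (when (funcall predicate node path)
                  ;; Count this match, then compare its zero-based index to N.
                  (= (1- (incf seen)) n)))
              doc)))

;; Second <td> in the even row: the cell containing "d".
(find-nth-tag 1 #'even-row-td-p *table*)
```

Passing the ancestor list to the predicate is what lets a test depend on where a node sits in the tree, and not just on the node itself.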
You might want to have a simple helper to get the text of a node:
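One possible shape for such a helper, as a sketch: it assumes list-shaped elements (name attributes child...) and collects every string under the node, including strings nested inside child elements such as links. The name node-text is hypothetical.

```lisp
(defun node-text (node)
  "Concatenate, in document order, all strings found in NODE and its
descendants. Assumes elements are lists (name attributes child...)."
  (with-output-to-string (out)
    (labels ((walk (n)
               (cond ((stringp n) (write-string n out))
                     ;; Skip the name and the attributes, visit the children.
                     ((consp n) (mapc #'walk (cddr n))))))
      (walk node))))

;; A cell whose text is split across a nested link.
(node-text '("td" () "Jon " ("a" (("href" "#")) "Brower") " Minnoch"))
```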
You could define similar helpers to do whatever you need to do in your application. Using these, the example you gave would look like this:
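A sketch of one way to do it, under stated assumptions and not tested against the live page: the header cells are <th> elements in the first <tr> of the table, the data cells are <td> elements, and no cell uses colspan or rowspan. The helper names, the sample *page* and its "Rank" column are made up for illustration. The small accessors and node-text are repeated so that the snippet runs on its own.

```lisp
;; Plain-list stand-ins, assuming elements of the form (name attributes child...).
(defun node-name (node) (first node))
(defun node-children (node) (cddr node))

(defun tag-p (name node)
  (and (consp node) (equal (node-name node) name)))

(defun child-tags (name node)
  "Direct children of NODE that are elements named NAME."
  (remove-if-not (lambda (child) (tag-p name child))
                 (node-children node)))

(defun find-all-tags (name doc)
  "All elements named NAME in DOC, in depth-first order."
  (when (consp doc)
    (append (when (tag-p name doc) (list doc))
            (loop for child in (node-children doc)
                  append (find-all-tags name child)))))

(defun node-text (node)
  "All strings under NODE, concatenated in document order."
  (with-output-to-string (out)
    (labels ((walk (n)
               (cond ((stringp n) (write-string n out))
                     ((consp n) (mapc #'walk (cddr n))))))
      (walk node))))

(defun trimmed-text (node)
  (string-trim '(#\Space #\Tab #\Newline) (node-text node)))

(defun first-value-under-header (header doc)
  "Text of the first data cell in the column whose <th> text is HEADER,
taken from the first table in DOC that has such a column, or NIL."
  (dolist (table (find-all-tags "table" doc))
    (let* ((rows (find-all-tags "tr" table))
           ;; Column index of HEADER among the <th> cells of the first row.
           (index (position header (child-tags "th" (first rows))
                            :key #'trimmed-text :test #'string=)))
      (when index
        ;; First later row that has any <td> cells at all.
        (let ((row (find-if (lambda (r) (child-tags "td" r)) (rest rows))))
          (when row
            (return (trimmed-text (nth index (child-tags "td" row))))))))))

;; Hypothetical stand-in for the parsed page.
(defparameter *page*
  '("html" ()
    ("body" ()
     ("table" ()
      ("tr" () ("th" () "Rank") ("th" () "Name "))
      ("tr" () ("td" () "1") ("td" () ("a" (("href" "#")) "Jon Brower Minnoch")))
      ("tr" () ("td" () "2") ("td" () "Second person"))))))

(first-value-under-header "Name" *page*)
```

For the real page, the document would come from (html5-parser:parse-html5 (drakma:http-request url) :dom :xmls), as in the question. The live markup may not match these assumptions, so the row and cell selection might need adjusting.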