Navigating a webpage using html5-parser and xmls C

Posted 2019-08-13 00:25

Question:

I am trying to get the first row under the column titled "Name". For example, for https://en.wikipedia.org/wiki/List_of_the_heaviest_people I want to return the name "Jon Brower Minnoch". My code so far is below, but I think there must be a more general way of getting the name:

(defun find-tag (tag doc)
 (when (listp doc)
  (when (string= (xmls:node-name doc) tag)
   (return-from find-tag doc))
  (loop for child in (xmls:node-children doc)
   for find = (find-tag tag child)
   when find do (return-from find-tag find)))
  nil)

(defun parse-list-website (url)
  (second (second (second (third (find-tag "td" (html5-parser:parse-html5 (drakma:http-request url) :dom :xmls)))))))

and then to call the function:

(parse-list-website "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")

I am not very good with xmls and don't know how to get a td under a certain column header.

Answer 1:

The elements in the document returned by html5-parser:parse-html5 are in the form:

("name" (attribute-alist) &rest children)

You could access the parts with the standard list manipulation functions, but xmls also provides the functions node-name, node-attrs and node-children for the three parts, and using those is a little clearer. Edit: there are also the functions xmlrep-attrib-value, which returns the value of an attribute, and xmlrep-tagmatch, which matches the tag name. The children are either plain strings or elements in the same format.

So, for example, an HTML document with a 2x2 table would look like this:

(defparameter *doc*
  '("html" ()
     ("head" ()
       ("title" ()
         "Some title"))
     ("body" ()
       ("table" (("class" "some-class"))
         ("tr" (("class" "odd"))
           ("td" () "Some string")
           ("td" () "Another string"))
         ("tr" (("class" "even"))
           ("td" () "Third string")
           ("td" () "Fourth string"))))))
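As a minimal sketch of that layout, the three parts of an element can also be picked apart with plain list functions, with no library loaded. This assumes the plain-list representation shown above; *cell* and the plain-... helper names are made up for this illustration and are not part of xmls.

```lisp
;; Assumes the layout ("name" (attribute-alist) &rest children).
;; *cell* and the PLAIN-... names are hypothetical, for illustration only.
(defparameter *cell*
  '("td" (("class" "odd")) "Some string" ("b" () "bold")))

(defun plain-node-name (node) (first node))
(defun plain-node-attrs (node) (second node))
(defun plain-node-children (node) (cddr node))

(defun plain-attr-value (name node)
  "Return the value of attribute NAME on NODE, or NIL if it is absent."
  (second (assoc name (plain-node-attrs node) :test #'string=)))

(plain-node-name *cell*)          ; => "td"
(plain-attr-value "class" *cell*) ; => "odd"
(plain-node-children *cell*)      ; => ("Some string" ("b" NIL "bold"))
```

In real code the xmls accessors are the better choice, since they keep working if the representation changes.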

To traverse the DOM tree, let's define a recursive depth-first search like this (note that if-let comes from the alexandria library: either import it or write it as alexandria:if-let):

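(defun find-tag (predicate doc &optional path)
  "Depth-first search of DOC for the first node satisfying PREDICATE.
PREDICATE is called with a node and the list of its ancestors, nearest first."
  (when (funcall predicate doc path)
    (return-from find-tag doc))
  ;; Strings are leaves; only element lists have children to descend into.
  (when (listp doc)
    (let ((path (cons doc path)))
      (dolist (child (xmls:node-children doc))
        (if-let ((find (find-tag predicate child path)))
          (return-from find-tag find))))))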

It is called with a predicate function and a document. The predicate is called with two arguments: the element being matched and a list of its ancestors, nearest first. To find the first <td>, you could do this:

(find-tag (lambda (el path)
            (declare (ignore path))
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)))
          *doc*)
; => ("td" NIL "Some string")

Or to find the first <td> in the even row:

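(find-tag (lambda (el path)
            (and (listp el)
                 (xmls:xmlrep-tagmatch "td" el)
                 ;; (first path) is the parent element. Passing NIL as the third
                 ;; argument returns NIL instead of signaling an error when the
                 ;; parent has no class attribute.
                 (string= (xmls:xmlrep-attrib-value "class" (first path) nil)
                          "even")))
          *doc*)
; => ("td" NIL "Third string")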

Getting the second <td> on the even row would require something like this:

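(let ((matches 0))
  (find-tag (lambda (el path)
              ;; Count each <td> whose parent row has class "even".
              (when (and (listp el)
                         (xmls:xmlrep-tagmatch "td" el)
                         (string= (xmls:xmlrep-attrib-value "class" (first path) nil)
                                  "even"))
                (incf matches))
              ;; Becomes true on the second match, which ends the search.
              (= matches 2))
            *doc*))
; => ("td" NIL "Fourth string")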

You could define a helper function to find the nth tag:

(defun find-nth-tag (n tag doc)
  (let ((matches 0))
    (find-tag (lambda (el path)
                (declare (ignore path))
                (when (and (listp el)
                           (xmls:xmlrep-tagmatch tag el))
                  (incf matches))
                (= matches n))
              doc)))

(find-nth-tag 2 "td" *doc*) ; => ("td" NIL "Another string")
(find-nth-tag 4 "td" *doc*) ; => ("td" NIL "Fourth string")

You might want to have a simple helper to get the text of a node:

(defun node-text (el)
  (if (listp el)
      (first (xmls:node-children el))
      el))
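Note that node-text returns only the first child, which may itself be an element (for example a link inside a cell). A possible variant, sketched here under the assumption of the plain-list layout shown earlier, gathers every string below a node instead. The name node-all-text is hypothetical, and cddr stands in for xmls:node-children so the snippet runs without any library.

```lisp
(defun node-all-text (el)
  "Concatenate every string in EL and its descendants, in document order."
  (with-output-to-string (out)
    (labels ((walk (node)
               (cond ((stringp node) (write-string node out))
                     ;; Skip the name and attribute list, visit the children.
                     ((consp node) (dolist (child (cddr node))
                                     (walk child))))))
      (walk el))))

(node-all-text '("td" () ("a" (("href" "/wiki/X")) "Jon Brower") " Minnoch"))
; => "Jon Brower Minnoch"
```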

You could define similar helpers for whatever else your application needs. Using these, the example you gave would look like this:

(defparameter *doc*
  (html5-parser:parse-html5
   (drakma:http-request "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
   :dom :xmls))

(node-text (find-nth-tag 1 "a" (find-nth-tag 1 "td" *doc*)))
; => "Jon Brower Minnoch"
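
For the more general question of getting a cell under a given column header, one possible approach is to find the index of the matching <th> in the header row and take the <td> at the same index in the first data row. The following is only a sketch on a toy table, not tested against the live Wikipedia page. It assumes the plain-list layout, header cells that are <th>, data cells that are <td>, no colspan or rowspan, and that the table element (not the whole page) is passed in. All names in it (*table*, collect-tags, cell-text, first-cell-under) are hypothetical.

```lisp
;; Toy table; the parser typically wraps rows in a "tbody", so it is included.
(defparameter *table*
  '("table" (("class" "wikitable"))
    ("tbody" ()
     ("tr" () ("th" () "Rank") ("th" () "Name") ("th" () "Country"))
     ("tr" () ("td" () "1") ("td" () ("a" () "First person")) ("td" () "A"))
     ("tr" () ("td" () "2") ("td" () "Second person") ("td" () "B")))))

(defun collect-tags (tag node)
  "Return every element named TAG inside NODE, in document order."
  (let ((found '()))
    (labels ((walk (n)
               (when (consp n)
                 (when (equal (first n) tag)
                   (push n found))
                 (dolist (child (cddr n))
                   (walk child)))))
      (walk node))
    (nreverse found)))

(defun cell-text (node)
  "Concatenate all strings below NODE and trim surrounding whitespace."
  (string-trim '(#\Space #\Tab #\Newline)
               (with-output-to-string (out)
                 (labels ((walk (n)
                            (cond ((stringp n) (write-string n out))
                                  ((consp n) (dolist (child (cddr n))
                                               (walk child))))))
                   (walk node)))))

(defun first-cell-under (header table)
  "Return the text of the first data cell in the column titled HEADER, or NIL."
  (let* ((rows (collect-tags "tr" table))
         ;; The header row is the first row containing any <th>.
         (header-row (find-if (lambda (row) (collect-tags "th" row)) rows))
         (index (and header-row
                     (position header (collect-tags "th" header-row)
                               :key #'cell-text :test #'string=)))
         ;; The first data row is the first later row containing any <td>.
         (data-row (find-if (lambda (row) (collect-tags "td" row))
                            (rest (member header-row rows)))))
    (when (and index data-row)
      (let ((cell (nth index (collect-tags "td" data-row))))
        (and cell (cell-text cell))))))

(first-cell-under "Name" *table*)    ; => "First person"
(first-cell-under "Missing" *table*) ; => NIL
```

On a real page you would first locate the right table, for example with find-tag and a predicate on its class attribute, and pass that element instead of *table*.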