How to use Python's HTMLParser to extract spec

I've been working on a basic web crawler in Python using the HTMLParser class. I collect links with an overridden handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links.

How would I go about only fetching links that are between the <td class="title"> and </td> tags, like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>

Tags: python parsing hyperlink web-crawler html-parsing

1 Answer

唯我独甜

#2 · 2019-07-26 17:51

HTMLParser is a SAX-style (streaming) parser, which means that you get pieces of the document as they are parsed, but never the whole document at once. The parser calls methods you provide to handle tags and other types of data. Any context you may be interested in, such as which tags are inside other tags, you must glean yourself from the tags you see passing by.
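To make the streaming behaviour concrete, here is a minimal sketch, assuming Python 3 (where the class lives in html.parser). EventTracer is a hypothetical helper, not part of the library: it only records which callback fired, in the order the parser fired it.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper: records each parser callback in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text nodes
            self.events.append(("data", text))


tracer = EventTracer()
tracer.feed('<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a></td>')
tracer.close()

for event in tracer.events:
    print(event)
```

Each callback sees only its own tag or text. Nothing tells handle_starttag that the a tag sits inside a td; that relationship is only visible from the order of the events.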

For example, if you see a <td> tag, then you know you are in a table cell, and can set a flag to that effect. When you see </td>, you know you have left a table cell and can clear that flag. To get the links inside a table cell, then, if you see <a> and you know that you are in a table cell (because of that flag you set), you grab the value of the tag's href attribute if it has one.

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class LinkExtractor(HTMLParser):

    def reset(self):
        # HTMLParser.__init__ calls reset(), so this state exists on every new instance
        HTMLParser.reset(self)
        self.extracting = False   # True while inside <td class="title">
        self.links      = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)   # save us from iterating over the attrs
        if tag == "td" and attrs.get("class", "") == "title":
            self.extracting = True
        elif tag == "a" and "href" in attrs and self.extracting:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False   # any </td> ends extraction
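For completeness, here is a self-contained sketch of how such an extractor might be driven, assuming Python 3. It combines the flag approach with the urljoin call from the question. The class name TitleLinkExtractor, the base_url parameter, the example.com base URL, and the extra table cells in the sample page are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkExtractor(HTMLParser):
    """Collects hrefs of <a> tags that appear inside <td class="title">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url   # used to resolve relative hrefs
        self.extracting = False    # True while inside <td class="title">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "title":
            self.extracting = True
        elif tag == "a" and self.extracting and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False


# Sample page: the cell from the question plus two hypothetical cells.
page = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="/ignored">skip me</a></td>'
    '<td class="title"><a href="item?id=1">relative link</a></td>'
    '</tr></table>'
)

parser = TitleLinkExtractor("http://example.com/news/")
parser.feed(page)
parser.close()
print(parser.links)
```

Only the two links inside title cells end up in parser.links; the link in the "subtext" cell is skipped because the flag is off when its a tag arrives.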

This quickly gets to be a pain as you need more and more context to get what you want from the document, which is why people are recommending lxml and BeautifulSoup. These are DOM-style parsers that keep track of the document hierarchy for you and provide various friendly ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
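As an illustration of how the bookkeeping grows, consider a title cell that contains a nested table. A single boolean flag is cleared by the inner </td>, so links after the nested table are missed. The sketch below (Python 3; the class name and sample markup are hypothetical) keeps one entry per open td on a stack instead. It assumes every td is closed explicitly, which real-world HTML does not guarantee.

```python
from html.parser import HTMLParser


class NestedTitleLinkExtractor(HTMLParser):
    """Hypothetical variant that survives <td> elements nested in a title cell."""

    def __init__(self):
        super().__init__()
        self.td_stack = []   # one bool per open <td>: is it class="title"?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.td_stack.append(attrs.get("class") == "title")
        elif tag == "a" and "href" in attrs and any(self.td_stack):
            # at least one enclosing <td> is a title cell
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()


html = (
    '<table><tr><td class="title">'
    '<table><tr><td><a href="/inner">inner</a></td></tr></table>'
    '<a href="/after-inner">after</a>'
    '</td>'
    '<td><a href="/outside">outside</a></td></tr></table>'
)

extractor = NestedTitleLinkExtractor()
extractor.feed(html)
extractor.close()
print(extractor.links)
```

This is exactly the kind of hierarchy tracking that a DOM-style parser does for you.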

BTW, I answered a similar question recently here.
