Can you provide examples of parsing HTML?

2018-12-31 15:50发布

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.

For the sake of consistency, I ask that the example be parsing an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

29条回答
心情的温度
2楼-- · 2018-12-31 16:08

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
查看更多
永恒的永恒
4楼-- · 2018-12-31 16:09

language: Python
library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using JQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')
查看更多
荒废的爱情
5楼-- · 2018-12-31 16:10

Language: Java
Libraries: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI())
查看更多
看淡一切
6楼-- · 2018-12-31 16:11

Language: C#
Library: HtmlAgilityPack

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
查看更多
裙下三千臣
7楼-- · 2018-12-31 16:14

language: Ruby
library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
查看更多
登录 后发表回答