Can you provide examples of parsing HTML?

2018-12-31 15:50发布

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.

For the sake of consistency, I ask that the example be parsing an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

29条回答
浅入江南
2楼-- · 2018-12-31 16:01

language: shell
library: lynx (well, it's not library, but in shell, every program is kind-of library)

lynx -dump -listonly http://news.google.com/
查看更多
临风纵饮
3楼-- · 2018-12-31 16:01

Language: C#
Library: System.XML (standard .NET)

using System.Collections.Generic;
using System.Xml;

public static void Main(string[] args)
{
    List<string> matches = new List<string>();

    XmlDocument xd = new XmlDocument();
    xd.LoadXml("<html>...</html>");

    FindHrefs(xd.FirstChild, matches);
}

static void FindHrefs(XmlNode xn, List<string> matches)
{
    if (xn.Attributes != null && xn.Attributes["href"] != null)
        matches.Add(xn.Attributes["href"].InnerXml);

    foreach (XmlNode child in xn.ChildNodes)
        FindHrefs(child, matches);
}
查看更多
不流泪的眼
4楼-- · 2018-12-31 16:01

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else 
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id) query:(NSString *)xpathQuery WithResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}
查看更多
骚的不知所云
5楼-- · 2018-12-31 16:02

language: Perl
library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

caveat: Can get wide-character errors with pages like this one (changing the url to the one commented out will get this error), but the HTML::Parser solution above doesn't share this problem.

查看更多
梦寄多情
6楼-- · 2018-12-31 16:05

language: Python
library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links  

output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

also possible:

for link in links:
    print link['href']

output:

http://foo.com
http://bar.com
http://baz.com
查看更多
与风俱净
7楼-- · 2018-12-31 16:05

Language: JavaScript
Library: DOM

var links = document.links;
for(var i in links){
    var href = links[i].href;
    if(href != null) console.debug(href);
}

(using firebug console.debug for output...)

查看更多
登录 后发表回答