Can you provide examples of parsing HTML?

Posted 2018-12-31 15:50

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual answers here will be linked to from answers to questions about parsing HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href attribute of anchor tags. To make this question easy to search, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library name a link to its documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

29 answers
梦该遗忘 · 2018-12-31 15:51

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that it has modules for very specific tasks, like link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

  • use strict - turns on "strict" mode, which eases debugging; not strictly needed for this example
  • use HTML::LinkExtor - loads the link-extraction module
  • use LWP::Simple - just a simple way to fetch some HTML to test with
  • my $url = 'http://www.google.com/' - the page we will extract URLs from
  • my $content = get( $url ) - fetches the page's HTML
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback for every link, and $url to use as the base URL for resolving relative URLs
  • $p->parse( $content ) - pretty obvious, I guess
  • exit - end of program
  • sub process_link - beginning of the process_link function
  • my ($tag, %attr) - gets the arguments: the tag name and its attributes
  • return unless $tag eq 'a' - skips processing if the tag is not <a>
  • return unless defined $attr{'href'} - skips processing if the <a> tag has no href attribute
  • print "- $attr{'href'}\n"; - pretty obvious, I guess :)
  • return; - finishes the function

That's all.
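The same event-driven, callback style is available in Python's standard library. As a rough analogue of the program above (this is not the HTML::LinkExtor API; the LinkExtractor class and the test string are invented for illustration, and a fixed string is parsed instead of a fetched page):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags via start-tag callbacks."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Skip anything that is not an anchor tag
        if tag != "a":
            return
        attrs = dict(attrs)
        # Skip anchors without an href attribute
        if "href" in attrs:
            self.links.append(attrs["href"])

p = LinkExtractor()
p.feed('<a href="http://foo.com/">foo</a><a name="x">no href</a>')
print(p.links)  # → ['http://foo.com/']
```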

人气声优 · 2018-12-31 15:51

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}
人气声优 · 2018-12-31 15:52

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))
不再属于我。 · 2018-12-31 15:56

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The same example, using the html-parsing and sxml packages from the newer package system:

(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: install the required packages from the command line with raco:

raco pkg install html-parsing

and:

raco pkg install sxml
爱死公子算了 · 2018-12-31 15:57

Language: ColdFusion 9.0.1+

Library: jsoup

<cfscript>
function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some HTML to parse
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    // CFML arrays are 1-indexed, so use LTE to include the last element
    for(var a=1; a LTE arrayLen(links); a++){
        var s = {};
        s.href = links[a].attr('href');
        s.text = links[a].text();
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res, s);
    }
    return res;
}

//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structures; each struct contains HREF and TEXT keys.

步步皆殇っ · 2018-12-31 15:58

Language: Python
Library: HTQL

import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print url, text

Simple and intuitive.
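Note that the HTQL example is Python 2 (the bare print statement). For Python 3 users without HTQL installed, a rough equivalent of this snippet using only the standard library's html.parser might look like this (the HrefText class is a name invented for illustration):

```python
from html.parser import HTMLParser

class HrefText(HTMLParser):
    """Collects (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Pair the text inside the anchor with its href
        if self._href is not None:
            self.pairs.append((self._href, data))
            self._href = None

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
p = HrefText()
p.feed(page)
for url, text in p.pairs:
    print(url, text)
```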
