Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make this question easy to search, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that it has modules for very specific tasks, such as link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = '';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";

    return;
}

  • use strict - turns on "strict" mode, which eases debugging; not essential to the example
  • use HTML::LinkExtor - loads the module we are interested in
  • use LWP::Simple - just a simple way to fetch some HTML for testing
  • my $url = '' - the page we will extract URLs from
  • my $content = get( $url ) - fetches the page's HTML
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be called back for every tag that contains links, and $url to use as the base URL for relative URLs
  • $p->parse( $content ) - parses the HTML, invoking the callback along the way
  • exit - end of program
  • sub process_link - beginning of the function process_link
  • my ($tag, %attr) - gets the arguments, which are the tag name and its attributes
  • return unless $tag eq 'a' - skip processing if the tag is not <a>
  • return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
  • print "- $attr{'href'}\n"; - prints the link
  • return; - finish the function

That's all.
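
For readers who don't use Perl, the same callback pattern can be sketched with Python's standard library. This is only an illustration under assumptions: the class name LinkExtractor, the callback process_link, the sample HTML and the base URL http://example.com/dir/ are made up here and are not part of HTML::LinkExtor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical analogue of the LinkExtor callback pattern."""

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        # Skip everything except <a> tags that carry an href value.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # Resolve relative URLs against the base URL, then call back.
        self.callback(urljoin(self.base_url, href))


found = []


def process_link(url):
    # Assumed callback: remember the link and print it.
    found.append(url)
    print("- " + url)


parser = LinkExtractor(process_link, "http://example.com/dir/")
parser.feed('<a href="a.html">1</a><a name="x">no href</a><a href="/b.html">2</a>')
parser.close()
```

As in the Perl version, the parser does the walking and the callback only decides what to do with each link.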

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get '';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:


(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url ""))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The same example using packages from the new package system, html-parsing and sxml:

(require net/url
         html-parsing
         sxml)

(define the-url (string->url ""))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: install the required packages with raco from the command line:

raco pkg install html-parsing
raco pkg install sxml

Language: ColdFusion 9.0.1+

Library: jSoup

<cfscript>
function parseURL(required string url){
    var res = [];
    // Load the jsoup jar via JavaLoader and get a handle on the Jsoup class.
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    // ColdFusion arrays are 1-based, so LTE is needed to reach the last link.
    for(var a=1; a LTE arrayLen(links); a++){
        var s = {};
        s.href = links[a].attr('href');
        s.text = links[a].text();
        // Keep only absolute http/https links.
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res, s);
    }
    return res;
}
</cfscript>

<cfdump var="#parseURL("")#">

Returns an array of structs, each containing HREF and TEXT keys. Only links whose href contains http:// or https:// are included.
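
To make the shape of that result concrete for readers without a ColdFusion server, here is a rough sketch of the same idea in Python using only the standard library. The names AnchorCollector and parse_links and the sample HTML are invented for illustration, and the filter is a prefix check, which is stricter than the ColdFusion contains test in the function above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects {'href': ..., 'text': ...} for each <a> with an absolute URL."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href") or "", "text": ""}

    def handle_data(self, data):
        # Accumulate the text found between <a> and </a>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Keep only absolute http/https links (assumed prefix check).
            if self._current["href"].startswith(("http://", "https://")):
                self.results.append(self._current)
            self._current = None


def parse_links(html):
    """Return a list of dicts, one per absolute link in the HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.results


sample = '<a href="http://example.com/">home</a><a href="about.html">about</a>'
links = parse_links(sample)
print(links)
```

Each dict plays the role of one struct in the returned array; the relative link about.html is dropped by the filter.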

Language: Python
Library: HTQL

import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
# HTQL query: for every <a> tag, return its href attribute and its text (tx)
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print(url, text)

Simple and intuitive.

