How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked from answers to questions about parsing HTML with regexes, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href
in anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: Perl
Library: HTML::LinkExtor
The beauty of Perl is that you have modules for very specific tasks, like link extraction.
Whole program:
Explanation:
That's all.
Language: Perl
Library: HTML::TreeBuilder
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
Now we can do the following at the REPL (I've added line breaks in test-select):
You'll need the following to try it out:
Preamble:
Test HTML:
Language: Racket
Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)
The above example, using packages from the new package system (html-parsing and sxml):
Note: Install the required packages with 'raco' from a command line, with 'raco pkg install html-parsing' and 'raco pkg install sxml'.
Language: ColdFusion 9.0.1+
Library: jsoup
Returns an array of structures; each struct contains an HREF and a TEXT key.
Language: Python
Library: HTQL
Simple and intuitive.
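For comparison, here is a minimal sketch of the same link-extraction task using Python's standard-library html.parser module instead of HTQL. The names LinkExtractor and extract_links, and the sample HTML string, are illustrative assumptions and not part of any answer above.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href, and bare "href" with no value.
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all anchor tags in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample input, including an uppercase tag and an anchor with no href.
sample = (
    '<p><a href="http://example.com/">one</a> '
    '<A HREF="/two">two</A> <a name="x">no href</a></p>'
)
print(extract_links(sample))
```

To parse a file rather than a string, read its contents first and pass them to extract_links. The parser is tolerant of malformed markup, which is the main reason to prefer it over a regex for this job.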