How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked from answers to questions about parsing HTML with regexes, to show the right way to do it.
For the sake of consistency, I ask that each example parse an HTML file for the href
attributes of its anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
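To illustrate the requested format, here is a minimal sketch of what an entry could look like, assuming Python and the html.parser module from its standard library. The names HrefCollector, extract_hrefs, and SAMPLE_HTML are made up for this example and do not come from any answer below.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; HTMLParser lowercases
        # tag and attribute names before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values of all anchor tags, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical input: mixed case, mixed quoting, and an anchor without href.
SAMPLE_HTML = """
<html><body>
<p>See <a href="http://example.com/one">one</a> and
<A HREF='/two'>two</a>, but not <a name="anchor">this</a>.
</body></html>
"""

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE_HTML):
        print(href)
```

The parser is event-driven, so the anchor without an href is simply skipped rather than treated as an error.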
Language: Ruby
Library: Nokogiri
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
Language: Python
Library: lxml.html
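The code for this entry did not survive, so what follows is only a sketch of how href extraction might look with lxml.html, assuming the third-party lxml package is installed. The sample document and the function name hrefs_with_lxml are hypothetical.

```python
import lxml.html

# Hypothetical input document, used only for illustration.
SAMPLE_DOC = (
    "<html><body>"
    '<p>See <a href="http://example.com/one">one</a> and '
    '<a href="/two">two</a>, but not <a name="anchor">this</a>.</p>'
    "</body></html>"
)


def hrefs_with_lxml(html_text):
    """Return the href values of all anchor elements, in document order."""
    doc = lxml.html.fromstring(html_text)
    # XPath: select the href attribute of every <a> element in the tree.
    return [str(href) for href in doc.xpath("//a/@href")]


if __name__ == "__main__":
    for href in hrefs_with_lxml(SAMPLE_DOC):
        print(href)
```

To read from a file instead of a string, lxml.html.parse accepts a filename or file object and returns a tree on which the same XPath query can be run.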
lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:
Language: Java
Libraries: XOM, TagSoup
I've included intentionally malformed and inconsistent XML in this sample.
TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:
Language: C#
Library: HtmlAgilityPack
Language: Ruby
Library: Hpricot