How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual responses here will be linked from answers to questions about parsing HTML with regexes, to show the right way to do it.
For the sake of consistency, I ask that the example parse an HTML file for the href
attributes of anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
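As an illustration of the format, here is a minimal sketch of what a conforming answer might look like, using only Python's standard library. The class name `HrefCollector`, the helper `extract_hrefs`, and the sample markup are made up for this example.

Language: Python
Library: html.parser (standard library)

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchor tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample input; a real answer would read an HTML file instead.
sample = (
    '<p><a href="http://example.com/">one</a> '
    '<A HREF="/two">two</A> '
    '<a name="x">no href</a></p>'
)

for href in extract_hrefs(sample):
    print(href)
```

Anchors without an href attribute are skipped, and mixed-case tags are handled because the parser normalizes names to lowercase.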
Language: shell
Library: lynx (well, it's not a library, but in shell, every program is kind of a library)
Language: C#
Library: System.Xml (standard .NET)
Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest
Language: Perl
Library: XML::Twig
Caveat: This can raise wide-character errors with pages like this one (changing the URL to the one commented out will trigger the error), but the HTML::Parser solution above doesn't share this problem.
Language: Python
Library: BeautifulSoup
output:
also possible:
output:
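
A minimal sketch of link extraction with BeautifulSoup, assuming the current `bs4` package (installed as `beautifulsoup4`) and the standard-library `html.parser` backend. The function name `extract_hrefs` and the sample markup are illustrative and are not taken from the answer above.

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href values of all anchor tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical sample input; a real script would read a file or a response body.
sample = """
<html><body>
  <a href="http://example.com/">Example</a>
  <a name="top">anchor without href</a>
  <a href="/relative/path">Relative</a>
</body></html>
"""

for href in extract_hrefs(sample):
    print(href)
```

Passing the parser name explicitly keeps the result from depending on which optional parsers (such as lxml) happen to be installed.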
Language: JavaScript
Library: DOM
(using Firebug's console.debug for output...)