How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Answers to questions about parsing HTML with regexes will link to individual comments here as a way of showing the right way to do things.
For the sake of consistency, I ask that each example parse an HTML file for the href attributes
in anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
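For illustration only, here is a minimal sketch of the kind of example being requested, assuming Python and its standard-library html.parser module (neither appears elsewhere in this thread). The names LinkExtractor and extract_links are hypothetical and chosen just for this sketch.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in anchor tags, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/">one</a> <a name="x">no href</a></p>'
    for link in extract_links(sample):
        print(link)
```

The parser is event-driven rather than tree-building, so it only needs to react to start tags; anchors without an href are skipped.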
Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO
(shown using the DOM API, without using the XPath or STP APIs)
Language: JavaScript
Library: jQuery
(using Firebug's console.debug for output)
And loading any HTML page:
I used another each function for this one; I think it's cleaner when chaining methods.
Language: Perl
Library: pQuery
Language: PHP
Library: DOM
Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress invalid HTML parsing warnings.
Using phantomjs, save this file as extract-links.js:
run: