How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked from answers to questions about parsing HTML with regexes, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href
attributes of anchor tags. To make this question easy to search, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: PHP
Library: SimpleXML (and DOM)
Language: Python
Library: HTMLParser
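A minimal sketch of how this might look, assuming Python 3, where the module is imported as `html.parser` (it was named `HTMLParser` in Python 2). The names `LinkCollector` and `extract_hrefs` and the sample markup are made up for illustration and are not taken from the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href of every anchor tag in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    # Hypothetical sample document; the second anchor has no href.
    sample = (
        '<p><a href="http://example.com/">one</a> '
        '<a name="x">no href</a> '
        '<a href="/two">two</a></p>'
    )
    for href in extract_hrefs(sample):
        print(href)
```

To parse a file instead, read it into a string and pass that to `extract_hrefs`. Anchors without an href are skipped rather than reported as empty.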
Language: Perl
Library: HTML::Parser
Language: Ruby
Library: Nokogiri
Which outputs:
This is a minor variation on the one above that produces output usable in a report. It returns only the first and last elements of the list of hrefs:
Language: Java
Library: jsoup
Language: JavaScript/Node.js
Library: Request and Cheerio
The Request library downloads the HTML document, and Cheerio lets you use jQuery-style CSS selectors to target elements in it.