I need to parse large numbers of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the native way to parse an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.
Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js. Other options include:

- BeautifulSoup for Python
- converting your HTML to XHTML and using XSLT
- the HTML Agility Pack for .NET, an extremely solid HTML parsing library

Out of all these options, I prefer the Node.js route, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
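For instance, here's a minimal jsdom sketch using the modern `JSDOM` constructor (the markup and selector are made up for illustration; in practice the HTML would come from your fetched pages):

```js
const { JSDOM } = require("jsdom");

// Hypothetical markup standing in for a fetched page.
const html = `<ul><li class="link"><a href="/a">A</a></li></ul>`;

const dom = new JSDOM(html);
const document = dom.window.document;

// Standard W3C DOM accessors, reusable in browser code too.
for (const a of document.querySelectorAll("li.link a")) {
  console.log(a.getAttribute("href"), a.textContent);
}
```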
jsdom is too strict to be useful for real screen-scraping work, but BeautifulSoup doesn't choke on bad markup.
node-soupselect is a port of Python's soupselect (the CSS selector extension for BeautifulSoup) to Node.js, and it works beautifully. See the sketch below.
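A minimal sketch of that style of use, pairing node-soupselect with the old htmlparser module (the markup and the `a.title` selector are invented for illustration):

```js
const htmlparser = require("htmlparser");
const select = require("soupselect").select;

const html = '<div><a class="title" href="/post/1">First post</a></div>';

// htmlparser builds a DOM-like tree; soupselect then queries it
// with CSS selectors, BeautifulSoup-style.
const handler = new htmlparser.DefaultHandler((err, dom) => {
  if (err) {
    console.error("parse error:", err);
    return;
  }
  select(dom, "a.title").forEach((a) => {
    console.log(a.attribs.href);
  });
});

new htmlparser.Parser(handler).parseComplete(html);
```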
Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.
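A minimal sketch (the markup and selector are made up):

```js
const cheerio = require("cheerio");

const html = '<ul><li class="item">One</li><li class="item">Two</li></ul>';
const $ = cheerio.load(html);

// jQuery-style traversal, but on the server.
$("li.item").each((i, el) => {
  console.log($(el).text());
});
```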
Use htmlparser2; it's way faster and pretty straightforward. Consult this usage example:
https://www.npmjs.org/package/htmlparser2#usage
And the live demo here:
http://demos.forbeslindesay.co.uk/htmlparser2/
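The gist of its streaming API is a set of callbacks fired as the parser walks the input; a minimal sketch (the markup here is invented):

```js
const { Parser } = require("htmlparser2");

const parser = new Parser({
  // Called for every opening tag, with its attributes.
  onopentag(name, attributes) {
    if (name === "a") console.log("link:", attributes.href);
  },
  // Called for text nodes between tags.
  ontext(text) {
    if (text.trim()) console.log("text:", text.trim());
  },
});

parser.write('<p>See <a href="https://example.com">the example</a>.</p>');
parser.end();
```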
htmlparser2 by fb55 seems to be a good alternative.