I need to parse large numbers of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the native way to parse an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.
Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js. Other options include:

- BeautifulSoup for Python
- converting your HTML to XHTML and using XSLT
- the HTML Agility Pack for .NET, an extremely solid HTML parsing library

Out of all these options, I prefer the Node.js route, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
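For instance, here's a minimal jsdom sketch using the modern `JSDOM` constructor (the markup and selector are made up for illustration; in practice the HTML would come from your fetched pages):

```js
const { JSDOM } = require("jsdom");

// Hypothetical markup standing in for a fetched page.
const html = `<ul><li class="link"><a href="/a">A</a></li></ul>`;

const dom = new JSDOM(html);
const document = dom.window.document;

// Standard W3C DOM accessors, reusable in browser code too.
for (const a of document.querySelectorAll("li.link a")) {
  console.log(a.getAttribute("href"), a.textContent);
}
```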
jsdom is too strict to be useful for real screen-scraping work, but BeautifulSoup doesn't choke on bad markup.
node-soupselect is a port of Python's soupselect (the CSS selector extension for BeautifulSoup) to Node.js, and it works beautifully. See the sketch below.
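A minimal sketch of that style of use, pairing node-soupselect with the old htmlparser module (the markup and the `a.title` selector are invented for illustration):

```js
const htmlparser = require("htmlparser");
const select = require("soupselect").select;

const html = '<div><a class="title" href="/post/1">First post</a></div>';

// htmlparser builds a DOM-like tree; soupselect then queries it
// with CSS selectors, BeautifulSoup-style.
const handler = new htmlparser.DefaultHandler((err, dom) => {
  if (err) {
    console.error("parse error:", err);
    return;
  }
  select(dom, "a.title").forEach((a) => {
    console.log(a.attribs.href);
  });
});

new htmlparser.Parser(handler).parseComplete(html);
```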
Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.
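A minimal sketch (the markup and selector are made up):

```js
const cheerio = require("cheerio");

const html = '<ul><li class="item">One</li><li class="item">Two</li></ul>';
const $ = cheerio.load(html);

// jQuery-style traversal, but on the server.
$("li.item").each((i, el) => {
  console.log($(el).text());
});
```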
Use htmlparser2; it's way faster and pretty straightforward. Consult this usage example:
https://www.npmjs.org/package/htmlparser2#usage
And the live demo here:
http://demos.forbeslindesay.co.uk/htmlparser2/
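The gist of its streaming API is a set of callbacks fired as the parser walks the input; a minimal sketch (the markup here is invented):

```js
const { Parser } = require("htmlparser2");

const parser = new Parser({
  // Called for every opening tag, with its attributes.
  onopentag(name, attributes) {
    if (name === "a") console.log("link:", attributes.href);
  },
  // Called for text nodes between tags.
  ontext(text) {
    if (text.trim()) console.log("text:", text.trim());
  },
});

parser.write('<p>See <a href="https://example.com">the example</a>.</p>');
parser.end();
```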
htmlparser2 by fb55 seems to be a good alternative.