Jsoup-like HTML parser for C++ [closed]

Posted 2020-02-08 05:54

一夜七次

Question:

Closed. This question is off-topic. It is not currently accepting answers.

Closed last year.

I have been writing some Java code to extract data from web pages, and Jsoup was one of the best libraries to work with. Unfortunately, I now have to port the whole code to C/C++, and I cannot find any decent HTML parser for C++. Is there a Jsoup-like library for C++, or how can similar results be achieved?

[Currently I am using curl to fetch the page source and searching the internet for an HTML parser.]

Answer 1:

Unfortunately, I guess there's no parser quite like Jsoup for C++ ...

Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries

I used TinyXML-2 for (HTML) DOM parsing; it's a very small library (only 2 files) that runs on most operating systems (even non-desktop ones).

LibXml

push and pull parser (DOM, SAX)
Validation
XPath and XPointer support
Cross-platform / good documentation

Apache Xerces

push and pull parser (DOM, SAX)
Validation
No XPath support (but a package for this?)
Cross-platform / good documentation

If you are on C++/CLI, check out NSoup, a Jsoup port for .NET.

Some more:

htmlcxx - HTML and CSS APIs for C++
MSHTML (?)
pugixml (DOM / XPath and Unicode support)
LibCSS (CSS parser) / LibDOM (DOM) (both written in C, however)
hcxselect (CSS selector engine for C++)

Maybe you can combine a DOM parser with a CSS selector engine?
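
To illustrate what combining a DOM model with a selector engine could look like, here is a minimal sketch using only the standard library. Everything in it (the Node struct, add_child, matches, select_all, make_sample_tree) is a hypothetical name made up for this illustration, not the API of any library listed above, and it only understands the selectors "tag", "#id" and ".class". A real setup would fill the tree from one of the parsers above instead of building it by hand.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real parser would build this tree for you.
struct Node {
    std::string tag;                              // element name, e.g. "li"
    std::string id;                               // id attribute, may be empty
    std::vector<std::string> classes;             // class attribute, split on spaces
    std::string text;                             // text content, may be empty
    std::vector<std::unique_ptr<Node>> children;  // child elements in document order
};

// Append a child element to parent and return a reference to it.
Node& add_child(Node& parent, std::string tag, std::string id = "",
                std::vector<std::string> classes = {}, std::string text = "") {
    auto child = std::make_unique<Node>();
    child->tag = std::move(tag);
    child->id = std::move(id);
    child->classes = std::move(classes);
    child->text = std::move(text);
    parent.children.push_back(std::move(child));
    return *parent.children.back();
}

// Simplified selector check: "#id", ".class" or a bare tag name only.
bool matches(const Node& n, const std::string& sel) {
    if (sel.empty()) return false;
    if (sel[0] == '#') return n.id == sel.substr(1);
    if (sel[0] == '.') {
        const std::string cls = sel.substr(1);
        return std::find(n.classes.begin(), n.classes.end(), cls) != n.classes.end();
    }
    return n.tag == sel;
}

// Depth-first walk that appends every matching node to out.
void collect(const Node& n, const std::string& sel, std::vector<const Node*>& out) {
    if (matches(n, sel)) out.push_back(&n);
    for (const auto& child : n.children) collect(*child, sel, out);
}

// Return all nodes under root (root included) that match the selector.
std::vector<const Node*> select_all(const Node& root, const std::string& sel) {
    std::vector<const Node*> out;
    collect(root, sel, out);
    return out;
}

// Hand-built stand-in for:
// <body><ul id="menu"><li class="item">Home</li>
// <li class="item active">Docs</li></ul><p id="intro" class="note">Hello</p></body>
Node make_sample_tree() {
    Node root;
    root.tag = "body";
    Node& list = add_child(root, "ul", "menu");
    add_child(list, "li", "", {"item"}, "Home");
    add_child(list, "li", "", {"item", "active"}, "Docs");
    add_child(root, "p", "intro", {"note"}, "Hello");
    return root;
}
```

With this sketch, select_all(make_sample_tree(), ".item") would return the two li nodes. Compound selectors such as "ul li.active" are deliberately left out; supporting them is the part where a dedicated selector engine like hcxselect would come in.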

Answer 2:

If you are familiar with the Qt framework, the most convenient way is to use QWebElement (Reference here).

Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then parsing it with an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.

Answer 3:

Chromium has an open source parser. Also, the Google gumbo-parser looks cool.

Answer 4:

You can use xerces2 as a DOM parser.

Or use HTML Tidy to clean up the HTML and convert it to XHTML, then parse the XML with pugixml or a similar XML parser. And since pugixml is a non-validating parser, it might even work on the raw HTML without running HTML Tidy on it first.

Answer 5:

If you don't mind calling out to Python from C++, you could use Beautiful Soup. At least the name is right!

Seriously - it's a nice, no-nonsense HTML parser. I haven't tried calling out to it from C++, although it should be straightforward.

Answer 6:

Yes, there is an HTML parser library for C++; check it out: https://github.com/HamedMasafi/HtmlParser/

This library can parse HTML or CSS and convert it to a tree model. You can search the parsed HTML with methods like get_by_id, get_by_class_name, and get_by_tag_name, and there is also a question method that lets you search via CSS selector (only tag, id, class, and nested child selectors are supported for now).

After finding a child, you can change its attributes, and finally you can print the HTML to a std::string in compact or pretty mode.