I have been writing some code in Java to extract data from web pages, and Jsoup was one of the best libraries to work with. Unfortunately, I now have to port the whole code to C/C++, and I cannot find a decent HTML parser for C++. Is there a Jsoup-like library for C++, or how can similar results be achieved?
[Currently I am using Curl to get the source of the pages and searching the internet for an HTML parser]
Unfortunately, I guess there's no parser like Jsoup for C++ ...
Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries
For parsing, I used TinyXML-2 for (HTML) DOM parsing; it's a very small library (only 2 files) that runs on most operating systems (even non-desktop ones).
LibXml
- push and pull parser (DOM, SAX)
- Validation
- XPath and XPointer support
- Cross-platform / good documentation
Apache Xerces
- push and pull parser (DOM, SAX)
- Validation
- No XPath support (but a package for this?)
- Cross-platform / good documentation
If you are on C++/CLI, check out NSoup - a Jsoup port for .NET.
Some more:
- htmlcxx - HTML and CSS APIs for C++
- MSHTML (?)
- pugixml (DOM / XPath and Unicode support)
- LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
- hcxselect (CSS selector engine for C++)
Maybe you can combine a DOM model/parser with a CSS selector engine?
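To make the idea of pairing a DOM model with a selector engine concrete, here is a minimal sketch in plain standard C++. Everything in it (`Node`, `make_node`, `matches`, `select_all`, `sample_tree`) is a hypothetical name invented for illustration and is not the API of any library listed above. It only understands tag, `#id` and `.class` parts joined by the descendant combinator (a space), and it assumes the tree has already been built by some parser.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; real parsers expose much richer types.
struct Node {
    std::string tag;
    std::string id;
    std::vector<std::string> classes;
    std::vector<std::unique_ptr<Node>> children;
};

// Helper for this sketch only: allocates a node with the given fields.
std::unique_ptr<Node> make_node(std::string tag, std::string id = "",
                                std::vector<std::string> classes = {}) {
    auto n = std::make_unique<Node>();
    n->tag = std::move(tag);
    n->id = std::move(id);
    n->classes = std::move(classes);
    return n;
}

// Tests one simple selector part: "tag", "#id" or ".class".
bool matches(const Node& n, const std::string& part) {
    if (part.empty()) return false;
    if (part[0] == '#') return !n.id.empty() && n.id == part.substr(1);
    if (part[0] == '.')
        return std::find(n.classes.begin(), n.classes.end(),
                         part.substr(1)) != n.classes.end();
    return n.tag == part;
}

// Depth-first walk; `path` holds the ancestors of `n`, root first.
void walk(const Node& n, std::vector<const Node*>& path,
          const std::vector<std::string>& parts,
          std::vector<const Node*>& out) {
    if (matches(n, parts.back())) {
        // Match the remaining parts right to left against the ancestors.
        std::size_t remaining = parts.size() - 1;
        for (auto it = path.rbegin(); it != path.rend() && remaining > 0; ++it)
            if (matches(**it, parts[remaining - 1])) --remaining;
        if (remaining == 0) out.push_back(&n);
    }
    path.push_back(&n);
    for (const auto& child : n.children) walk(*child, path, parts, out);
    path.pop_back();
}

// Returns all nodes matching a space-separated descendant selector,
// in document order.
std::vector<const Node*> select_all(const Node& root,
                                    const std::string& selector) {
    std::istringstream in(selector);
    std::vector<std::string> parts;
    for (std::string p; in >> p;) parts.push_back(p);
    std::vector<const Node*> out;
    if (parts.empty()) return out;
    std::vector<const Node*> path;
    walk(root, path, parts, out);
    return out;
}

// Small hand-built tree standing in for a parsed page:
// body > div#main.content > (p.intro, p), body > div.footer > p
std::unique_ptr<Node> sample_tree() {
    auto body = make_node("body");
    auto main_div = make_node("div", "main", {"content"});
    main_div->children.push_back(make_node("p", "", {"intro"}));
    main_div->children.push_back(make_node("p"));
    auto footer = make_node("div", "", {"footer"});
    footer->children.push_back(make_node("p"));
    body->children.push_back(std::move(main_div));
    body->children.push_back(std::move(footer));
    return body;
}
```

With the sample tree, `select_all(*sample_tree(), "#main p")` would return the two paragraphs inside the main div. A real selector engine such as hcxselect covers far more of the CSS grammar; this only shows how little glue is needed once a DOM tree exists.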
If you are familiar with the Qt framework, the most convenient way is to use QWebElement (Reference here).
Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then using an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.
Chromium has an open source parser. Also, the Google gumbo-parser looks cool.
You can use xerces2 as a DOM parser.
Or use HTML Tidy to clean up the HTML and convert it to XHTML, then parse the XML with pugixml or a similar XML parser. Since pugixml is a non-validating parser, it might even work on the raw HTML without running HTML Tidy on it first, provided the markup happens to be well-formed (unclosed tags such as a bare <br> will likely still cause parse errors).
If you don't mind calling out to Python from C++, you could use Beautiful Soup. At least the name is right!
Seriously - it's a nice, no-nonsense HTML parser. I haven't tried calling out to it from C++, although it should be straightforward.
Yes, there is an HTML parser library for C++; check it out:
https://github.com/HamedMasafi/HtmlParser/
This library can parse HTML or CSS and convert it to a tree model. You can search the parsed HTML with methods like get_by_id, get_by_class_name and get_by_tag_name, and there is also a query method that lets you search via a CSS selector (only tag, id, class and nested child selectors are supported for now).
After finding a child you can change its attributes, and finally you can print the HTML into a std::string in compact or pretty mode.
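The workflow described here (look up an element, change an attribute, print the tree compactly or pretty-printed) can be sketched without any library. The code below is not the HtmlParser API; `Element`, `find_by_id`, `to_html` and `sample_doc` are hypothetical names made up for this example. It also assumes that only leaf elements carry text and does no escaping of attribute values or text.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type for this sketch only.
struct Element {
    std::string tag;
    std::map<std::string, std::string> attrs;  // sorted, so output is stable
    std::string text;                          // used only by leaf elements
    std::vector<std::unique_ptr<Element>> children;
};

// Depth-first search for the first element whose "id" attribute equals `id`.
Element* find_by_id(Element& e, const std::string& id) {
    auto it = e.attrs.find("id");
    if (it != e.attrs.end() && it->second == id) return &e;
    for (auto& child : e.children)
        if (Element* hit = find_by_id(*child, id)) return hit;
    return nullptr;
}

// Appends the markup for `e` to `out`; pretty mode indents by two spaces
// per level and ends each element with a newline.
void write(const Element& e, bool pretty, std::size_t depth, std::string& out) {
    const std::string pad = pretty ? std::string(depth * 2, ' ') : std::string();
    const std::string nl = pretty ? std::string("\n") : std::string();
    out += pad + "<" + e.tag;
    for (const auto& kv : e.attrs)
        out += " " + kv.first + "=\"" + kv.second + "\"";
    out += ">";
    if (e.children.empty()) {
        out += e.text + "</" + e.tag + ">" + nl;
        return;
    }
    out += nl;
    for (const auto& child : e.children) write(*child, pretty, depth + 1, out);
    out += pad + "</" + e.tag + ">" + nl;
}

// Serializes the tree rooted at `root` into a std::string.
std::string to_html(const Element& root, bool pretty) {
    std::string out;
    write(root, pretty, 0, out);
    return out;
}

// Hand-built stand-in for a parsed document: <div id="box"><p>hi</p></div>
std::unique_ptr<Element> sample_doc() {
    auto div = std::make_unique<Element>();
    div->tag = "div";
    div->attrs["id"] = "box";
    auto p = std::make_unique<Element>();
    p->tag = "p";
    p->text = "hi";
    div->children.push_back(std::move(p));
    return div;
}
```

A typical use would be to call `find_by_id(*doc, "box")`, set `attrs["class"]` on the result, and then call `to_html(*doc, true)` or `to_html(*doc, false)` depending on whether readable or compact output is wanted.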