Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.
Does such a library exist, or am I better off just trying to use regular expressions?
How about using the WebKit component, possibly with third-party packages such as jQuery, for tasks like these? Wouldn't it be possible to load the HTML in an invisible component and take advantage of the very mature selectors of the JavaScript frameworks?
We use Convertigo to parse HTML on the server side and return clean JSON from web services to our mobile apps.
Looks like libxml2.2 comes in the SDK, and the header comment in libxml/HTMLparser.h claims it can cope with HTML that doesn't validate. That sounds like what I need, so I'm probably going to use that.
This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.
I found hpple quite useful for parsing messy HTML. The hpple project is an Objective-C wrapper on the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.
Requirements:
-Add the libxml2 includes to your project
-Add the libxml2 library to your project
-From hpple, get the source code files and add them to your project
-Work through the W3Schools XPath Tutorial to get comfortable with the XPath language.
Code Example
Known issues
As hpple is a wrapper over XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery code.
Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here: http://code.google.com/p/gdata-objectivec-client/. I've used it for handling messaging via Jabber. Of course, if your HTML is malformed (for example, missing closing tags), this might not help much.