parsing HTML on the iPhone [closed]

Posted 2018-12-31 15:13

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

9 answers
浪荡孟婆
#2 · 2018-12-31 15:52

How about using the WebKit component, possibly with a third-party package such as jQuery, for tasks like this? Wouldn't it be possible to load the HTML into an invisible web view and take advantage of the very mature selectors of the JavaScript frameworks?

有味是清欢
#3 · 2018-12-31 16:00

We use Convertigo to parse the HTML on the server side and return clean JSON from a web service to our mobile apps.

临风纵饮
#4 · 2018-12-31 16:02

Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:

This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I'm probably going to use that.

临风纵饮
#5 · 2018-12-31 16:03

This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.

倾城一夜雪
#6 · 2018-12-31 16:06

I found hpple quite useful for parsing messy HTML. The hpple project is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.

Requirements:

-Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

-Add the libxml2 library to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new search flag "-lxml2"

-From hpple get the following source code files and add them to your project:

  1. TFHpple.h
  2. TFHpple.m
  3. TFHppleElement.h
  4. TFHppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

-Work through the W3Schools XPath Tutorial to get comfortable with the XPath language.

Code Example

#import "TFHpple.h"

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

// Create the parser
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

// Get all the cells of the 2nd row of the 3rd table
NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

// Access the first cell, guarding against an empty result
if ([elements count] > 0) {
    TFHppleElement *element = [elements objectAtIndex:0];

    // Get the text within the cell tag
    NSString *content = [element content];
    NSLog(@"%@", content);
}

// Manual reference counting; omit these two calls under ARC
[xpathParser release];
[data release];

Known issues

Because hpple is a wrapper over XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery code.

浪荡孟婆
#7 · 2018-12-31 16:06

Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it at http://code.google.com/p/gdata-objectivec-client/. I've used it for handling messaging via Jabber. Of course, if your HTML is malformed (missing closing tags, for example), this might not help much.
