Remove HTML Tags from an NSString on the iPhone

Posted 2018-12-31 09:09

There are a couple of different ways to remove HTML tags from an NSString in Cocoa.

One way is to render the string into an NSAttributedString and then grab the rendered text.

Another way is to use NSXMLDocument's -objectByApplyingXSLTString method to apply an XSLT transform that does it.

Unfortunately, the iPhone doesn't support NSAttributedString or NSXMLDocument. There are too many edge cases and malformed HTML documents for me to feel comfortable using regex or NSScanner. Does anyone have a solution to this?

One suggestion has been to simply look for opening and closing tag characters, but this method won't work except in very trivial cases.

For example, these cases (from the Perl Cookbook chapter on the same subject) would break this method:

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
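To make the failure concrete, here is a minimal sketch. It assumes the "look for tag characters" approach is written as the regular expression <[^>]+>, and the helper name StripWithNaiveRegex is hypothetical, introduced only for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: delete every match of <[^>]+> in one pass.
// Each match runs from a '<' to the first '>' that follows it,
// whether or not that '>' sits inside a quoted attribute value.
static NSString *StripWithNaiveRegex(NSString *input) {
    return [input stringByReplacingOccurrencesOfString:@"<[^>]+>"
                                            withString:@""
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, input.length)];
}

// StripWithNaiveRegex(@"<IMG SRC = \"foo.gif\" ALT = \"A > B\">")
//   returns @" B\">"  (the quoted '>' ended the match too early)
// StripWithNaiveRegex(@"<script>if (a<b && a>c)</script>")
//   returns @"if (ac)"  (the comparison operators were treated as a tag)
```

Neither result is what a real HTML renderer would produce, which is why the simple approach is only safe for markup you control.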

22 Answers
刘海飞了
#2 · 2018-12-31 09:16

I followed the accepted answer by m.kocikowski and modified it slightly to use an autorelease pool that cleans up all of the temporary strings created by stringByReplacingCharactersInRange:withString:.

The header comment for this method states: /* Replace characters in range with the specified string, returning new string. */

So, depending on the length of your XML, you may be creating a large pile of new autoreleased strings that are not cleaned up until the enclosing autorelease pool drains. If you are unsure when that will happen, or if a user action could trigger many calls to this method before then, you can wrap the work in an @autoreleasepool block. These blocks can be nested and used within loops.

Apple's reference on @autoreleasepool states: "If you write a loop that creates many temporary objects. You may use an autorelease pool block inside the loop to dispose of those objects before the next iteration. Using an autorelease pool block in the loop helps to reduce the maximum memory footprint of the application." I have not used it inside the loop, but at least this method now cleans up after itself.

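- (NSString *)stringByStrippingHTML {
    NSString *retVal;
    @autoreleasepool {
        NSRange r;
        NSString *s = [[self copy] autorelease];
        while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound) {
            s = [s stringByReplacingCharactersInRange:r withString:@""];
        }
        // Take ownership so the result survives the pool drain.
        retVal = [s copy];
    }
    // The pool has drained: s and all temporary strings created by
    // stringByReplacingCharactersInRange:withString: have been released.
    // This is manual reference counting, so balance the copy above;
    // returning a +1 object from this method would leak.
    return [retVal autorelease];
}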
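For comparison, here is a rough sketch of the variant Apple's note describes, with the pool placed inside the loop so that each intermediate string can be released before the next iteration. It assumes ARC (unlike the manual reference counting code above), and the function name StripTagsPooled is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, assuming ARC: the strong local `s` keeps the latest
// string alive while each per-iteration pool disposes of the temporaries.
static NSString *StripTagsPooled(NSString *input) {
    NSString *s = [input copy];
    BOOL found = YES;
    while (found) {
        @autoreleasepool {
            NSRange r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch];
            found = (r.location != NSNotFound);
            if (found) {
                // Assigning to `s` releases the previous intermediate string.
                s = [s stringByReplacingCharactersInRange:r withString:@""];
            }
        }
    }
    return s;
}
```

Whether the extra pool per iteration is worth it depends on input size; for short strings, a single pool around the whole loop is probably enough.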
泪湿衣
#3 · 2018-12-31 09:17

Use this:

NSString *myregex = @"<[^>]*>"; // regex matching any HTML tag

NSString *htmlString = @"<html>bla bla</html>";
NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:myregex withString:@""];

Don't forget to add #import "RegexKitLite.h" to your code. RegexKitLite can be downloaded here: http://regexkit.sourceforge.net/#Downloads

柔情千种
#4 · 2018-12-31 09:19

I would imagine the safest way would just be to parse for angle brackets, no? Loop through the entire string and copy anything not enclosed in < and > to a new string.
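A minimal sketch of that loop might look like the following. The function name CopyOutsideAngleBrackets is hypothetical, and the code treats every '<' as the start of a tag, which is an assumption that real HTML does not honor.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: copy every UTF-16 unit that is not between
// a '<' and the next '>' into a new string.
static NSString *CopyOutsideAngleBrackets(NSString *input) {
    NSMutableString *output = [NSMutableString stringWithCapacity:input.length];
    BOOL insideTag = NO;
    for (NSUInteger i = 0; i < input.length; i++) {
        unichar c = [input characterAtIndex:i];
        if (c == '<') {
            insideTag = YES;               // start skipping
        } else if (c == '>' && insideTag) {
            insideTag = NO;                // tag closed, resume copying
        } else if (!insideTag) {
            [output appendString:[NSString stringWithCharacters:&c length:1]];
        }
    }
    return output;
}

// CopyOutsideAngleBrackets(@"<b>Hello</b> World!!") returns @"Hello World!!"
// CopyOutsideAngleBrackets(@"<IMG SRC = \"foo.gif\" ALT = \"A > B\">")
//   returns @" B\">"  (a '>' inside a quoted attribute ends the tag early)
```

As the second call shows, this handles simple markup but trips over the edge cases listed in the question.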

长期被迫恋爱
#5 · 2018-12-31 09:19

If you are willing to use the Three20 framework, it has a category on NSString that adds a stringByRemovingHTMLTags method. See NSStringAdditions.h in the Three20Core subproject.

唯独是你
#6 · 2018-12-31 09:23

This NSString category uses NSXMLParser to accurately remove any HTML tags from an NSString. It consists of a single .m file and a single .h file that can easily be added to your project.

https://gist.github.com/leighmcculloch/1202238

You then strip HTML by doing the following:

Import the header:

#import "NSString_stripHtml.h"

And then call stripHtml:

NSString* mystring = @"<b>Hello</b> World!!";
NSString* stripped = [mystring stripHtml];
// stripped will be = Hello World!!

This also works with malformed HTML that technically isn't XML.
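For readers who want to see the general idea without pulling in the gist, here is a rough sketch of an NSXMLParser-based approach. It is not the gist's code: the class name TagStrippingDelegate, the function name StripTagsWithXMLParser, and the wrapping <root> element are all assumptions made for illustration, and the sketch assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that collects the text nodes the parser reports.
@interface TagStrippingDelegate : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TagStrippingDelegate
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// Called for character data between tags; may fire several times per node.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical helper: wrap the fragment in one root element so the parser
// sees a single document, then return whatever text it reported.
static NSString *StripTagsWithXMLParser(NSString *html) {
    NSString *wrapped = [NSString stringWithFormat:@"<root>%@</root>", html];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TagStrippingDelegate *delegate = [[TagStrippingDelegate alloc] init];
    parser.delegate = delegate;   // the parser does not retain its delegate
    [parser parse];               // return value ignored; keep any text found
    return [delegate.text copy];
}

// StripTagsWithXMLParser(@"<b>Hello</b> World!!") returns @"Hello World!!"
```

In this sketch the parser stops at the first fatal error, so input that is not well-formed XML yields only the text found up to that point; how the gist copes with such input is not shown here.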

临风纵饮
#7 · 2018-12-31 09:24

You can use it as shown below:

- (void)myMethod
{
    NSString *htmlStr = @"<some>html</string>";
    NSString *strWithoutFormatting = [self stringByStrippingHTML:htmlStr];
}

- (NSString *)stringByStrippingHTML:(NSString *)str
{
    NSRange r;
    while ((r = [str rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
    {
        str = [str stringByReplacingCharactersInRange:r withString:@""];
    }
    return str;
}