NSXMLParser doesn't ignore CDATA

2019-08-04 17:44发布

问题:

im pretty new in ios development and im tryin to parse an RSS file(xml).

here is the xml: (sorry for the language)

<item>
<category> General < / category >
<title> killed in a tractor accident , was critically injured windsurfer </ title>
<description>
< ! [ CDATA [
<div> <a href='http://www.ynet.co.il/articles/0,7340,L-4360016,00.html'> <img src = 'http://www.ynet.co. il/PicServer3/2012/11/28/4302844/YOO_8879_a.jpg ' alt =' photo: Yaron Brener 'title =' Amona 'border = '0' width = '116 'height = '116'> </ a> < / div >
] ] >
Tractor driver in his 50s near Kfar Yuval flipped and trapped underneath . Room was critically injured windsurfer hurled rocks because of strong winds and wind surfer after was moderately injured in Netanya
< / description >
<link>
http://www.ynet.co.il/articles/0 , 7340, L- 4360016 , 00.html
< / link >
<pubDate> Fri, 22 Mar 2013 17:10:15 +0200 </ pubDate>
<guid>
http://www.ynet.co.il/articles/0 , 7340, L- 4360016 , 00.html
< / guid >
<tags> Kill , car accidents , surfing < / tags >
< / item >

and here is my xmlparser code:

    - (void)parserDidStartDocument:(NSXMLParser *)parser
    {
       self.titles = [[NSMutableArray alloc]init];
       self.descriptions = [[NSMutableArray alloc]init];
        self.links = [[NSMutableArray alloc]init];
    }

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict
{
    if ([elementName isEqualToString:@"item"]) {
        isItem = YES;
    }

    if ([elementName isEqualToString:@"title"]) {
        isTitle=YES;
        self.titlesString = [[NSMutableString alloc]init];
    }

    if ([elementName isEqualToString:@"description"]) {
        isDesription = YES;
        self.descriptionString = [NSMutableString string];
        self.data = [NSMutableData data];
    }



}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string{
    if(isItem && isTitle){
        [self.titlesString appendString:string];
    }
    if (isItem && isDesription) {
        if (self.descriptionString)
            [self.descriptionString appendString:string];
    }






}

- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock
{
    if (self.data)
        [self.data appendData:CDATABlock];

}


- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if ([elementName isEqualToString:@"item"]) {
        isItem = NO;
        [self.titles addObject:self.titlesString];

        [self.descriptions addObject:self.descriptionString];


    }

    if ([elementName isEqualToString:@"title"]) {
        isTitle=NO;

    }
    if ([elementName isEqualToString:@"description"]) {

        NSString *result = [self.descriptionString stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        NSLog(@"string=%@", result);


        if ([self.data length] > 0)
        {
            NSString *htmlSnippet = [[NSString alloc] initWithData:self.data encoding:NSUTF8StringEncoding];
            NSString *imageSrc = [self firstImgUrlString:htmlSnippet];
            NSLog(@"img src=%@", imageSrc);
            [self.links addObject:imageSrc];
        }



        self.descriptionString = nil;
        self.data = nil;
    }


}

- (NSString *)firstImgUrlString:(NSString *)string
{
    NSError *error = NULL;
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(<img\\s[\\s\\S]*?src\\s*?=\\s*?['\"](.*?)['\"][\\s\\S]*?>)+?"
                                                                           options:NSRegularExpressionCaseInsensitive
                                                                             error:&error];

    NSTextCheckingResult *result = [regex firstMatchInString:string
                                                     options:0
                                                       range:NSMakeRange(0, [string length])];

    if (result)
        return [string substringWithRange:[result rangeAtIndex:2]];

    return nil;
}

@end

Like I said I'm pretty new to iPhone development, I looked for ways to solve it for several hours but found nothing. I decided to open a topic, then a few questions:

One. The parser does not ignore what CDATA is just doing parse everything. Why is this happening? As you can see the description itself is not in cdata and I I have only the first step but I get the rest even when I'm not using foundCDATA: (NSData *) CDATABlock

Two. I want to take the image link, how to do it? I searched online and found a lot of guide explains only use the function foundCDATA: (NSData *) CDATABlock But how is it used? The way in which I used in the code?

Please I need an explanation so I can understand, thank you!

回答1:

In answer to your two questions:

  1. The parser will, if you have implemented foundCDATA, will parse the description CDATA in that method, and not in foundCharacters. If, on the other hand, you have not implemented foundCDATA, the CDATA will be parsed by foundCharacters. So, if you don't want foundCharacters to parse the CDATA, then you have to implement foundCDATA.

  2. If you want to extract the img URL, you have to parse the HTML you received somehow. You can use Hpple, but I might just be inclined to use a regular expression:

    - (NSString *)firstImgUrlString:(NSString *)string
    {
        NSError *error = NULL;
        NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(<img\\s[\\s\\S]*?src\\s*?=\\s*?['\"](.*?)['\"][\\s\\S]*?>)+?"
                                                                               options:NSRegularExpressionCaseInsensitive
                                                                                 error:&error];
    
        NSTextCheckingResult *result = [regex firstMatchInString:string
                                                         options:0
                                                           range:NSMakeRange(0, [string length])];
    
        if (result)
            return [string substringWithRange:[result rangeAtIndex:2]];
    
        return nil;
    }
    

    Also see this other Stack Overflow answer in which I demonstrate both Hpple and regex solutions:


As an example, here is the NSXMLParserDelegate methods that will parse the description, putting the text (excluding the CDATA) in one field, and putting the image URL from the CDATA in another variable. You'll have to modify to accommodate your process, but hopefully this gives you the basic idea:

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict
{
    if ([elementName isEqualToString:@"description"])
    {
        self.string = [NSMutableString string];
        self.data = [NSMutableData data];
    }
}

- (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError
{
    NSLog(@"%s, parseError=%@", __FUNCTION__, parseError);
}

// In my standard NSXMLParser routine, I leave self.string `nil` when not parsing 
// a particular element, and initialize it if I am parsing. I do it this way
// so only my `didStartElement` and `didEndElement` need to worry about the particulars
// and my `foundCharacters` and `foundCDATA` are simplified. But do it however you
// want.

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (self.string)
        [self.string appendString:string];
}

- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock
{
    if (self.data)
        [self.data appendData:CDATABlock];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if ([elementName isEqualToString:@"description"])
    {
        // get the text (non-CDATA) portion

        // you might want to get rid of the leading and trailing whitespace

        NSString *result = [self.string stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        NSLog(@"string=%@", result);

        // get the img out of the CDATA

        if ([self.data length] > 0)
        {
            NSString *htmlSnippet = [[NSString alloc] initWithData:self.data encoding:NSUTF8StringEncoding];
            NSString *imageSrc = [self firstImgUrlString:htmlSnippet];
            NSLog(@"img src=%@", imageSrc);
        }

        // once I've saved the data where I want to save it, I `nil` out my
        // `string` and `data` properties:

        self.string = nil;
        self.data = nil;
    }
}


回答2:

Answer 1: I will go along with the answer given by Rob for this question.

Answer 2: Just try this to get the image link....

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict
{   
    if([currentElement isEqualToString:@"img"]) {
        NSLog(@"%@",[attributeDict objectForKey:@"src"]);
    }
}


回答3:

The image link your are looking to extract is inside a CDATA block, but rss parser ignores CDATA block.

If you need to extract the string from CDATA, you could use this block in foundCDATA:

    - (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock
    {

    NSMutableString *cdstring = [[NSMutableString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];
    }

now the mutablestring "cdstring" will be containing:

    <div>
    <a href='http://www.ynet.co.il/articles/0,7340,L-4360016,00.html'>
    <img src='http://www.ynet.co. il/PicServer3/2012/11/28/4302844/YOO_8879_a.jpg ' alt=' photo: Yaron Brener ' title=' Amona ' border='0' width='116 ' height='116'>
    </ a>
    </ div>
    ]]>

now just search for href=' using stringcontainsstring and extract the link or use a webview to just display

 htmlContent = [NSString stringWithFormat:@"%@", cdstring];
    [webView loadHTMLString:htmlContent baseURL:nil];