Capture groups not working in NSRegularExpression

Posted 2019-01-16 18:53

Question:

Why is this code only spitting out the entire regex match instead of the capture group?

Input

@"A long string containing Name:</td><td>A name here</td> amongst other things"

Output expected

A name here

Actual output

Name:</td><td>A name here</td>

Code

NSString *htmlString = @"A long string containing Name:</td><td>A name here</td> amongst other things";
NSRegularExpression *nameExpression = [NSRegularExpression regularExpressionWithPattern:@"Name:</td>.*\">(.*)</td>" options:0 error:nil];

NSArray *matches = [nameExpression matchesInString:htmlString
                                  options:0
                                    range:NSMakeRange(0, [htmlString length])];
for (NSTextCheckingResult *match in matches) {
    NSRange matchRange = [match range];
    NSString *matchString = [htmlString substringWithRange:matchRange];
    NSLog(@"%@", matchString);
}

The code is taken from the Apple docs. I know there are other libraries that do this, but I want to stick with what's built in for this task.

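Answer 1:

[match range] returns the range of the entire match. To get the range of the first capture group, use rangeAtIndex:1 (index 0 is the whole match):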

for (NSTextCheckingResult *match in matches) {
    //NSRange matchRange = [match range];
    NSRange matchRange = [match rangeAtIndex:1];
    NSString *matchString = [htmlString substringWithRange:matchRange];
    NSLog(@"%@", matchString);
}
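For readers working in Swift, the same idea might be sketched as follows. This is only an illustration under some assumptions: the pattern is simplified so that it matches the sample input from the question (which contains no `">` sequence), and `firstGroup(of:in:)` is a hypothetical helper name, not part of Foundation.

```swift
import Foundation

let htmlString = "A long string containing Name:</td><td>A name here</td> amongst other things"

// Assumed pattern, simplified to fit the sample input; (.*?) is capture group 1.
let namePattern = "Name:</td><td>(.*?)</td>"

// Hypothetical helper: returns the text of capture group 1 of the first match, or nil.
func firstGroup(of pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return nil }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: fullRange) else { return nil }
    // range(at: 0) is the whole match; range(at: 1) is the first capture group.
    guard match.numberOfRanges > 1,
          let groupRange = Range(match.range(at: 1), in: text) else { return nil }
    return String(text[groupRange])
}

let capturedName = firstGroup(of: namePattern, in: htmlString)
print(capturedName ?? "no match")  // prints "A name here"
```

Converting with Range(_:in:) avoids mixing String indices with the UTF-16 offsets that NSRange uses.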


Answer 2:

Don't parse HTML with regular expressions or NSScanner. Down that path lies madness.

This has been asked many times on SO; see, for example, "parsing HTML on the iPhone".

The data I am picking out is as simple as <td>Name: A name</td>, and I think it's simple enough to just use regular expressions instead of including a full-blown HTML parser in the project.

That's up to you, and I'm a strong advocate of "first to market has huge advantage".

The difference being that with a proper HTML parser, you are considering the structure of the document. Using regular expressions, you are relying on the document never changing format in ways that are syntactically otherwise perfectly valid.

I.e. what if the input were <td class="name">Name: A name</td>? Your regex parser just broke on input that is both valid HTML and, from a tag contents perspective, identical to the original input.
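A small Swift sketch can make the brittleness concrete. The pattern and the `extractName(from:)` helper are assumptions made up for this illustration; they stand in for whatever regex the project would actually use.

```swift
import Foundation

// Assumed pattern for the simple cell described above; group 1 is the name.
let cellPattern = "<td>Name: (.*?)</td>"

// Hypothetical helper: returns capture group 1 of the first match, or nil.
func extractName(from html: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: cellPattern, options: []) else { return nil }
    let fullRange = NSRange(html.startIndex..<html.endIndex, in: html)
    guard let match = regex.firstMatch(in: html, options: [], range: fullRange),
          let nameRange = Range(match.range(at: 1), in: html) else { return nil }
    return String(html[nameRange])
}

let plainCell = "<td>Name: A name</td>"
let classedCell = "<td class=\"name\">Name: A name</td>"

print(extractName(from: plainCell) ?? "no match")    // prints "A name"
print(extractName(from: classedCell) ?? "no match")  // prints "no match"
```

Both inputs are valid HTML with the same cell contents, yet only the first one matches, because the pattern hard-codes the exact spelling of the opening tag.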



Answer 3:

HTML isn't a regular language and can't be properly parsed using regular expressions. Here's a classic SO answer explaining this common programmer misconception.



Answer 4:

In Swift 3:

//: Playground - noun: a place where people can play

import Foundation

/// Two groups. 1: [A-Z]+, 2: [0-9]+
let pattern = "([A-Z]+)([0-9]+)"

let regex = try NSRegularExpression(pattern: pattern, options: [.caseInsensitive])

let str = "AA01B2C3DD4"
// NSRange counts UTF-16 code units, so take the length from NSString
// rather than from str.characters.count.
let nsStr = str as NSString
let results = regex.matches(in: str, options: [], range: NSMakeRange(0, nsStr.length))

for a in results {

    let c = a.numberOfRanges  //< Whole match plus two groups: 3
    print(c)

    let m0 = a.rangeAt(0)  //< Whole match, ex: 'AA01'
    let m1 = a.rangeAt(1)  //< Group 1: letters, ex: 'AA'
    let m2 = a.rangeAt(2)  //< Group 2: digits, ex: '01'
    // let m3 = a.rangeAt(3) //< Out of bounds: raises a runtime exception
    // Note: rangeAt(_:) was renamed range(at:) in Swift 4.

    let s = nsStr.substring(with: m2)
    print(s)
}
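In Swift 4 and later the accessor is spelled range(at:), and the NSRange values can be converted to String ranges directly. The following is a rough sketch along those lines, not a canonical implementation; `captureGroups(pattern:in:)` is a hypothetical helper name introduced only for this example.

```swift
import Foundation

/// Hypothetical helper: for every match, returns the text of each range
/// (index 0 = whole match, 1... = capture groups).
/// A group that did not take part in the match is reported as nil.
func captureGroups(pattern: String, in text: String) throws -> [[String?]] {
    let regex = try NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).map { match in
        (0..<match.numberOfRanges).map { index in
            // Range(_:in:) yields nil when the NSRange location is NSNotFound.
            Range(match.range(at: index), in: text).map { String(text[$0]) }
        }
    }
}

let groups = try captureGroups(pattern: "([A-Z]+)([0-9]+)", in: "AA01B2C3DD4")
for matchGroups in groups {
    print(matchGroups)
}
```

Iterating up to numberOfRanges keeps the index in bounds, which sidesteps the runtime exception that an out-of-range index would raise.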


Answer 5:

Or just use

[htmlString firstMatchedGroupWithRegex:@"Name:</td>.*\">(.*)</td>"]

from this category https://github.com/damienromito/NSString-Matcher