Regexp in C - match group

Posted 2019-05-19 09:54

Question:

I've been struggling with regular expressions in C (just /usr/include/regex.h).


I have (let's say) hundreds of regexps, and one of them can match the input string. Currently I handle this (with generated code, actually) as hundreds of do-while blocks, each with a match inside that breaks out on failure so the next block can try. One by one:

do {
    if ( regex_match(str, my_regex1) != MY_REGEX_SUCCESS ) DO_FAIL; //break
    ...
    if ( sscanf(str, " %d.%d.%d.%d / %d ", &___ip1, &___ip2, &___ip3, &___ip4, &___pref) != 5 ) DO_FAIL; //break
    ...
} while (0);

do {
    if ( regex_match(str, my_regex2) != MY_REGEX_SUCCESS ) DO_FAIL; //break
    ...
    ...
} while (0);

do {
    if ( regex_match(str, my_regex3) != MY_REGEX_SUCCESS ) DO_FAIL; //break
    ...
    ...
} while (0);

What I'd like to have is something like:

const char * match1 = "^([[:space:]]*)([$]([._a-zA-Z0-9-]{0,118})?[._a-zA-Z0-9])([[:space:]]*)$";
const char * match2 = "^([[:space:]]*)(target|origin)([[:space:]]*):([[:space:]]*)([$]([._a-zA-Z0-9-]{0,118})?[._a-zA-Z0-9])([[:space:]]*):([[:space:]]*)\\*([[:space:]]*)$";
const char * match3 = "^([[:space:]]*)(target|origin)([[:space:]]*):([[:space:]]*)([$]([._a-zA-Z0-9-]{0,118})?[._a-zA-Z0-9])([[:space:]]*)/([[:space:]]*)(([0-2]?[0-9])|(3[0-2]))([[:space:]]*):([[:space:]]*)(([1-9][0-9]{0,3})|([1-5][0-9]{4})|(6[0-4][0-9]{3})|(65[0-4][0-9]{2})|(655[0-2][0-9])|(6553[0-5]))([[:space:]]*)$";
char * my_match;
asprintf(&my_match, "(%s)|(%s)|(%s)", match1, match2, match3);


int num_gr = give_me_number_of_regex_group(str, my_match);
switch (num_gr) {
    ...
}

but I have no idea how to do that.

Any suggestions?
Thanks!

Answer 1:

I assume your regex_match is some combination of regcomp and regexec. To enable grouping, you need to call regcomp with the REG_EXTENDED flag, but without the REG_NOSUB flag (in the third argument).

regex_t compiled;
regcomp(&compiled, "(match1)|(match2)|(match3)", REG_EXTENDED);

Then allocate space for the groups. The number of parenthesized groups is stored in compiled.re_nsub; add one for group 0, which holds the whole match, and pass that count to regexec:

size_t ngroups = compiled.re_nsub + 1;
regmatch_t *groups = malloc(ngroups * sizeof(regmatch_t));
regexec(&compiled, str, ngroups, groups, 0);
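Putting the three steps together with basic error checking might look like the sketch below. The helper name extract_group and its return codes are assumptions made for illustration and are not part of the original answer; only regcomp, regexec, regerror and regfree come from regex.h.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies the text of capture group `group` into `out`.
 * Returns 0 on success, 1 if the string or that group did not match,
 * and -1 on a compile or allocation error. */
int extract_group(const char *pattern, const char *str, size_t group,
                  char *out, size_t outsz)
{
    regex_t compiled;
    int rc = regcomp(&compiled, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &compiled, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }

    /* re_nsub counts parenthesized groups; +1 for group 0 (the whole match). */
    size_t ngroups = compiled.re_nsub + 1;
    regmatch_t *groups = malloc(ngroups * sizeof *groups);
    if (groups == NULL) {
        regfree(&compiled);
        return -1;
    }

    int result = 1;
    if (outsz > 0 && group < ngroups &&
        regexec(&compiled, str, ngroups, groups, 0) == 0 &&
        groups[group].rm_so != -1) {
        size_t len = (size_t)(groups[group].rm_eo - groups[group].rm_so);
        if (len >= outsz)
            len = outsz - 1;            /* truncate to fit the buffer */
        memcpy(out, str + groups[group].rm_so, len);
        out[len] = '\0';
        result = 0;
    }

    free(groups);
    regfree(&compiled);
    return result;
}
```

Compiling on every call keeps the sketch short; real code that matches many strings would compile once and reuse the regex_t.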

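A group that did not take part in the match has -1 in both its rm_so and rm_eo fields. Those fields have the signed type regoff_t, so compare them against -1 directly. The loop below finds the first such group:

size_t nmatched;
for (nmatched = 0; nmatched < ngroups; nmatched++)
    if (groups[nmatched].rm_so == -1)
        break;

nmatched is the index of the first group that did not participate. With alternation, a later group can still be set after an unset one, so to find out which alternative matched, test the group that wraps each alternative. Add your own error checking.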
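To map a match back to one of the original alternatives, the group number of each wrapping parenthesis has to be known, and nested groups shift those numbers. One way to get them, sketched here under the assumption that each alternative compiles on its own, is to compile every alternative separately, read its re_nsub, and accumulate the offsets. The function which_alternative and its return codes are hypothetical names introduced for this example.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the 0-based index of the alternative that
 * matched str, -1 if none matched, -2 on a compile or allocation error.
 * Assumes n >= 1. */
int which_alternative(const char *str, const char *const *alts, size_t n)
{
    int result = -2;
    size_t *top = malloc(n * sizeof *top);  /* group number wrapping each alt */
    char *combined = NULL;
    regmatch_t *groups = NULL;
    regex_t re;
    size_t next = 1, len = 1, i;

    if (top == NULL)
        return -2;

    /* Pass 1: count the groups inside each alternative. */
    for (i = 0; i < n; i++) {
        if (regcomp(&re, alts[i], REG_EXTENDED) != 0)
            goto done;
        top[i] = next;
        next += 1 + re.re_nsub;             /* wrapper + nested groups */
        len += strlen(alts[i]) + 3;         /* "(", ")" and "|" */
        regfree(&re);
    }

    /* Pass 2: build "(alt0)|(alt1)|..." and match once. */
    combined = malloc(len);
    if (combined == NULL)
        goto done;
    combined[0] = '\0';
    for (i = 0; i < n; i++) {
        strcat(combined, i ? "|(" : "(");
        strcat(combined, alts[i]);
        strcat(combined, ")");
    }
    if (regcomp(&re, combined, REG_EXTENDED) != 0)
        goto done;
    groups = malloc((re.re_nsub + 1) * sizeof *groups);
    if (groups != NULL) {
        result = -1;
        if (regexec(&re, str, re.re_nsub + 1, groups, 0) == 0)
            for (i = 0; i < n; i++)
                if (groups[top[i]].rm_so != -1) {
                    result = (int)i;
                    break;
                }
    }
    regfree(&re);
done:
    free(groups);
    free(combined);
    free(top);
    return result;
}
```

The returned index can then drive the switch statement from the question. With hundreds of alternatives the combined pattern gets large, so it is worth measuring against the one-by-one approach.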



Answer 2:

You could keep your regexps in an array of strings and test each one in turn.

//count is the number of regexps provided
int give_me_number_of_regex_group(const char *needle,const char** regexps, int count ){
  for(int i = 0; i < count; ++i){
    if(regex_match(needle, regexps[i]) == MY_REGEX_SUCCESS){
      return i;
    }
  }
  return -1; //didn't match any
}
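Since regex_match is the asker's own wrapper, here is a rough equivalent written directly against regex.h. It assumes the patterns are compiled once up front and that only a yes/no answer is needed, hence REG_NOSUB. The names compile_all and first_match are made up for this sketch.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles n patterns into out[].
 * Returns 0 on success; on failure frees what was compiled and returns -1. */
int compile_all(regex_t *out, const char *const *patterns, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (regcomp(&out[i], patterns[i], REG_EXTENDED | REG_NOSUB) != 0) {
            while (i > 0)
                regfree(&out[--i]);
            return -1;
        }
    }
    return 0;
}

/* Returns the index of the first compiled pattern that matches str,
 * or -1 if none of them does. */
int first_match(const regex_t *compiled, size_t n, const char *str)
{
    for (size_t i = 0; i < n; i++)
        if (regexec(&compiled[i], str, 0, NULL, 0) == 0)
            return (int)i;
    return -1;
}
```

The caller is responsible for calling regfree on each element once the array is no longer needed.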

Or am I overlooking something?



Answer 3:

"I have (let's say) hundreds of regexps ..."

It looks like you are trying to compare the dotted-quad parts of IP addresses. In general, running that many regexes against a single target and stopping at the first match is a red flag.

Example: which group will correctly match first?
target ~ 'American', pattern ~ /(Ame)|(Ameri)|(American)/
This does not even include quantifiers in the subgroups.
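For the POSIX functions in regex.h the question can be checked directly. POSIX specifies leftmost-longest matching for the overall match, so a conforming regexec should report the third group for this input, whereas Perl-style engines take the first alternative that succeeds. The helper first_set_group below is a hypothetical name, and the fixed limit of 16 groups is an assumption chosen to keep the sketch short.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the number (1-based) of the lowest-numbered
 * capture group that took part in the match, 0 if the pattern matched
 * without any group being set, or -1 on no match or compile error. */
int first_set_group(const char *pattern, const char *str)
{
    enum { MAX_GROUPS = 16 };               /* assumed upper bound */
    regex_t re;
    regmatch_t groups[MAX_GROUPS];
    int result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, str, MAX_GROUPS, groups, 0) == 0) {
        result = 0;
        for (size_t i = 1; i < MAX_GROUPS && i <= re.re_nsub; i++) {
            if (groups[i].rm_so != -1) {    /* group i participated */
                result = (int)i;
                break;
            }
        }
    }

    regfree(&re);
    return result;
}
```

Calling first_set_group("(Ame)|(Ameri)|(American)", "American") shows which alternative the local implementation picks, which is worth verifying before relying on alternation order.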

If the regexes are all built from one constant form, for instance fixed-format data, it might be better to use C's string functions to split the data out of that form into an array, then compare the array items with the target. Plain C string handling is much faster for this than regexes.
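As a rough illustration of that idea for the "a.b.c.d / prefix" form that appears in the question's sscanf call, the sketch below parses and range-checks the fields without any regex. The function name parse_cidr is hypothetical. Note that %d also accepts a sign and leading whitespace before each number, so this is looser than a strict pattern.

```c
#include <stdio.h>

/* Hypothetical helper: parses "a.b.c.d / prefix" with optional surrounding
 * whitespace. Returns 0 on success, -1 on malformed or out-of-range input. */
int parse_cidr(const char *s, int octets[4], int *prefix)
{
    int end = -1;

    /* %n records how many characters were consumed, so trailing
     * garbage after the prefix can be detected. */
    int n = sscanf(s, " %d.%d.%d.%d / %d %n",
                   &octets[0], &octets[1], &octets[2], &octets[3],
                   prefix, &end);
    if (n != 5 || end < 0 || s[end] != '\0')
        return -1;

    for (int i = 0; i < 4; i++)
        if (octets[i] < 0 || octets[i] > 255)
            return -1;

    if (*prefix < 0 || *prefix > 32)
        return -1;

    return 0;
}
```

A stricter variant could walk the string with strtol and check each delimiter by hand, at the cost of a few more lines.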