How to get all captures of subgroup matches with p

Posted 2020-01-26 08:52

Question:

Update/Note:

I think what I'm probably looking for is to get the captures of a group in PHP.

Referenced: PCRE regular expressions using named pattern subroutines.

(Read carefully:)


I have a string that contains a variable number of segments (simplified):

$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well

I would now like to match the segments and return them via the matches array:

$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);

This will only return the last match for the capture group 2: DD.
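
The behaviour is easy to reproduce. A minimal sketch using the simplified $subject and $pattern from above; only the comments are new:

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

// The anchored pattern matches the whole subject exactly once.
$result = preg_match_all($pattern, $subject, $matches);

// $result is 1: there is one overall match.
// $matches[0] holds the full match, 'AA BB DD '.
// $matches[1] and $matches[2] hold only the final iteration of each
// repeated group: 'DD ' and 'DD'. 'AA' and 'BB' are not reported.
print_r($matches);
```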

Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

This question is a generalization.

Both the $subject and $pattern are simplified. Naturally, with such a simple subject the list AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.

But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.

For a real-life case, imagine you have multiple (nested) levels with a variable number of subpattern matches.

Example

This is an example in pseudo code to describe a bit of the background. Imagine the following:

Regular definitions of tokens:

   CHARS := [a-z]+
   PUNCT := [.,!?]
   WS := [ ]

$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).

That array is then transformed into a string, containing one character per token:

   CHARS -> "c"
   PUNCT -> "p"
   WS -> "s"

So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.

   regex: (cs)?cp

to express one or two groups of chars followed by a punctuation.
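
The tokenization step described above could be sketched roughly as follows. The function name tokenize() and the array layout of a token are assumptions made for this illustration, not the actual code; only the three token definitions and the c/p/s mapping come from the text.

```php
<?php
// Hypothetical helper (not the original code): splits $subject into CHARS,
// PUNCT and WS tokens and builds the one-character-per-token string.
function tokenize(string $subject): array
{
    $tokens = [];
    $stream = '';
    // One alternative per token definition; offsets are recorded as well.
    preg_match_all('/[a-z]+|[.,!?]|[ ]/', $subject, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as [$text, $offset]) {
        if ($text === ' ') {
            $type = 's';                       // WS
        } elseif (strpbrk($text, '.,!?') !== false) {
            $type = 'p';                       // PUNCT
        } else {
            $type = 'c';                       // CHARS
        }
        $tokens[] = ['type' => $type, 'offset' => $offset, 'text' => $text];
        $stream .= $type;
    }
    // Characters outside the three definitions are silently skipped here.
    return [$tokens, $stream];
}

[$tokens, $stream] = tokenize('hello world!');
echo $stream, "\n";                            // cscp
var_dump(preg_match('/^(cs)?cp$/', $stream));  // int(1)
```

Because every token is exactly one character in the stream, a match offset in the stream is also an index into the $tokens array.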

As I can now express self-defined tokens as regex, the next step was to build the grammar. This is only an example, in a sort of ABNF style:

   words = word | (word space)+ word
   word = CHARS+
   space = WS
   punctuation = PUNCT

If I now compile the grammar for words into a (token) regex, I would naturally like to have all subgroup matches of each word.

  words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+)    # words resolved to tokens
  words = (c+)|((c+)s)+(c+)                       # words resolved to regex

I was able to code up to this point. Then I ran into the problem that the sub-group matches only contained their last match.
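
A small sketch of that problem on a token string. The ^ and $ anchors and the outer (?:...) are additions for this demo, and the final word is captured in its own group, as in the token form of the rule:

```php
<?php
$stream  = 'cscsc';                      // token string for three words
$pattern = '/^(?:(c+)|((c+)s)+(c+))$/';  // "words" regex, anchored for the demo

preg_match($pattern, $stream, $m, PREG_OFFSET_CAPTURE);

// Group 3, the (c+) inside the repetition, reports only its last iteration:
// the 'c' at offset 2. The first word at offset 0 is not available.
print_r($m[3]);
// Group 4 is the final word, the 'c' at offset 4.
print_r($m[4]);
```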

So my options are either to build an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.

That's basically all. Probably now it's understandable why I simplified the question.


Related:

  • pcrepattern man page
  • Get repeated matches with preg_match_all()

Answer 1:

Try this:

preg_match_all("'[^ ]+'i",$text,$n);

$n[0] will contain an array of all non-space character groups in the text.

Edit: with subgroups:

preg_match_all("'([^ ]+)'i",$text,$n);

Now $n[1] will contain the subgroup matches, which are exactly the same as $n[0], so this is actually pointless.

Edit2: nested subgroups example:

$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);

And the result:

Array
(
    [0] => Array
        (
            [0] => Hello I'm Joe!
            [1] => Hi I'm Jane!
        )

    [1] => Array
        (
            [0] => Hello
            [1] => Hi
        )

    [2] => Array
        (
            [0] => ello
            [1] => i
        )

    [3] => Array
        (
            [0] => Joe
            [1] => Jane
        )

)


Answer 2:

Similar thread: Get repeated matches with preg_match_all()

Check the chosen answer there. Mine might be useful as well, so I will duplicate it here:

From http://www.php.net/manual/en/regexp.reference.repetition.php :

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.

Personally, I gave up and am going to do this in two steps.
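
A minimal sketch of what two steps could look like, assuming the simplified $subject from the question: first validate the overall shape, then collect the segments with a simpler pattern.

```php
<?php
$subject = 'AA BB DD ';

// Step 1: validate the overall shape of the subject.
// Step 2: only if it is valid, collect every segment with a simpler pattern.
$segments = [];
if (preg_match('/^(?:[a-z]+ )+$/i', $subject)) {
    preg_match_all('/[a-z]+/i', $subject, $matches);
    $segments = $matches[0];
}

print_r($segments);   // AA, BB, DD
```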

EDIT:

I see that in that other thread someone claimed that a lookbehind method is able to do it.
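
I have not reproduced the lookbehind pattern from that thread here. A different single-call technique that PCRE does support is the \G anchor, which only matches where the previous match ended. The following is a sketch for the simplified $subject from the question, not a general solution for nested groups:

```php
<?php
$subject = 'AA BB DD ';

// First alternative: at the start of the subject, a lookahead checks that the
// whole string has the expected shape. Second alternative: \G continues
// exactly where the previous match ended; (?!^) stops \G from matching at
// offset 0, so an invalid subject yields no matches at all.
$pattern = '/(?:^(?=(?:[a-z]+ )+$)|\G(?!^))([a-z]+) /i';

preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);   // AA, BB, DD
```

Each segment is now a separate match of the whole pattern, which is why group 1 gets one entry per segment instead of only the last one.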



Answer 3:

Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all not suitable for this?

Your current regex seems to be for a preg_match() call. Try this instead:

$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);

Per comments, the Ruby regex I mentioned:

sentence = %r{
(?<subject>   cat   | dog        ){0}
(?<verb>      eats  | drinks     ){0}
(?<object>    water | bones      ){0}
(?<adjective> big   | smelly     ){0}
(?<obj_adj>   (\g<adjective>\s)? ){0}
The\s\g<obj_adj>\g<subject>\s\g<verb>\s\g<obj_adj>\g<object>
}x

md = sentence.match("The cat drinks water")
md = sentence.match("The big dog eats smelly bones")

But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|
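
That said, PCRE (and therefore the preg_ functions) has a close counterpart to those named definitions: groups declared inside a (?(DEFINE)...) block and called with (?&name). Below is a sketch of the same sentence pattern. One caveat: PCRE discards captures made inside a subroutine call once the call returns, so this validates a sentence but does not hand back the individual parts.

```php
<?php
// Named definitions live in a (?(DEFINE)...) block, which never matches by
// itself; (?&name) calls a definition as a subroutine.
$sentence = '/
    (?(DEFINE)
        (?<subject>   cat   | dog    )
        (?<verb>      eats  | drinks )
        (?<object>    water | bones  )
        (?<adjective> big   | smelly )
        (?<obj_adj>   (?: (?&adjective) \s )? )
    )
    ^ The \s (?&obj_adj) (?&subject) \s (?&verb) \s (?&obj_adj) (?&object) $
/x';

var_dump(preg_match($sentence, 'The cat drinks water'));           // int(1)
var_dump(preg_match($sentence, 'The big dog eats smelly bones'));  // int(1)
```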



Answer 4:

You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).

If you write it this way, you'll see that your subgroups are correctly there:

$pattern = '/(([a-z]+) )/i';

(this still has an unnecessary set of parentheses, I just left it there for illustration)
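
A short sketch of what that gives with the $subject from the question. Note that, unlike the original pattern, this no longer checks that the subject as a whole has the expected shape.

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/(([a-z]+) )/i';

// Without ^, $ and the outer +, every segment is a separate match, so each
// group gets one entry per segment.
preg_match_all($pattern, $subject, $matches);

print_r($matches[1]);   // 'AA ', 'BB ', 'DD '  (each with its trailing space)
print_r($matches[2]);   // 'AA', 'BB', 'DD'
```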



Answer 5:

Edit

I didn't realize what you had originally asked for. Here is the new solution:

$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();


Answer 6:

How about:

$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);

output:

Array
(
    [0] => AA
    [1] => BB
    [2] => CC
)
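
With the subject from the question, which ends in a space, a plain split would leave an empty last element. A small variation, assuming that empty pieces are unwanted:

```php
<?php
$str = 'AA BB DD ';   // the question's subject, with its trailing space

// PREG_SPLIT_NO_EMPTY drops the empty piece after the trailing space;
// -1 means "no limit" on the number of pieces.
$arr = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($arr);        // AA, BB, DD
```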


Answer 7:

I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?

// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD'; 
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';

$pattern = '/(([a-z]+)\s)+[a-z]+/i';

$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]);  // this matches $subject
echo "<br/>".$result;


Answer 8:

Yes, you're right: the solution is to use preg_match_all. preg_match_all repeats the search through the whole subject, so don't anchor the pattern with a leading ^ and a trailing $; that way preg_match_all puts every match it finds into an array.

Each new pair of parentheses adds a new array holding the matches of that group.

Use ? for optional matches.

You can separate different parts of the pattern with parentheses () so that each group is found and added to its own array (this lets you count matches, or categorize each match from the returned array).

Clarification required

Let me try to understand your question, so that my answer matches what you asked.

  1. Is your $subject not a good example of what you're looking for?

  2. Would you like the preg_match search to split what you provided in $subject into four categories: words, characters, punctuation and white space? And what about numbers?

  3. Would you also like the returned matches to include the offsets of the matches?
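
If offsets are wanted (point 3), preg_match_all can report them through the PREG_OFFSET_CAPTURE flag. A minimal sketch, assuming the simplified $subject from the question:

```php
<?php
$subject = 'AA BB DD ';

// With PREG_OFFSET_CAPTURE every entry becomes [matched text, byte offset].
preg_match_all('/[a-z]+/i', $subject, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$text, $offset]) {
    echo $text, ' @ ', $offset, "\n";   // AA @ 0, BB @ 3, DD @ 6
}
```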

Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?

I will take your basic example in $subject and make it work to give you exactly what you asked for.

So can you edit your $subject so that it better fits all the cases that you want to match?

Original '/^(([a-z]+) )+$/i';

Keep me posted. You can test your regexes here: http://www.spaweditor.com/scripts/regex/index.php

Partial answer

/([a-z])([a-z]+)/i

AA BB DD CD

Array
(
    [0] => Array
        (
            [0] => AA
            [1] => BB
            [2] => DD
            [3] => CD
        )

    [1] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => C
        )

    [2] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => D
        )

)