Update/Note:
I think what I'm probably looking for is to get the captures of a group in PHP.
Referenced: PCRE regular expressions using named pattern subroutines.
(Read carefully:)
I have a string that contains a variable number of segments (simplified):
$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well
I would like now to match the segments and return them via the matches array:
$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);
This will only return the last match for capture group 2: DD.
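The behaviour can be reproduced with a minimal sketch that reuses the $subject and $pattern from above unchanged. The comments describe what PCRE stores in $matches with the default PREG_PATTERN_ORDER flag:

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

$result = preg_match_all($pattern, $subject, $matches);

// $result is 1: the anchored pattern matches the whole subject exactly once.
// $matches[0] is ['AA BB DD '] (the full match).
// $matches[1] is ['DD '] and $matches[2] is ['DD']: a repeated group
// keeps only the text captured by its final iteration.
var_dump($result, $matches);
```

The earlier iterations (AA, BB) are overwritten while the group repeats, which is why they never reach the matches array.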
Is there a way to retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
This question is a generalization.
Both the $subject and the $pattern are simplified. Naturally, with input this simple the list of AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.
But I'm specifically asking how to return all of the subgroup matches with the preg_... family of functions.
For a real-life case, imagine multiple (nested) levels, each with a variable number of subpattern matches.
Example
This is an example in pseudo code to describe a bit of the background. Imagine the following:
Regular definitions of tokens:
CHARS := [a-z]+
PUNCT := [.,!?]
WS := [ ]
$subject gets tokenized based on these. The tokenization is stored in an array of tokens (type, offset, ...).
That array is then transformed into a string, containing one character per token:
CHARS -> "c"
PUNCT -> "p"
WS -> "s"
This makes it possible to run regular expressions over tokens (rather than character classes etc.) on the token stream string. E.g.
regex: (cs)?cp
to express one or two groups of chars followed by a punctuation token.
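The tokenization and token-stream steps described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name tokenize, the array keys type/text/offset and the sample subject are hypothetical, and characters that match none of the three token definitions are silently skipped here.

```php
<?php
// Hypothetical helper: splits $subject into CHARS (c), PUNCT (p) and WS (s)
// tokens and records the type, text and byte offset of each one.
function tokenize(string $subject): array
{
    $pattern = '/(?<c>[a-z]+)|(?<p>[.,!?])|(?<s>[ ])/';
    preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    foreach ($sets as $set) {
        foreach (['c', 'p', 's'] as $type) {
            // With PREG_OFFSET_CAPTURE an unmatched group has offset -1.
            if (isset($set[$type]) && $set[$type][1] !== -1) {
                $tokens[] = [
                    'type'   => $type,
                    'text'   => $set[$type][0],
                    'offset' => $set[$type][1],
                ];
                break;
            }
        }
    }
    return $tokens;
}

$tokens = tokenize('hello world!');

// One character per token: CHARS, WS, CHARS, PUNCT gives "cscp".
$stream = implode('', array_column($tokens, 'type'));

// The token-level regex now runs against the stream, not the raw text.
$isMatch = preg_match('/^(cs)?cp$/', $stream) === 1;
var_dump($stream, $isMatch);
```

Because each token occupies exactly one character in the stream, a match offset in the stream is also an index into the $tokens array, which leads back to the original text offsets.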
Now that I can express self-defined tokens as a regex, the next step was to build the grammar. This is only an example, in a sort of ABNF style:
words = word | (word space)+ word
word = CHARS+
space = WS
punctuation = PUNCT
If I now compile the grammar for words into a (token) regex, I would naturally like to get all subgroup matches of each word.
words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+) # words resolved to tokens
words = (c+)|((c+)s)+c+ # words resolved to regex
I could code up to this point. Then I ran into the problem that the subgroup matches contained only their last match.
So I have the option of either building an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or somehow making preg_match work for me so I can spare myself that.
That's basically all. Hopefully it's now understandable why I simplified the question.
Yes, you're right: the solution is to use preg_match_all.
preg_match_all searches the subject repeatedly and reports every match, so don't anchor the pattern with start-of-string ^ and end-of-string $; that way preg_match_all puts all found matches in an array.
Each new pair of parentheses adds a new array holding the matches of that group.
Use ? for optional matches.
You can separate different groups of patterns with parentheses () to ask for a group to be found and added in a new array (this lets you count matches, or categorize each match from the returned array).
Clarification required
Let me try to understand your question, so that my answer matches what you ask.
Is your $subject not a good example of what you're looking for?
Would you like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and white space? And what about numbers?
Would you also like the returned matches to include the offsets of the matches?
Does
$subject = 'aa.bb cc.dd EE FFF,GG';
better fit a real-life example?
I will take your basic example in $subject and make it work to give you exactly what you asked for.
So can you edit your $subject so that it better fits all the cases that you want to match?
Original
'/^(([a-z]+) )+$/i';
Keep me posted. You can test your regexes here: http://www.spaweditor.com/scripts/regex/index.php
Partial answer
/([a-z])([a-z]+)/i
AA BB DD CD
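As a minimal sketch of the unanchored preg_match_all approach suggested in this answer, applied to the $subject from the question: the pattern below is an assumption chosen for illustration, not the pattern from the partial answer.

```php
<?php
$subject = 'AA BB DD ';

// No ^ or $ anchors: every "letters followed by a space" segment is
// reported as its own match instead of one repeated group.
$count = preg_match_all('/([a-z]+) /i', $subject, $matches);

// $count is 3 and $matches[1] is ['AA', 'BB', 'DD'].
var_dump($count, $matches[1]);
```

Note the trade-off: without the anchors the pattern no longer checks that the whole subject consists only of such segments, so invalid text between segments is skipped rather than rejected.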
Similar thread: Get repeated matches with preg_match_all()
Check the chosen answer; mine might be useful too, so I will duplicate it here:
From http://www.php.net/manual/en/regexp.reference.repetition.php :
I personally gave up and am going to do this in 2 steps.
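A two-step version might look like the sketch below. The helper name captureSegments and both patterns are assumptions for illustration; it uses two regex executions, so it does not meet the question's "one regex execution" requirement.

```php
<?php
// Hypothetical helper: returns all segments, or null if the subject
// does not have the expected overall shape.
function captureSegments(string $subject): ?array
{
    // Step 1: validate the whole subject with an anchored pattern.
    // The group is non-capturing because its captures are not needed.
    if (preg_match('/^(?:[a-z]+ )+$/i', $subject) !== 1) {
        return null;
    }

    // Step 2: collect each segment with an unanchored pattern.
    preg_match_all('/([a-z]+) /i', $subject, $matches);
    return $matches[1];
}

var_dump(captureSegments('AA BB DD ')); // the three segments AA, BB, DD
var_dump(captureSegments('AA BB 42 ')); // NULL, since "42" is not [a-z]+
```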
EDIT:
I see that in that other thread someone claimed that a lookbehind method is able to do it.