I have a string like "39 3A 3B 9:;" and I want to extract "39, 3A, 3B".
I have tried:
my $a = "39 3A 3B 9:;";
grammar Hex {
token TOP { <hex_array>+ .* }
token hex_array { <[0..9 A..F]> " " }
};
Hex.parse($a);
But this doesn't seem to work. Even this simpler version, with the trailing characters removed from the input, doesn't seem to work:
my $a = "39 3A 3B ";
grammar Hex {
token TOP { <hex_array>+ }
token hex_array { <[0..9 A..F]> " " }
};
Hex.parse($a);
I did try Grammar::Tracer; both TOP and hex_array fail:
TOP
| hex_array
| * FAIL
* FAIL
`<[abcdef...]>` in a P6 regex is a "character class" in the match-one-character sense (see footnote 1). The idiomatic way to get what you want is to apply the `**` quantifier to the character class so that it matches more than one character.

The rest of this answer is "bonus" material on why and how to use `rule`s.

You are of course perfectly free to match whitespace situations by including whitespace patterns in arbitrary individual tokens, like you did with `" "` in your `hex_array` token.

However, it's good practice to use `rule`s instead when appropriate -- which is most of the time.

First, use ws instead of " ", \s* etc.

Let's remove the space in the second `token` and move it instead to the first one, so that the pattern in `TOP` becomes `[ <hex_array> " " ]+`. We've added square bracketing (`[...]`) that combines the `hex_array` and a space and then applied the `+` quantifier to that combined atom. It's a simple change, and the grammar continues to work as before, matching the space as before, except now the space won't be captured by the `hex_array` token.

Next, let's switch to using the built in `ws` token, so that the pattern becomes `[ <hex_array> <.ws> ]+`. The default `<ws>` is more generally useful, in desirable ways, than `\s*` (see footnote 2). And if the default `ws` doesn't do what you need you can specify your own `ws` token.

We've used `<.ws>` instead of `<ws>` because, like `\s*`, use of `<.ws>` avoids additional capture of whitespace that would likely just clutter up the parse tree and waste memory.

One often wants something like `<.ws>` after almost every token in higher level parsing rules that string tokens together. But if it were just explicitly written like that it would be highly repetitive and distracting `<.ws>` and `[ ... <.ws> ]` boilerplate. To avoid that there's a built in shortcut for implicitly expressing a default assumption of inserting the boilerplate for you. This shortcut is the `rule` declarator, which in turn uses `:sigspace`.

Using rule (which uses :sigspace)

A `rule` is exactly the same as a `token` except that it switches on `:sigspace` at the start of the pattern, so `rule TOP { ... }` behaves like `token TOP { :sigspace ... }`.

Without `:sigspace` (so in `token`s and `regex`es by default), all literal spaces in a pattern (unless you quote them) are ignored. This is generally desirable for readable patterns of individual `token`s because they typically specify literal things to match.

But once `:sigspace` is in effect, spaces after atoms become "significant" -- because they're implicitly converted to `<.ws>` or `[ ... <.ws> ]` calls. This is desirable for readable patterns specifying sequences of tokens or subrules because it's a natural way to avoid the clutter of all those extra calls.

Consider three ways to write `TOP`: `token TOP { <hex_array>+ }`, `token TOP { :sigspace <hex_array>+ }`, and `rule TOP { <hex_array>+ }`. The first pattern will match one or more `hex_array` tokens with no spaces being matched either between them or at the end. The last two will match one or more `hex_array`s, without intervening spaces, and then with or without spaces at the very end.

NB. Adverbs (like `:sigspace`) aren't atoms. Spaces immediately before the first atom (in the above, spaces before `<hex_array>`) are never significant (regardless of whether `:sigspace` is or isn't in effect). But thereafter, if `:sigspace` is in effect, all non-quoted spacing in the pattern is "significant" -- that is, it's converted to `<.ws>` or `[ ... <.ws> ]`.

In the above code, the second token and the rule would match a single `hex_array` with spaces after it because the space immediately after the `+` and before the `}` means the pattern is rewritten to `token TOP { <hex_array>+ <.ws> }`.

But this rewritten token won't match if your input has multiple `hex_array` tokens with one or more spaces between them. Instead you would want to write `rule TOP { <hex_array> + }`, which is rewritten to `token TOP { [ <hex_array> <.ws> ]+ <.ws> }`.
This will match your input.
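To make the three whitespace behaviours described above concrete, here is a small sketch. The grammar names (`NoSpaces`, `TrailingOnly`, `Between`) are made up for illustration, and the `** 2` in `hex_array` assumes each byte is written as exactly two uppercase hex digits:

```raku
# token: unquoted spaces in the pattern are ignored
grammar NoSpaces {
    token TOP       { <hex_array>+ }
    token hex_array { <[0..9 A..F]> ** 2 }
}

# rule: the space after the + acts like a trailing <.ws>
grammar TrailingOnly {
    rule  TOP       { <hex_array>+ }
    token hex_array { <[0..9 A..F]> ** 2 }
}

# rule: the space between the atom and the + puts <.ws> inside the repetition
grammar Between {
    rule  TOP       { <hex_array> + }
    token hex_array { <[0..9 A..F]> ** 2 }
}

my $spaced = "39 3A 3B ";

say so NoSpaces.parse("393A3B");         # True
say so NoSpaces.parse($spaced);          # False
say so TrailingOnly.parse("393A3B ");    # True
say so TrailingOnly.parse($spaced);      # False
say so Between.parse($spaced);           # True
```

Only the last grammar accepts input with spaces between the bytes, which is the case in the question.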
Conclusion
So, after all that apparent complexity, which is really just me being exhaustively precise, I'm suggesting you might write your original code as:
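A minimal sketch of such a grammar follows. It assumes each byte is written as exactly two uppercase hex digits (hence the `** 2`) and keeps the trailing `.*` from your first attempt so that the leftover `9:;` is consumed; both choices are my assumptions for illustration:

```raku
grammar Hex {
    # In a rule, the space between <hex_array> and + means that
    # whitespace is matched (via <.ws>) after every repetition.
    rule  TOP       { <hex_array> + .* }
    # Exactly two hex digits per byte (assumed for this sketch).
    token hex_array { <[0..9 A..F]> ** 2 }
}

my $a     = "39 3A 3B 9:;";
my $match = Hex.parse($a);

# Each repetition of <hex_array> is captured as a separate submatch.
say $match<hex_array>».Str.join(', ');   # 39, 3A, 3B
```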
and this would match more flexibly than your original (I'm presuming that would be a good thing though of course it might not be for some use cases) and would perhaps be easier to read for most P6ers.
Finally, to reinforce how to avoid two of the three gotchas of `rule`s, see also "What's the best way to be lax on whitespace in a perl6 grammar?". (The third gotcha is whether you need to put a space between an atom and a quantifier, as with the space between the `<hex_array>` and the `+` in the above.)

Footnotes
1 If you want to match multiple characters, then append a suitable quantifier to the character class. This is a sensible way for things to be, and the assumed behavior of a "character class" according to Wikipedia. Unfortunately the P6 doc currently confuses the issue, eg lumping together both genuine character classes and other rules that match multiple characters under the heading Predefined character classes.
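As a sketch of the point in footnote 1, the two grammars below differ only in the quantifier on the character class. The grammar names are made up for illustration, and `** 2` (exactly two digits per byte) is an assumption:

```raku
# The token from the question: one hex digit, then one literal space.
grammar OneDigit {
    token TOP       { <hex_array>+ }
    token hex_array { <[0..9 A..F]> " " }
}

# The same grammar with the character class quantified by ** 2.
grammar TwoDigits {
    token TOP       { <hex_array>+ }
    token hex_array { <[0..9 A..F]> ** 2 " " }
}

say so OneDigit.parse("3 9 3 A ");      # True
say so OneDigit.parse("39 3A 3B ");     # False
say so TwoDigits.parse("39 3A 3B ");    # True
```

The unquantified class consumes a single digit, so the original grammar fails as soon as it sees the second digit of `39` where it expects a space.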
2 The default `ws` rule is designed to match between words, where a "word" is a contiguous sequence of letters (Unicode category L), digits (Nd), or underscores. In code, it's specified as `regex ws { <!ww> \s* }`. `ww` is a "within word" test. So `<!ww>` means not within a "word". `<ws>` will always succeed where `\s*` would -- except that, unlike `\s*`, it won't succeed in the middle of a word. (Like any other atom quantified with a `*`, a plain `\s*` will always match because it matches any number of spaces, including none at all.)

If you don't need to use grammars, you can do this:
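One possible way to do it with a plain regex, as a sketch: `comb` with a regex returns every non-overlapping match as a string. The word-boundary anchors (`<<` and `>>`) and the restriction to uppercase digits are assumptions on my part:

```raku
my $a = "39 3A 3B 9:;";

# << and >> are left and right word boundaries, so only
# stand-alone groups of exactly two hex digits are picked up.
my @bytes = $a.comb(/ << <[0..9 A..F]> ** 2 >> /);

say @bytes;   # [39 3A 3B]
```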
The regex will match these 2-digit hex strings. Anyway, the problem with your grammar might be in the number of spaces you're using; grammars are very strict in that sense.