Does anyone know what [<.ident>] means?
First, as jj noted, <.ident> only matches, it doesn't capture, because of the leading . (dot). For the rest of this answer I'll generally omit the . because it makes no difference to the rule's meaning besides the capture aspect.
The <ident> in the code you quoted calls a "rule" which does an equivalent of:
token ident {
    [ <alpha> ]   # First character can't be a number
    [ <alnum> ]*  # Other characters can be a number
}
What the official doc says
The official doc says, in effect:
Predefined Character Class... Matches...
<ident> Identifier. Also a default rule.
But this doc is misleading:
It's not a character class. (Character class rules match a single character in that character class. <ident> matches one or more characters that fit a pattern, albeit one that involves character classes.)
All rules are default rules. (I think the "default" comment is there to emphasize that you can write your own <ident> rule if you don't like the built-in pattern, something that's also true, but makes much less sense, for rules that correspond to canonical character classes such as <lower>.)
It doesn't match identifiers, or more accurately it fails to match many Perl 6 identifiers. (See the "How the <ident> rule is used in matching Perl 6 identifiers" section at the end of this answer for an explanation of what it does as part of matching an identifier.)
The rest of this answer
Feel free to consider this question sufficiently answered and move on to something else. For those interested in more detail I've written the following sections:
The high level logic that calls the <ident> rule
The mid level logic that's in the <ident> rule
The machine level code that implements the <ident> rule
The specification of the <ident> rule
How the <ident> rule is used in matching Perl 6 identifiers
The high level logic that calls the <ident> rule
The grammar XML::Grammar statement introduces a user defined Perl 6 grammar.
A grammar is a class. ("Grammars are really just slightly specialized classes".)
A rule is a method. (say rule { ... } ~~ Method; # True.)
Any class created using a declaration of the form grammar foo { ... } inherits from the Grammar class which in turn inherits from the Match class.
In the Rakudo Perl 6 compiler, the Match class does the role NQPMatchRole.
NQPMatchRole includes an ident rule (in this case with a regular method declaration).
Because grammar XML::Grammar does not declare an ident rule, the <ident> call dispatches to the method in NQPMatchRole following normal method dispatch logic.
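The dispatch chain just described can be modelled with ordinary class inheritance. What follows is only a loose Python analogy, not Rakudo's code: Python has no roles, so role composition is approximated by inheritance, the class names simply mirror the ones named above, and the method bodies and the providing_class helper are hypothetical placeholders.

```python
class NQPMatchRole:
    """Stand-in for the role that supplies the built-in ident rule."""

    def ident(self):
        return "built-in ident"


class Match(NQPMatchRole):
    """Stand-in for Match, which does NQPMatchRole."""


class Grammar(Match):
    """Stand-in for Grammar, which inherits from Match."""


class XMLGrammar(Grammar):
    """A user grammar that declares no ident rule of its own."""


class CustomGrammar(Grammar):
    """A user grammar that overrides the built-in ident rule."""

    def ident(self):
        return "user-defined ident"


def providing_class(cls, name):
    """Return the name of the first class in the lookup order that defines `name`."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass.__name__
    return None
```

In this model providing_class(XMLGrammar, "ident") finds the method on NQPMatchRole, while CustomGrammar supplies its own, which is the sense in which every rule is a "default" rule.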
The mid level logic that's in the <ident> rule
NQPMatchRole is written in the nqp language, a subset of Perl 6 used to bootstrap the full Perl 6, and the heart of NQP, a compiler toolkit.
Excerpting and reformatting just the most salient code from the ident method declaration that's written in nqp, the rule begins with:
( nqp::ord($target, $!pos) == 95
|| nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos) )
This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.
The other bit of salient code is:
nqp::findnotcclass( nqp::const::CCLASS_WORD
This matches zero or more subsequent characters in the character class CCLASS_WORD.
A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.
The machine level code that implements the <ident> rule
NQP targets multiple "backends" or concrete virtual machines.
Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of these backends to verify what's what.
MoarVM is the only formally supported backend.
A search of MoarVM for CCLASS shows several matches.
The important one seems to be ops.c which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.
According to the code's comments:
CCLASS_ALPHABETIC currently matches exactly the same characters as the full Perl 6 or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters". I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).
CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any human language, not just English). I think that means the full Perl 6 level / NQP <alnum> rule is equivalent to CCLASS_WORD.
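To make the two-step logic concrete, here's a rough Python model of it. It is a sketch under assumptions, not a port of the nqp or MoarVM code: it takes CCLASS_ALPHABETIC to mean Unicode general category L, takes CCLASS_WORD to mean L plus Nd plus the underscore (the underscore is my assumption, implied by the <alnum> equivalence above), and works on codepoints rather than graphemes. The function names are made up for illustration.

```python
import unicodedata


def is_alphabetic(ch):
    """Model of CCLASS_ALPHABETIC: any Unicode Letter (general category L*)."""
    return unicodedata.category(ch).startswith("L")


def is_word(ch):
    """Model of CCLASS_WORD: a Letter, a decimal digit (Nd), or an underscore."""
    return ch == "_" or is_alphabetic(ch) or unicodedata.category(ch) == "Nd"


def match_ident(text, pos=0):
    """Return the end offset of an ident starting at pos, or None if none starts there."""
    if pos >= len(text):
        return None
    first = text[pos]
    # First character: an underscore (codepoint 95) or an alphabetic character.
    if not (ord(first) == 95 or is_alphabetic(first)):
        return None
    end = pos + 1
    # Remaining characters: zero or more word characters.
    while end < len(text) and is_word(text[end]):
        end += 1
    return end


def find_idents(text):
    """Scan left to right and collect every non-overlapping ident match."""
    found = []
    pos = 0
    while pos < len(text):
        end = match_ident(text, pos)
        if end is None:
            pos += 1
        else:
            found.append(text[pos:end])
            pos = end
    return found
```

With this model, find_idents("2+3 ab2") returns ["ab2"]: the scan skips 2+3 because a digit can't start a match, then accepts the trailing 2 because digits are allowed after the first character.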
The specification of the <ident> rule
The official specification of Perl 6 is embodied in roast (see footnote 1 at the end of this answer).
A search of roast for ident shows several matches.
Most use <ident> only incidentally. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.
Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:
From S05-mass/rx.t:
ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';
The ok function tests that its first argument returns True. This call tests that <ident> skips 2+3 and matches ab2.
From S05-mass/charsets.t:
is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';
The is function tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).
Here's a variation of this test that more clearly shows the matching that happens:
say ~$_ for $latin-chars ~~ m:g/<ident>/;
prints:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ
<ident> actually matches a lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.
So let's look at the other source commonly associated with "specification", the historical Perl 6 design docs.
First, we note the warning at the top of the design docs:
Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.
The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast.
(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but a more apt way to look at things is that "specs" as applied to the design docs is short for "speculations" or perhaps "specious nonsense" if the former doesn't make the point clear enough that they're not something to be fully relied upon.)
A search for ident in design.perl6.org shows several matches.
The most useful match is in the Predefined Subrules section of S05:
These are some of the predefined subrules for any grammar or regex:
- ident ... Match an identifier.
So, now we see where the docs got their notion of the meaning of <ident> from.
How the <ident> rule is used in matching Perl 6 identifiers
my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >;
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)
In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:
token identifier { <.ident> [ <[\-']> <.ident> ]* }
In Perl 6's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:
token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }
So <identifier> matches a pattern that includes one or more <ident>s.
The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.
The identifier methods are not exported by either Perl 6 or nqp so they're not part of the rule namespace of users' grammars.
If I write my own identifier token we can see it in action:
my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)
displays:
($bar %hash Foo your_ident my-ident)
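For readers who find the composition easier to follow as plain procedural code, here's a hedged Python model of the same pattern. It assumes <ident> means "a Letter or underscore, then Letters, decimal digits or underscores", works on codepoints, and uses function names I've invented for illustration. It is not how Rakudo or nqp implement these tokens.

```python
import unicodedata


def ident_end(text, pos):
    """Return the end offset of an <ident>-like run starting at pos, or None."""
    if pos >= len(text):
        return None
    ch = text[pos]
    if not (ch == "_" or unicodedata.category(ch).startswith("L")):
        return None
    pos += 1
    while pos < len(text):
        ch = text[pos]
        cat = unicodedata.category(ch)
        if not (ch == "_" or cat.startswith("L") or cat == "Nd"):
            break
        pos += 1
    return pos


def identifier_end(text, pos):
    # Model of the identifier token: one ident, then zero or more
    # groups of (hyphen or apostrophe, followed by another ident).
    end = ident_end(text, pos)
    if end is None:
        return None
    while end < len(text) and text[end] in "-'":
        nxt = ident_end(text, end + 1)
        if nxt is None:
            # A trailing - or ' with no ident after it is not consumed.
            break
        end = nxt
    return end


def is_sigiled_identifier(text):
    # Model of the anchored pattern: optional sigil, identifier, end of string.
    pos = 1 if text[:1] in ("$", "@", "%", "&") else 0
    return identifier_end(text, pos) == len(text)
```

Filtering the same seven names with is_sigiled_identifier keeps $bar, %hash, Foo, your_ident and my-ident. Foo::Bar is rejected because :: is not a joiner, and anothers' is rejected because the apostrophe is not followed by another <ident>.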
To summarize the above and some other considerations:
<ident> matches just parts of what <identifier> matches (though they're the same for the simplest names). Consider is-prime which is a Perl 6 identifier but contains two <ident> matches;
<identifier> matches just parts of "Perl 6 identifiers" (though they're the same for the simplest names). Consider infix:<+> which is sometimes also referred to as a Perl 6 identifier but requires both an <identifier> match and a colon pair pattern match;
Perl 6 identifiers match just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
1 The official specification of Perl 6 is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Perl 6. So far, there have only been two official branches/versions of roast, and therefore Perl 6. The first was/is 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second is 6.c.errata, which very conservatively adds corrections to 6.c deemed backwards compatible enough and/or too important not to be available as the current official recommended version of Perl 6. An "officially compliant" Perl 6 compiler passes some official branch of roast. The Rakudo compiler passes 6.c.errata.
If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Perl 6 language.