
<.ident> function/capture in perl6 grammars

Posted 2019-06-21 02:54

Question:

While reading the XML grammar for Perl 6 (https://github.com/supernovus/exemel/blob/master/lib/XML/Grammar.pm6), I am having some difficulty understanding the following token.

token pident {
  <!before \d> [ \d+ <.ident>* || <.ident>+ ]+ % '-'
}

More specifically <.ident>: there are no other definitions of ident, so I am assuming it is a reserved term, though I can't find a proper definition on perl6.org. Does anyone know what this means?

Answer 1:

Does anyone know what [<.ident>] means?

First, as jj noted, <.ident> only matches; it doesn't capture, because of the leading dot (.). For the rest of this answer I'll generally omit the dot because it makes no difference to the rule's meaning besides the capture aspect.
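To make the capture difference concrete, here is a minimal sketch that runs the same pattern with and without the leading dot. The variable names are mine, purely for illustration.

```raku
# Same pattern twice; only the leading dot differs.
my $captured   = 'foo' ~~ / <ident> /;
my $uncaptured = 'foo' ~~ / <.ident> /;

say ~$captured;                  # foo
say ~$uncaptured;                # foo
say $captured<ident>.defined;    # True  (a named capture exists)
say $uncaptured<ident>.defined;  # False (matched, but nothing captured)
```

Both matches cover the same text; only the first leaves an ident entry in the match object.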

The <ident> in the code you quoted calls a "rule" which does an equivalent of:

token ident {
    [ <alpha> ]   # First character can't be a number
    [ <alnum> ]*  # Other characters can be a number
}
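As a rough check of that equivalence, the sketch below compares the built-in rule with a hand-written token on a few strings. The name approx and the sample strings are hypothetical, chosen only for this illustration, and agreeing on five inputs doesn't prove the two are identical.

```raku
# Hypothetical stand-in for the built-in rule, following the sketch above.
my token approx { <alpha> <alnum>* }

for <foo _bar ab2 2ab a-b> -> $s {
    my $builtin = ($s ~~ / ^ <ident>  $ /).so;
    my $sketch  = ($s ~~ / ^ <approx> $ /).so;
    say "$s: builtin=$builtin sketch=$sketch";
}
```

I'd expect the first three strings to match both patterns and the last two to match neither: 2ab starts with a digit, and the hyphen in a-b isn't something <ident> accepts.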

What the official doc says

The official doc says, in effect:

    Predefined Character Class...     Matches...
    <ident>                           Identifier. Also a default rule.

But this is misleading doc.

  • It's not a character class. (Character class rules match a single character in that character class. <ident> matches one or more characters that fit a pattern, albeit one that involves character classes.)

  • All rules are default rules. (I think the default comment is there to emphasize that you can write your own <ident> rule if you don't like the built-in pattern, something that's also true but generally much less sensical for rules that correspond to canonical character classes such as <lower>.)

  • It doesn't match identifiers, or more accurately it fails to match many Perl 6 identifiers. (See the "How the <ident> rule is used in matching Perl 6 identifiers" section at the end of this answer for an explanation of what it does as part of matching an identifier.)

The rest of this answer

Feel free to consider this question sufficiently answered and move on to something else. For those interested in more detail I've written the following sections:

  • The high level logic that calls the <ident> rule

  • The mid level logic that's in the <ident> rule

  • The machine level code that implements the <ident> rule

  • The specification of the <ident> rule

  • How the <ident> rule is used in matching Perl 6 identifiers

The high level logic that calls the <ident> rule

The grammar XML::Grammar statement introduces a user defined Perl 6 grammar.

A grammar is a class. ("Grammars are really just slightly specialized classes".)

A rule is a method. (say rule { ... } ~~ Method; # True.)

Any class created using a declaration of the form grammar foo { ... } inherits from the Grammar class which in turn inherits from the Match class.

In the Rakudo Perl 6 compiler, the Match class does the role NQPMatchRole.

NQPMatchRole includes an ident rule (in this case with a regular method declaration).

Because grammar XML::Grammar does not declare an ident rule, the <ident> call dispatches to the method in NQPMatchRole following normal method dispatch logic.
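The dispatch described above can be sketched with two tiny grammars. Plain and Custom are hypothetical names, and the \d+ override is deliberately silly so that the difference is visible; this is an illustration of ordinary method dispatch, not anything taken from XML::Grammar.

```raku
# Plain declares no ident rule, so <ident> falls through to the built-in.
grammar Plain  { token TOP { <ident> } }

# Custom declares its own ident, which wins under normal method dispatch.
grammar Custom {
    token TOP   { <ident> }
    token ident { \d+ }   # hypothetical override, digits only
}

say Plain ~~ Grammar;        # True
say Grammar ~~ Match;        # True
say Plain.parse('abc').so;   # True
say Plain.parse('123').so;   # False
say Custom.parse('123').so;  # True
say Custom.parse('abc').so;  # False
```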

The mid level logic that's in the <ident> rule

NQPMatchRole is written in the nqp language, a subset of Perl 6 used to bootstrap the full Perl 6, and the heart of NQP, a compiler toolkit.

Excerpting and reformatting just the most salient code from the ident method declaration that's written in nqp, the rule begins with:

(    nqp::ord($target, $!pos) == 95
  || nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos)   )

This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.

The other bit of salient code is:

nqp::findnotcclass( nqp::const::CCLASS_WORD

This matches zero or more subsequent characters in the character class CCLASS_WORD.

A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.

The machine level code that implements the <ident> rule

NQP targets multiple "backends" or concrete virtual machines.

Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of these backends to verify what's what.

MoarVM is the only formally supported backend.

A search of MoarVM for CCLASS shows several matches.

The important one seems to be ops.c which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.

According to the code's comments:

CCLASS_ALPHABETIC currently matches exactly the same characters as the full Perl 6 or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters".

I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).

CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any human language, not just English).

I think that means the full Perl 6 level / NQP <alnum> rule is equivalent to CCLASS_WORD.
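A few spot checks at the Perl 6 level are consistent with that reading. This is only a sketch: the sample characters are my own picks, and U+0969 (DEVANAGARI DIGIT THREE) stands in for "a decimal digit that isn't ASCII".

```raku
my $digit = "\x[0969]";   # DEVANAGARI DIGIT THREE, a non-ASCII <:Nd> character

say so '_'       ~~ / ^ <:L>    $ /;  # False: underscore is not a Letter
say so '_'       ~~ / ^ <alpha> $ /;  # True:  but <alpha> accepts it
say so 'é'       ~~ / ^ <:L>    $ /;  # True
say so $digit    ~~ / ^ <:Nd>   $ /;  # True
say so $digit    ~~ / ^ <alnum> $ /;  # True
say so "x$digit" ~~ / ^ <ident> $ /;  # True:  a digit may follow the first character
say so "{$digit}x" ~~ / ^ <ident> $ /;  # False: a digit can't come first
```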

The specification of the <ident> rule

The official specification of Perl 6 is embodied in roast [1].

A search of roast for ident shows several matches.

Most use <ident> only incidentally. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.

Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:

From S05-mass/rx.t:

ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';

ok tests that its first argument returns True. This call tests that <ident> skips 2+3 and matches ab2.

From S05-mass/charsets.t:

is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';

is tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).

Here's a variation of this test that more clearly shows the matching that happens:

say ~$_ for $latin-chars ~~ m:g/<ident>/;

prints:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ
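The snippets above rely on a $latin-chars variable that the excerpt doesn't define. The sketch below assumes it is simply the first 256 codepoints joined into one string, as the description of the test suggests; the exact definition used in roast may differ.

```raku
# Assumption: the string holds codepoints 0 through 255 in order.
my $latin-chars = (0..255).map(*.chr).join;

my @runs = $latin-chars.comb(/<ident>/);
say @runs.elems;   # 9, one per line of the listing above
say @runs[0];      # ABCDEFGHIJKLMNOPQRSTUVWXYZ
say @runs[1];      # _
```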

<ident> actually matches a lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.

So let's look at the other source commonly associated with "specification", the historical Perl 6 design docs.

First, we note the warning at the top of the design docs:

Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.

The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast.

(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but a more apt way to look at things is that "specs" as applied to the design docs is short for "speculations" or perhaps "specious nonsense" if the former doesn't make the point clear enough that they're not something to be fully relied upon.)

A search for ident in design.perl6.org shows several matches.

The most useful match is in the Predefined Subrules section of S05:

These are some of the predefined subrules for any grammar or regex:

  • ident ... Match an identifier.

So, now we see where the docs got their notion of the meaning of <ident> from.

How the <ident> rule is used in matching Perl 6 identifiers

my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >; 
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)

In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:

token identifier { <.ident> [ <[\-']> <.ident> ]* }

In Perl 6's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:

token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }

So <identifier> matches a pattern that includes one or more <ident>s.

The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.

The identifier methods are not exported by either Perl 6 or nqp so they're not part of the rule namespace of users' grammars.

If I write my own identifier token we can see it in action:

my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)

displays:

($bar %hash Foo your_ident my-ident)

To summarize the above and some other considerations:

  • <ident> matches just parts of what <identifier> matches (though they're the same for the simplest names). Consider is-prime which is a Perl 6 identifier but contains two <ident> matches;

  • <identifier> matches just parts of "Perl 6 identifiers" (though they're the same for the simplest names). Consider infix:<+> which is sometimes also referred to as a Perl 6 identifier but requires both an <identifier> match and a colon pair pattern match;

  • Perl 6 identifiers match just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
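The layering in these three points can be seen by splitting the example names with each rule. This sketch reuses my hand-written identifier token from above, so it only approximates what the real Perl 6 grammar does.

```raku
# Hand-written copy of the identifier pattern; not the compiler's own token.
my token identifier { <.ident> [ <[\-']> <.ident> ]* }

say 'is-prime'.comb(/<ident>/);               # (is prime)
say 'is-prime'.comb(/<identifier>/);          # (is-prime)
say 'Foo-Bar::Baz-Qux'.comb(/<identifier>/);  # (Foo-Bar Baz-Qux)
say 'Foo-Bar::Baz-Qux'.comb(/<ident>/);       # (Foo Bar Baz Qux)
```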


[1] The official specification of Perl 6 is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Perl 6. So far, there have only been two official branches/versions of roast, and therefore Perl 6. The first was/is 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second is 6.c.errata, which very conservatively adds corrections to 6.c deemed backwards compatible enough and/or too important not to be available as the current official recommended version of Perl 6. An "officially compliant" Perl 6 compiler passes some official branch of roast. The Rakudo compiler passes 6.c.errata.

If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Perl 6 language.



Answer 2:

In general, the place to look for documentation is the Perl 6 documentation. <ident> is part of a regex, and you can find it in the definition of character classes. It matches Perl 6 identifiers. The . in front of ident suppresses capture.