Does .parse anchor or :sigspace first in a Perl 6 grammar?

Posted 2019-07-21 12:25

Question:

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?

I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:

grammar Number {
    rule TOP { \d+ }
    }

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string );
    given $result {
        when Match { put "<$string> worked!" }
        when Any   { put "<$string> failed!" }
        }
    }

With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:

<137> worked!
<137 > worked!
< 137 > failed!

I figure this means that rule is applying :sigspace first and the anchors afterward:

grammar Foo {
    regex TOP { ^ :sigspace \d+ $ }
    }

I expected a rule to allow leading whitespace, which would happen if you switched the order:

grammar Foo {
    regex TOP { :sigspace ^  \d+ $ }
    }

I could add an explicit beginning-of-string anchor to the rule:

grammar Number {
    rule TOP { ^ \d+ }
    }

Now everything works:

<137> worked!
<137 > worked!
< 137 > worked!

I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but they do not say in which order these effects apply:

Note that if you're parsing with .parse method, token TOP is automatically anchored

and

When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.


I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.

Answer 1:

The behavior is intended, and is a culmination of these language features:

  • Sigspace ignores whitespace before the first atom.

    From the design docs [1] (S05: Regexes and Rules, line 348, emphasis added):

    The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.

    This means:

    rule TOP { \d+ }
                  ^-------- <.ws> automatically inserted
    
    rule TOP { ^ \d+ $ }
                ^---^-^---- <.ws> automatically inserted
    
  • Regexes are first-class compiled code with lexical scoping.

    A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.

    Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope, i.e. to the fragment of source code they appear in at compile time. S05, line 629 [1]:

    The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)

  • The anchoring of rule TOP is done at run time by .parse.

    S05, line 4423 [1]:

    The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)

    I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.

    It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
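
A minimal sketch of that last point, assuming current Rakudo behavior. It relies on .subparse, the Grammar method that starts at the beginning of the string like .parse but does not require the end to be reached. The grammar is the one from the question; the input strings are made up for illustration.

```raku
grammar Number {
    rule TOP { \d+ }
}

# .parse must start at position 0 and consume the whole string.
say so Number.parse('137 ');        # True
say so Number.parse('137 abc');     # False: 'abc' is left over

# .subparse runs the very same TOP, but only the start is pinned down.
say so Number.subparse('137 abc');  # True: TOP itself has no end anchor
say so Number.subparse(' 137');     # False: TOP does not skip leading whitespace
```

The rule is identical in all four calls; only the method that invokes it changes the outcome.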

So when you write

rule TOP { \d+ }

it is compiled as

regex TOP { :r \d+ <.ws> }

And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:

regex TOP { ^ [:r :s \d+ ] $ }
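
The desugaring described in this answer can be checked by writing the <.ws> calls out by hand in a token and comparing the results with the rule. This is only a sketch: the four grammar names are hypothetical, and the token bodies are my reading of where <.ws> is inserted.

```raku
grammar AsRule        { rule  TOP { \d+ } }
grammar AsToken       { token TOP { \d+ <.ws> } }        # <.ws> after the atom only
grammar AnchoredRule  { rule  TOP { ^ \d+ } }
grammar AnchoredToken { token TOP { ^ <.ws> \d+ <.ws> } } # <.ws> after ^ as well

for '137', '137 ', ' 137 ' -> $string {
    # One True/False per grammar, in the order listed.
    my @results = (AsRule, AsToken, AnchoredRule, AnchoredToken).map: { so .parse($string) };
    say "<$string>: @results[]";
}
```

If the desugaring is right, each rule and its hand-written token agree on every string: the first pair rejects ' 137 ', while the pair with the explicit ^ accepts it, because the whitespace after ^ becomes a <.ws> call.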

[1] The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars is, IMO, well served by reading the full S05 from top to bottom at least once.



Answer 2:

There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.

My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.
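
A short sketch of that point. The rule name integer is made up for illustration; it has the same body as TOP, and it is anchored the same way once it is passed as :rule.

```raku
grammar Number {
    rule TOP     { \d+ }
    rule integer { \d+ }   # hypothetical second entry point
}

say so Number.parse('137 ');                    # True, same as :rule<TOP>
say so Number.parse('137 ', :rule<integer>);    # True
say so Number.parse(' 137', :rule<integer>);    # False: must start at position 0
say so Number.parse('137 abc', :rule<integer>); # False: must reach the end
```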



Tags: grammar perl6