Problems with an ambiguous grammar and PEG.js (no

2019-02-22 22:43发布

问题:

I want to parse a file with lines of the following content:

simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line

so now explaining the parts (not all spaces are described, because of the markup here):

  • simple word is one or several words (phrase) separated by the whitespace
  • abbr - is a fixed part of the string (never changes)
  • 8 - optional number
  • . - always included
  • word, simple phrase, one another phrase - one or several words or phrases separated by comma
  • - ( - fixed part, always included
  • simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; - (optional) one or several phrases separated by ;
  • "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose" - (optional) one or several phrases with quotation marks "separated by ;
  • ) This is the end of the line - always included

In the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1 type) or with them (phrase2 type).

So the phrases are Natural Language sentences (with all the punctuation possible)...

BUT:

  • the internal content is irrelevant (i.e. I do not need to parse the Natural Language itself in the NLP meaning)
  • it is just required to mark it as a phrase1 or phrase2 types:
    • those without and with quotation marks, i.e., if the phrase, which is placed between ( and ; or ; and ; or ; and ) or even between ( and ) is augmented with quotation marks, then it is phrase2 type
    • otherwise, if the phrase either begins or ends without quotation marks, though it could have all the marks within the phrase, it is the phrase1 type

Since to write a Regex (PCRE) for such an input is an overkill, so I looked to the parsing approach (EBNF or similar). I ended up with a PEG.js parser generator. I created a basic grammar variants (even not handling the part with the different phrases in the clause):

start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
  = [ \t\n\r]*

or (the difference is only in " abbr -" and "_ "abbr" _ "-"" ):

start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
  = [ \t\n\r]*

But even this simple grammar cannot parse the beginning of the string. The errors are:

  • Parse Error Expected [A-Za-z] but " " found.
  • Parse Error Expected "abbr" but "-" found.
  • etc.

So it looks the problem is in the ambiguity: "abbr" is consumed withing the term as a word token. Although I defined the rule !(" abbr -"), which I thought has a meaning, that the next word token will be only consumed, if the next substring is not of " abbr -" kind.

I didn't found any good examples explaining the following expressions of PEG.js, which seems to me as a possible solution of the aforementioned problem [from: http://pegjs.majda.cz/documentation]:

  • & expression
  • ! expression
  • $ expression
  • & { predicate }
  • ! { predicate }

TL;DR:

related to PEG.js:

  • are there any examples of applying the rules:

    • & expression
    • ! expression
    • $ expression
    • & { predicate }
    • ! { predicate }

general question:

  • what is the possible approach to handle such complex strings with intuitive ambiguous grammars? This is still not a Natural Language and it looks like it has some formal structure, just with several optional parts. One of the ideas is to split the strings by preprocessing (with the help of Regular Expressions, in the places of fixed elements, i.e. "abbr -" ") This is the end of the line") and then create for each splited part a separate grammar. But it seems to have have performance issues and scalability problems (i.e. - what if the fixed elements will change a bit - e.g. there will be no - char anymore.)

Update1:

I found the rule which solves the problem with matching the "abbr -" ambiguity:

term = term:(word (!" abbr -" _? word))+ {return term.join("")}

but the result looks strange:

[
   "simple, ,word",
   " abbr -",
   [
      "8",
      ...
   ],
   ...
]

if removing the predicate: term = term:(word (!" abbr -" _? word))+:

[
   [
      "simple",
      [
         [
            undefined,
            [
               " "
            ],
            "word"
         ]
      ]
   ],
   " abbr -",
   [
      "8",
      ".",
      " ",
      "(",
      ...
   ],
   ...
]

I expected something like:

[
   [
      "simple word"
   ],
   " abbr -",
   [
      "8",
      ".",
      " ",
      "(",
      ...
   ],
   ...
]

or at least:

[
   [
      "simple",
      [
         " ",
         "word"
      ]
   ],
   " abbr -",
   [
      "8",
      ".",
      " ",
      "(",
      ...
   ],
   ...
]

The expression is grouped, so why is it separated in so many nesting levels and even undefined is included in the output? Are there any general rules to fold the result based on the expression in the rule?

Update2:

I created the grammar so that it parses as desired, though I didn't yet identified the clear process of such a grammar creation:

start
  = (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))

//start //alternative way = looks better
//  = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}

term1
  = term1:(
    start_word:word
        (rest_words:(
          rest_word:(
            (non_abbr:!" abbr -"{return non_abbr;})
            (space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
        )) {return term1.join("");}


terms2
  = terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
    return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})

phrases
//  = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
  = (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)

phrase2
  = (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})

phrase1
  = ((pharse1:phrase){return {phrase1: pharse1}})

phrase
  = (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}

word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
  = [ \t\n\r]*

It could be tested either on the PEG.js author's site: [http://pegjs.majda.cz/online] or on the PEG.js Web-IDE: [http://peg.arcanis.fr/]

If somebody has answers for the previous questions (i.e. general approach for disambiguation of the grammar, examples to the expressions available in PEG.js) as well as improvement advices to the grammar itself (this is I think far away from an ideal grammar now), I would very appreciate!

回答1:

so why is it separated in so many nesting levels and even undefined is included in the output?

If you look at the documentation for PEG.js, you'll see almost every operator collects the results of its operands into an array. undefined is returned by the ! operator.

The $ operator bypasses all this nesting and just gives you the actual string that matches, eg: [a-z]+ will give an array of letters, but $[a-z]+ will give a string of letters.

I think most of the parsing here follows the pattern: "give me everything until I see this string". You should express this in PEG by first using ! to make sure you haven't hit the terminating string, and then just taking the next character. For example, to get everything up to " abbr -":

(!" abbr -" .)+

If the terminating string is a single character, you can use [^] as a short form of this, eg: [^x]+ is a shorter way of saying (!"x" .)+.

Parsing comma/semicolon separated phrases rather than comma/semicolon terminated phrases is slightly annoying, but treating them as optional terminators seems to work (with some triming).

start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)" 
  phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
  $.*
phrase_comma    =      p:$[^-,]+    [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+     [; ]* { return p.trim() }
q_phrase_semi   =  '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+

gives

[
    "simple word",
    " abbr -",
    "8",
    ".",
    " ",
    "(012)",
    [
        "word",
        "simple phrase",
        "one another phrase"
    ],
    "- (",
    [
        "simply dummy text of the printing",
        "Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
    ],
    [
        "It is a long established!",
        "Sometimes by accident, sometimes on purpose (injected humour and the like)",
        "sometimes on purpose"
    ],
    ")",
    " This is the end of the line"
]