I want to parse a file with lines of the following content:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
so now explaining the parts (not all spaces are described, because of the markup here):
simple word
is one or several words (phrase) separated by the whitespaceabbr -
is a fixed part of the string (never changes)8
- optional number.
- always includedword, simple phrase, one another phrase
- one or several words or phrases separated by comma- (
- fixed part, always includedsimply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!;
- (optional) one or several phrases separated by;
"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose"
- (optional) one or several phrases with quotation marks"
separated by;
) This is the end of the line
- always included
In the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1
type) or with them (phrase2
type).
So the phrases are Natural Language sentences (with all the punctuation possible)...
BUT:
- the internal content is irrelevant (i.e. I do not need to parse the Natural Language itself in the NLP meaning)
- it is just required to mark it as a
phrase1
orphrase2
types:- those without and with quotation marks, i.e., if the phrase, which is placed between
(
and;
or;
and;
or;
and)
or even between(
and)
is augmented with quotation marks, then it isphrase2
type - otherwise, if the phrase either begins or ends without quotation marks, though it could have all the marks within the phrase, it is the
phrase1
type
- those without and with quotation marks, i.e., if the phrase, which is placed between
Since to write a Regex (PCRE) for such an input is an overkill, so I looked to the parsing approach (EBNF or similar). I ended up with a PEG.js parser generator. I created a basic grammar variants (even not handling the part with the different phrases in the clause):
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
or (the difference is only in " abbr -"
and "_ "abbr" _ "-""
):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
- etc.
So it looks the problem is in the ambiguity: "abbr"
is consumed withing the term
as a word
token. Although I defined the rule !(" abbr -")
, which I thought has a meaning, that the next word
token will be only consumed, if the next substring is not of " abbr -"
kind.
I didn't found any good examples explaining the following expressions of PEG.js, which seems to me as a possible solution of the aforementioned problem [from: http://pegjs.majda.cz/documentation]:
& expression
! expression
$ expression
& { predicate }
! { predicate }
TL;DR:
related to PEG.js:
are there any examples of applying the rules:
& expression
! expression
$ expression
& { predicate }
! { predicate }
general question:
- what is the possible approach to handle such complex strings with intuitive ambiguous grammars? This is still not a Natural Language and it looks like it has some formal structure, just with several optional parts. One of the ideas is to split the strings by preprocessing (with the help of Regular Expressions, in the places of fixed elements, i.e. "abbr -" ") This is the end of the line") and then create for each splited part a separate grammar. But it seems to have have performance issues and scalability problems (i.e. - what if the fixed elements will change a bit - e.g. there will be no
-
char anymore.)
Update1:
I found the rule which solves the problem with matching the "abbr -"
ambiguity:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}
but the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
if removing the predicate: term = term:(word (!" abbr -" _? word))+
:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
The expression is grouped, so why is it separated in so many nesting levels and even undefined
is included in the output? Are there any general rules to fold the result based on the expression in the rule?
Update2:
I created the grammar so that it parses as desired, though I didn't yet identified the clear process of such a grammar creation:
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
It could be tested either on the PEG.js author's site: [http://pegjs.majda.cz/online] or on the PEG.js Web-IDE: [http://peg.arcanis.fr/]
If somebody has answers for the previous questions (i.e. general approach for disambiguation of the grammar, examples to the expressions available in PEG.js) as well as improvement advices to the grammar itself (this is I think far away from an ideal grammar now), I would very appreciate!