I am using Boost::Spirit to build simple "data-filter" language in my C++ GUI application for non-technical users. Language is very similiar to plain English and parse-able into AST. I am requested to make the process as user-friendly as possible, so I wish to provide CLang-like error messages ("Unrecognized 'tokken', did you meant 'token'?") and autocompletion.
The question in place is how to use Boost::Spirit grammar to generate possible token list for both goals (I will use simple string distance algorithm to satisfy first requirement)?
My ideas so far:
- Add position and length in source stream to token definition obtained from parser.
- Add special invalid token to grammar, which will match anything... well... not valid :-)
- If user presses ctrl + space, build AST (with invalid token the tree will be always buildable), then look for token inside current cursor position
- Use simple matcher on all possible tokens (I have token list after all) and prepare a list of suggestions
The problem with this solution is the suggestion will also suggest tokens that are invalid for given place. And if I add (and I will) definable identifiers, I have much bigger problem in hand...
One more constraint: I want to have grammar for this language defined in only one place; If the grammar changes, I want to autocompleter be aware of this changes after recompilation
Out of curiosity, I decided to try the concept.
Here's my attempt.
Plan
Let's parse arithmetic expressions with function calls.
Now, we want to parse (partial) expression with possible unknown identifiers.
In case of incomplete expressions, we want to "imply" the minimal token to complete it (and potentially continue parsing).
In case of unknown identifiers, we want to fuzzy-match the known identifiers in the domain (either variables or functions) and rank them in order of decreasing probability.
Base Definitions
Let's start out by deciding we want our input to be in memory, so we can refer to locations/substrings by using
string_view
s:Completion Hints
Besides the AST, we want our parser to generate completion hints (the implicit completion tokens and suggestions)
In addition, let's code up a quick & dirty fuzzy identifier match scoring function.
I opted for a simple synchronizing compare that
adj_diff
is a good fuzzy match foradjacent_difference
even though characters have been skipped from the candidate, butadj_qqq_diff
is worse because theqqq
from the input could not be matched)rate=1
at first invocation)We'll see how this is used in the grammar.
AST
A run-of-the-mill AST that can do binary expressions, string/numeric literals, variables and (nested) function calls:
Grammar
Basics
The grammar starts off pretty "basic" as well:
So far so good: The constructor stores a reference to the
Completion::Hints&
to be recorded. All these rules have been defined asqi::rule<It, Expression(), qi::space_type>
.Implied Tokens
Now it gets slightly more interesting, some rules here are lexemes¹
And some productions are tolerant of missing closing tokens:
The
implied
magic makes it so that if the expected closing token is missing, it will be recorded as a hint, and parsing continues:Of course, this begs the question what
imply(_1, ch)
really does, and we'll see later.Identifier Lookup
We store "symbol tables" in
Domain
members_variables
and_functions
:Then we declare some rules that can do lookups on either of them:
The corresponding declarations will be shown shortly.
Variables are pretty simple:
Calls are trickier. If a name is unknown we don't want to assume it implies a function unless it's followed by a
'('
character. However, if an identifier is a known function name, we want even to imply the(
(this gives the UX the appearance of autocompletion where when the user typessqrt
, it suggests the next character to be(
magically).It all builds on
known
,unknown
andmaybe_known
:Debug, Predefined Variables/Functions
Remaining is a bit of red tape:
Phoenix Helpers
Let's start with the usual helper to construct binary AST nodes:
Similarly, we can have a Phoenix function object to register implied chars:
Note that it orders by location (which makes UX easier) and detects when a previous implied token lived at the same location, in which case we simply append to it.
By now it won't be much of a surprise that generating relevant suggestions for unknown identifiers is also a phoenix actor:
That's all. If it looks more complicated, that's because it does a lot more: it scores all candidates fuzzily, orders them by descending score and removes any candidates with score < 3.
BONUS
Let's alter things a little more and allow
','
to be implied inside argument lists:Let's replace
','
there:TEST DRIVER
This post would be nothing without some nice test cases. So here goes:
Note that, besides the first input (
""
) everything is heuristically parsed as some kind of expression! The last one is designed to show off how all the parameter lists are implied becauseprint
,sqrt
andsin
are known functions. Then some','
are implied, before finally the unclosed string literal and remaining parentheses are closed. Here's the (non-debug) output:Full Listing / Live Demo
Live On Coliru
¹ Boost spirit skipper issues
Spirit doesn't have that feature. You could generate it yourself but it would be considerable effort to do it generically (if not straight up impossible, due NP-completeness). Perhaps just detect parse error (
on_error
) and have a limited number of stock "options" - the 80% rule should go a long way.Also, I think the "sketch" with parsing 'invalid placeholder tokens' will not work because you'll have to build in assumptions about the type of the placeholder token and it may therefore not result in a valid expression.
I sense that you treat expression parsing as little more than tokenizing - which isn't accurate in most cases.