I am new to c++ programming and wish to use the cparse library found here https://github.com/cparse/cparse in my project. I want to interpret strings of user input like "a*(b+3)
" (for variables a
and b
) and use it as a function repeatedly on different sets of input.
For example, taking a text file as input with 2 double
numbers per line my code will write a new file with the result of "a*(b+3)
" on each line (assuming a
is the first number and b
is the second).
My problem arises when I try and include the cparse library from git. I followed the setup instructions naively (being new to git):
$ cd cparse
$ make release
But I cannot use the make command as I am using windows. I tried downloading the zip file and copying the .cpp
and .h
files into the project directly and using the "include existing" option in visual studio, but get a huge number of compiler errors and can't make the code work myself.
Am I missing the point somehow? How do I get it to work?
Failing that, is there another way to parse strings of user input and use them as functions?
I would like to answer this part of your question because I have the feeling that a full-featured C parser might be a little bit too heavy for what might be your intention. (Btw. once you got the C parser running – how to process its output? Dynamic linking?)
Instead, I want to show you how to build a simple calculator (with recursive descent parser) on your own. For the documentation of the techniques I will use, I warmly recommend Compilers (Principles, Techniques & Tools) by Aho, Lam, Sethi, Ullman (better known as "Dragon books") and especially Chapter 4.
The Tiny Calculator Project
In the following I describe my sample solution part by part.
Language Design
Before starting to write a compiler or interpreter, it's reasonable to define a language which shall be accepted. I want to use a very limited sub-set of C: expressions consisting of
+
and-
+
,-
,*
, and/
()
;
(to mark the end of an expression, mandatory).Whitespaces (including line-breaks) will be simply ignored but may be used to separate things as well as to improve human readability. C or C++ like comments (and a lot of other sugar) I didn't consider to keep the source code as minimal as possible. (For all that, I got nearly 500 lines.)
The specific example of the OP will fit into this language with an added semicolon:
There will be only one type supported:
double
. Thus, I don't need types or any declaration which makes things easier.Before I started to sketch the grammar of this language, I was thinking about the "target" of compiling and decided to make classes for...
An Abstract Syntax Tree
At first – a class to store variables:
The variables provide storage for a value but not for the identifier. The latter is stored separately as it is not needed for the actual usage of a variable but only to find it by name:
The usage of a
std::map
automates the creation of a variable. As known from many high level languages, a variable starts to exist when its accessed the first time.The Abstract Syntax Tree is a
I took this text from the above linked Wikipedia article – couldn't said it shorter. In the following my classes for the AST:
Thus, using the AST classes, the representation of sample
a*(b+3)
would beNote:
There is no class derived from
Expr
to represent the parentheses()
because this is simply not necessary. The processing of parentheses is considered when building the tree itself. Normally, the operators with higher precedence become children of the operators with lower precedence. As a result, the former are computed before the latter. In the above example, the instance ofExprBinOpAdd
is a child of the instance ofExprBinOpMul
(although precedence of multiply is higher than precedence of add) which results from the proper consideration of the parentheses.Beside of storing a parsed expression, this tree can be used to compute the expression by calling the
Expr::solve()
method of the root node:Having a backend for our Tiny Calculator, next is the front-end.
A Grammar
A formal language is best described by a grammar.
with start symbol
program
.The rules contain
:
) to separate left side from right side (non-terminal on left side may expand to the symbols on the right side).|
) to separate alternatives<empty>
symbol for expanding to nothing (used to terminate recursions).From the terminal symbols, I will derive the tokens for the scanner.
The non-terminal symbols will be transformed into the parser functions.
The separation of
addExpr
andmulExpr
has been done intentionally. Thus, the precedence of multiplicative operators over additive operators will be "burnt in" the grammar itself. Obviously, the floating point constants, variable identifiers, or expressions in parentheses (accepted inprimExpr
) will have highest precedence.The rules contain only right recursions. This is a requirement for recursive-descent parsers (as I've learnt theoretically in the Dragon books and from practical experiences in the debugger until I fully understood the reason why). Implementing a left recursive rule in a recursive-descent parser results in non-terminated recursions which, in turn, end up with a StackOverflow.
The Scanner
It's usual to split the compiler in a scanner and a parser.
The Scanner processes the input character stream and groups characters together to tokens. The tokens are used as terminal symbols in the parser.
For the output of tokens, I made a class. In my professional projects, it stores additionally the exact file position to denote its origin. This is convenient to tag created objects with source code references as well as any output of error messages and debugging info. (...left out here to keep it as minimal as possible...)
There are two enumerators for special tokens:
EOT
... end of text (remarks the end of input)Error
... generated for any character which does not fit in any other token.The tokens are used as output of the actual scanner:
The scanner is used in the parser.
The Parser
Basically, the parser consists essentially of a bunch of rule functions (according to the grammar rules) which are calling each other. The class around the rule functions is responsible to manage some global parser context. Hence, the only public method of
class Parser
iswhich constructs an instance (with the private constructor) and calls the function
Parser::parseProgram()
corresponding to the start symbolprogram
.Internally, the parser calls the
Scanner::scan()
method to fill its look-ahead token.This is done in
Parser::scan()
which is called always when a token has to be consumed.Looking closer, a pattern becomes visible how the rules have been translated into the parser functions:
Each non-terminal on the left side becomes a parse function. (Looking closer in the source code, you will realize that I didn't do this exactly. Some of the rules have been "inlined". – From my point of view, I inserted some extra rules to simplify the grammar which I didn't intend to transform from beginning. Sorry.)
Alternatives (
|
) are implemented asswitch (_lookAhead.tk)
. Thereby, each case label corresponds to the first terminal(s) (token(s)) to which the left most symbol may expand. (I believe that's why it is called "look-ahead parser" – decisions to apply rules are always done based on the look ahead token.) The dragon book has a subject about FIRST-FOLLOW sets which explains this in more detail.For terminal symbols,
Parser::scan()
is called. In special cases, it is replaced byParser::match()
if precisely one terminal (token) is expected.For non-terminal symbols, the call of the corresponding function is done.
Symbol sequences are simply done as sequences of the above calls.
The error-handling of this parser is the most simple I ever did. It could/should be done much more supporting (investing more effort, i.e. additional lines of code). (...but here I tried to keep it minimal...)
Put Together
For testing and demonstration, I prepared a
main()
function with some built-in samples (source code of program and data to process):I compiled and tested on VS2013 (Windows 10) and got:
Please, consider that the parser itself ignores any spaces and line-breaks. However, to make the sample output formatting simple, I have to use one semicolon-terminated expression per line. Otherwise, it will be difficult to associate the source code lines with the corresponding compiled expressions. (Remember my note above about the
Token
to which a source code reference (aka file position) might be added.)The Complete Sample
...with source code and test run can be found on ideone.