The basic requirement is to use a keyword as an identifier, so I want to distinguish a token by its context (e.g. class is a keyword, but we also allow a variable named class).
With JavaCC this is possible, but it is hard. Here is how I do it:
TOKEN :
{
<I_CAL: "CAL"> : DO_CAL
| <I_CALL: "CALL">
| <I_CMP: "CMP">
| <I_EXIT: "EXIT">
| <I_IN: "IN">
| <I_JMP: "JMP">
| <I_JPC: "JPC"> : NEED_CMP_OP
| <I_LD: "LD"> : NEED_DATA_TYPE
| <I_NOP: "NOP">
| <I_OUT: "OUT">
| <I_POP: "POP">
| <I_PUSH: "PUSH">
| <I_RET: "RET">
| <I_DATA: "DATA"> : DO_DATA
| <I_BLOCK: ".BLOCK">
}
// T prefix for Token
TOKEN :
{
<T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
// We need the tokens below only in special contexts; otherwise they are just IDENTIFIER
// | <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT">
// | <PSEUDO_DATA_TYPE: "CHAR" >
// | <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD">
// | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
| <T_LABEL: <IDENTIFIER> ([" "])* <COLON>>
}
// Now we need a CMP OP
<NEED_CMP_OP> TOKEN:
{
<CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ"> : DEFAULT
}
// Now we need a DATA TYPE
<NEED_DATA_TYPE,DO_CAL> TOKEN:
{
// EXTENSION Add char to data type
<DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT" | "CHAR"> {
    if (curLexState == DO_CAL) {
        SwitchTo(NEED_CAL_OP);
    } else {
        SwitchTo(DEFAULT);
    }
}
}
// We need a CAL OP
<NEED_CAL_OP> TOKEN:
{
<CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD"> : DEFAULT
}
// Also need to skip whitespace in these states
<NEED_DATA_TYPE,NEED_CAL_OP,NEED_CMP_OP,DO_CAL,DO_DATA> SKIP:
{
" "
| "\t"
| "\r"
| "\f"
}
Source is here. I can distinguish the token from its context by curLexState.
It works, but it is fussy to do: I need to add and maintain a lot of extra states. Is there an easier way to achieve this?
There are three ways to do this outlined in the JavaCC FAQ.
Below I'll give three examples of the third approach.
Using keywords as identifiers
If all you want to do is to allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer put in the usual rules.
In the parser write a production
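As a minimal sketch of the lexer rules and the production described here, the grammar might look like the following. The parser class name KeywordAsName and the token names CLASS and VARNAME are illustrative assumptions, not names taken from the original post.

```javacc
PARSER_BEGIN(KeywordAsName)
public class KeywordAsName {}
PARSER_END(KeywordAsName)

SKIP : { " " | "\t" | "\r" | "\n" }

// The usual lexer rules. "class" matches both rules with the same length;
// the rule that appears first (CLASS) wins the tie.
TOKEN : { < CLASS : "class" > }
TOKEN : { < VARNAME : ["a"-"z","A"-"Z"] ( ["a"-"z","A"-"Z","0"-"9"] )* > }

// A variable name is either an ordinary name or the keyword itself.
Token varName() :
{ Token t; }
{
    ( t = <CLASS> | t = <VARNAME> )
    { return t; }
}
```

The lexer stays context-free in this sketch; only the parser decides that a CLASS token is acceptable wherever a name is expected.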
Then use varName() elsewhere in the parser.
The original poster's assembler
Turning to the assembler example in the original question, let's look at the JPC instruction as an example. The JPC (jump conditional) instruction is followed by a comparison operator such as Z, B, etc., and then an operand that can be a number of things, including identifiers. E.g. we could have
But we could also have an identifier named JPC or Z, so
and
are also valid JPC instructions.
In the lexical part we have
In the parser we have
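A sketch of how the lexical part and the parser part could fit together for JPC is given below. It is not the original code: the IDENTIFIER definition, the parser class name Asm and the production names JpcInstruction, Operand and Identifier are assumptions, and only two opcodes are listed to keep it short.

```javacc
PARSER_BEGIN(Asm)
public class Asm {}
PARSER_END(Asm)

SKIP : { " " | "\t" | "\r" | "\n" }

// Lexical part: every keyword is an ordinary token in the DEFAULT state.
TOKEN : { < I_JPC : "JPC" > | < I_JMP : "JMP" > }   // remaining opcodes omitted
TOKEN : { < CMP_OP : "Z" | "B" | "BE" | "A" | "AE" | "NZ" > }
TOKEN : { < T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" | "RS" | "RB" > }
// Assumed identifier syntax; it comes last so the keywords above win ties.
TOKEN : { < IDENTIFIER : ["A"-"Z","a"-"z","_"] ( ["A"-"Z","a"-"z","0"-"9","_"] )* > }

// Parser part: the position in the production supplies the context.
void JpcInstruction() : {}
{
    <I_JPC> <CMP_OP> Operand()
}

void Operand() : {}
{
    <T_REGISTER>
|   Identifier()
}

// Keywords are accepted wherever an identifier is expected.
Token Identifier() :
{ Token t; }
{
    ( t = <IDENTIFIER> | t = <I_JPC> | t = <I_JMP> | t = <CMP_OP> )
    { return t; }
}
```

With these rules an input whose operand is spelled like an opcode or a comparison operator is still accepted, because Operand reaches those token kinds through Identifier.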
I would suggest excluding register names from the list of other keywords that could be used as identifiers.
If you do include <T_REGISTER> in that list, then there will be an ambiguity in operand because Operand looks like this
Now there is an ambiguity because
has two parses. In the context of being an operand, we want tokens like "R0" to be parsed as registers, not identifiers. Luckily JavaCC will prefer earlier choices, so this is exactly what will happen. You will get a warning from JavaCC. You can ignore the warning. (I add a comment to my source code so that other programmers don't worry.) Or you can suppress the warning with a lookahead specification.
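To illustrate the last option, here is a hedged fragment in which register names are also allowed as identifiers. It assumes tokens T_REGISTER, IDENTIFIER, I_JPC and CMP_OP are defined elsewhere in the grammar; the production names are hypothetical.

```javacc
void Operand() : {}
{
    // An explicit lookahead specification tells JavaCC the overlap is intended,
    // so no choice-conflict warning is issued. The register alternative is
    // tried first, so "R0" in operand position is parsed as a register.
    LOOKAHEAD(1) <T_REGISTER>
|   Identifier()
}

Token Identifier() :
{ Token t; }
{
    ( t = <IDENTIFIER> | t = <T_REGISTER> | t = <I_JPC> | t = <CMP_OP> )
    { return t; }
}
```

The behaviour is the same as without LOOKAHEAD(1); the specification only records that the first-choice preference is deliberate.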
Using right context
So far all the examples have used left context. I.e. we can tell how to treat a token based solely on the sequence of tokens to its left. Let's look at a case where the interpretation of a keyword is based on the tokens to the right.
Consider this simple imperative language in which all the keywords can be used as variable names.
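The discussion that follows refers to productions named Block, Assignment and IfElse and to the tokens "if", "else", "end" and ":=". A grammar of that shape might be sketched in JavaCC as below. The remaining names (Program, Statement, LHS, Exp, VarName, ID) and the exact rules are assumptions made for illustration. No lookahead specifications are given in this version, so it only states the language; JavaCC would be expected to report choice conflicts for it.

```javacc
PARSER_BEGIN(Simple)
public class Simple {}
PARSER_END(Simple)

SKIP : { " " | "\t" | "\r" | "\n" }

TOKEN : { < IF : "if" > | < ELSE : "else" > | < END : "end" > | < ASSIGN : ":=" > }
TOKEN : { < ID : ["a"-"z","A"-"Z"] ( ["a"-"z","A"-"Z","0"-"9"] )* > }

void Program()    : {} { Block() <EOF> }
void Block()      : {} { [ Statement() Block() ] }
void Statement()  : {} { Assignment() | IfElse() }
void Assignment() : {} { LHS() <ASSIGN> Exp() }
void LHS()        : {} { VarName() }
void IfElse()     : {} { <IF> Exp() Block() [ <ELSE> Block() ] <END> }
void Exp()        : {} { VarName() }

// Every keyword may also serve as a variable name.
void VarName()    : {} { <ID> | <IF> | <ELSE> | <END> }
```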
This grammar is unambiguous. You can make the grammar more complicated by adding new kinds of statements, expressions and left-hand sides; as long as the grammar stays unambiguous, such complications probably won't make much difference to what I'm going to say next. Feel free to experiment.
The grammar is not LL(1). There are two places where a choice must be made based on more than one future token. One is the choice between Assignment and IfElse when the next token is "if". Consider the block
vs
We can look ahead for a ":=" like this
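A sketch of such a syntactic lookahead, assuming productions named Statement, Assignment, IfElse and LHS and a token ASSIGN for ":=" (names chosen for illustration):

```javacc
void Statement() : {}
{
    // Scan ahead: if a complete left-hand side followed by ":=" comes next,
    // this is an assignment, even when the left-hand side is spelled "if".
    LOOKAHEAD( LHS() <ASSIGN> ) Assignment()
|   IfElse()
}
```

Under these assumptions, a statement beginning `if := ...` takes the first branch, while one beginning `if q ...` fails the lookahead and falls through to IfElse.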
The other place we need to look ahead is when an "else" or an "end" is encountered at the start of a Block. Consider
We can solve this with
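One way this could be written, again as a sketch with assumed names (Block, Statement, LHS, and tokens IF and ASSIGN):

```javacc
void Block() : {}
{
    // Continue the block only if what follows really starts a statement:
    // either a left-hand side followed by ":=", or an "if".
    // An "else" or "end" that is not followed by ":=" ends the block instead.
    [ LOOKAHEAD( LHS() <ASSIGN> | <IF> ) Statement() Block() ]
}
```

In this sketch the alternatives inside the lookahead are tried in order, so a leading "if" is accepted whether or not it turns out to be a variable name; the choice between Assignment and IfElse is then left to Statement.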
If you merge the lexer and parser into a character-oriented parser, then it is relatively easy to distinguish keywords in context, because the parser is all about retaining context. You could operate JavaCC on character tokens to achieve this effect, but its LL nature would probably make it impossible to write practical grammars for other reasons.
If you separate lexer and parser, this isn't easy.
You are asking the lexer to know when something is an identifier or a keyword, which it can only do by knowing the context in which the Id/keyword is found.
Ideally the lexer would simply ask the parser for its state, and that would identify the contexts in which the choice is made. That's hard to organize; most parsers aren't designed to reveal their state easily or in a form easy to interpret for extracting the context signal needed. JavaCC isn't obviously organized this way.
Your other obvious choice is to model the different contexts as states in the lexer, with transitions between lexing states corresponding to transitions between interesting contexts. This may or may not be easy depending on the context. If you can do it, you have to code the states and the transitions in your lexer and keep them up to date. When you can do this "easily", it is not a bad solution. This can be hard or impossible depending on the specific contexts.
For the OP's purpose (apparently a parser for an assembler), the context is usually determined by the position within the source line. One can qualitatively divide assembler input into Label, Opcode, Operand and Comment contexts by watching whitespace: a newline sets the context to Label, whitespace in Label context sets the context to Opcode, whitespace in Opcode context sets the context to Operand, and whitespace in Operand context sets the context to Comment. With these state transitions, one can write different sublexers for each context, thus having different keywords in each subcontext.
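The whitespace-driven transitions can be sketched with JavaCC lexical states roughly as follows. All state, token and production names here are hypothetical, only three opcodes are listed, and the sketch inherits the simplification above: an operand cannot contain whitespace, since the first blank after it starts the comment.

```javacc
PARSER_BEGIN(LineAsm)
public class LineAsm {}
PARSER_END(LineAsm)

// DEFAULT plays the role of the Label context (start of a line).
<DEFAULT> TOKEN : { < LABEL : ["A"-"Z","a"-"z","_"] ( ["A"-"Z","a"-"z","0"-"9","_"] )* > }
<DEFAULT> SKIP  : { < ( [" ","\t"] )+ > : OPCODE_CTX }

// Opcode keywords exist only in this context.
<OPCODE_CTX> TOKEN : { < OPCODE : "JPC" | "JMP" | "LD" > }
<OPCODE_CTX> SKIP  : { < ( [" ","\t"] )+ > : OPERAND_CTX }

<OPERAND_CTX> TOKEN : { < OPERAND : ( ~[" ","\t","\r","\n"] )+ > }
<OPERAND_CTX> SKIP  : { < ( [" ","\t"] )+ > : COMMENT_CTX }

// Everything up to the end of the line is a comment.
<COMMENT_CTX> SKIP : { < ( ~["\r","\n"] )+ > }

// A newline resets the context to Label, whatever state we are in.
<*> SKIP  : { "\r" }
<*> TOKEN : { < EOL : "\n" > : DEFAULT }

void Program() : {} { ( Line() )* <EOF> }
void Line()    : {} { [ <LABEL> ] [ <OPCODE> [ <OPERAND> ] ] <EOL> }
```

In this arrangement a name such as JPC at the start of a line is lexed as a LABEL, and the same spelling after leading whitespace is lexed as an OPCODE, without the parser being involved.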
This trick doesn't work for languages like PL/I, which have vast numbers of keywords in context (for PL/I, in fact, every keyword is only in context!).
A non-obvious choice is to not try to differentiate at all. When an Id/Keyword is found, feed both tokens to the parser, and let it sort out which one leads to a viable parse. (Note: it may have to handle the cross product of multiple ambiguous tokens, and thus many possible parses, while sorting this out.) This requires a parser that can handle ambiguity, both while parsing and in the tokens it accepts (or it can't accept both an ID and a Keyword token at the same time). This is a beautifully simple solution to use when you have the right parsing machinery. JavaCC isn't that machinery.
[See my bio for a GLR parsing engine in which all 3 solutions are easily accessible. It handles PL/I easily.]