I'm working on a lexical analyzer program right now and I'm using Java. I've been researching this problem, but so far I haven't found any answers. Here's my problem:
Input:
System.out.println ("Hello World");
Desired Output:
Lexeme----------------------Token
System [Key_Word]
. [Object_Accessor]
out [Key_Word]
. [Object_Accessor]
println [Key_Word]
( [left_Parenthesis]
"Hello World" [String_Literal]
) [right_Parenthesis]
; [statement_separator]
I'm still a beginner so I hope you guys can help me on this. Thanks.
You can use libraries like Lex & Bison in C, or ANTLR in Java. Lexical analysis can be done by building automata. I'll give you a small example.

Suppose you need to tokenize a string where the keywords (language) are {'echo', '.', ' ', 'end'}. By keywords I mean the language consists of the following keywords only. So if the input is a string made up of those keywords, my lexer should output the corresponding sequence of tokens.
Now, to build an automaton for such a tokenizer, I can start by sketching a state diagram. The idea is that you have a start state, S. You consume an E and go to some other state; now you expect an N or a C to come next, for END and ECHO respectively. You keep consuming characters and reach different states within this simple finite state machine. Ultimately, you reach an emit state; for example, after consuming E, N, D you reach the emit state for END, which emits the token and sends you back to the start state. This cycle continues for as long as the character stream keeps coming into your tokenizer. On an invalid character you can either throw an error or ignore it, depending on the design. (A minimal Java sketch of this machine appears at the end of this answer.)

Lexical analysis is a topic by itself that usually goes together with compiler design and analysis. You should read up on it before trying to code anything. My favourite book on this topic is the Dragon book, which should give you a good introduction to compiler design and even provides pseudocode for all compiler phases, which you can easily translate to Java and move on from there.
In short, the main idea is to parse the input and divide it into tokens belonging to certain classes (parentheses or keywords, for example, in your desired output) using a finite state machine. Building the state machine is really the only hard part of this analysis, and the Dragon book will give you great insight into it.
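To make this concrete, here is a minimal hand-rolled sketch of such a state machine in Java for the tiny {'echo', '.', ' ', 'end'} language above (the class and state names are my own):

```java
import java.util.ArrayList;
import java.util.List;

// A hand-rolled finite state machine tokenizer for the language
// whose only keywords are 'echo', '.', ' ' and 'end'.
public class TinyLexer {
    private enum State { START, E, EC, ECH, EN }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.START;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == '.' || c == ' ') tokens.add(String.valueOf(c)); // single-char tokens
                    else if (c == 'e') state = State.E;  // could be 'echo' or 'end'
                    else throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
                    break;
                case E:
                    if (c == 'c') state = State.EC;      // heading toward 'echo'
                    else if (c == 'n') state = State.EN; // heading toward 'end'
                    else throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
                    break;
                case EC:
                    if (c == 'h') state = State.ECH;
                    else throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
                    break;
                case ECH:
                    if (c == 'o') { tokens.add("echo"); state = State.START; } // emit ECHO
                    else throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
                    break;
                case EN:
                    if (c == 'd') { tokens.add("end"); state = State.START; }  // emit END
                    else throw new IllegalArgumentException("Unexpected '" + c + "' at " + i);
                    break;
            }
        }
        if (state != State.START) throw new IllegalArgumentException("Unexpected end of input");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("echo echo.end")); // [echo,  , echo, ., end]
    }
}
```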
Write a program to make a simple lexical analyzer that will build a symbol table from a given stream of characters. You will need to read a file named "input.txt" to collect all the characters. For simplicity, the input file will be a C/Java/Python program without headers and methods (the body of the main program). Then you will identify all the numerical values, identifiers, keywords, math operators, logical operators and others [distinct]. See the example for more details. You can assume that there will be a space after each keyword.
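A rough sketch of one way to start on this in Java, using regular expressions to classify each lexeme (the keyword set and the category names here are illustrative, not a complete solution):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.regex.*;

// Reads input.txt and files each distinct lexeme under a coarse category.
public class SymbolTableBuilder {
    // A tiny illustrative keyword set; extend for the language you target.
    private static final Set<String> KEYWORDS = Set.of("int", "if", "else", "while", "return");
    private static final Pattern TOKEN = Pattern.compile(
        "\\d+(\\.\\d+)?"        // numerical values
        + "|[A-Za-z_]\\w*"      // identifiers and keywords
        + "|&&|\\|\\||!"        // logical operators
        + "|[+\\-*/%=<>]"       // math/relational operators
        + "|\\S");              // anything else

    public static void main(String[] args) throws Exception {
        String src = new String(Files.readAllBytes(Paths.get("input.txt")));
        Map<String, Set<String>> table = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            String lexeme = m.group();
            String category;
            if (lexeme.matches("\\d+(\\.\\d+)?"))      category = "number";
            else if (KEYWORDS.contains(lexeme))        category = "keyword";
            else if (lexeme.matches("[A-Za-z_]\\w*"))  category = "identifier";
            else if (lexeme.matches("&&|\\|\\||!"))    category = "logical operator";
            else if (lexeme.matches("[+\\-*/%=<>]"))   category = "math operator";
            else                                       category = "other";
            table.computeIfAbsent(category, k -> new LinkedHashSet<>()).add(lexeme); // [distinct]
        }
        table.forEach((cat, lexemes) -> System.out.println(cat + ": " + lexemes));
    }
}
```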
CookCC ( https://github.com/coconut2015/cookcc ) generates a very fast, small, zero-dependency lexer for Java.
You need neither ANTLR nor the Dragon book to write a simple lexical analyzer by hand. Even lexical analyzers for fuller languages (like Java) aren't terribly complicated to write by hand. Obviously if you have an industrial task you might want to consider industrial strength tools like ANTLR or some lex variant, but for the sake of learning how lexical analysis works, writing one by hand would likely prove to be a useful exercise. I'm assuming that this is the case, since you said you're still a beginner.
Here's a simple lexical analyzer, written in Java, for a subset of a Scheme-like language, that I wrote after seeing this question. I think the code is relatively easy to understand even if you've never seen a lexer before, simply because breaking a stream of characters (in this case a String) into a stream of tokens (in this case a List<Token>) isn't that hard. If you have questions I can try to explain in more depth.
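A minimal sketch of what such a lexer might look like, with example use in main (the Token record and the token names here are illustrative, not the original code):

```java
import java.util.ArrayList;
import java.util.List;

// A tiny lexer for a Scheme-like subset: parentheses, integers, and symbols.
public class SchemeLexer {
    enum Type { LPAREN, RPAREN, NUMBER, SYMBOL }
    record Token(Type type, String text) { }

    public static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                     // whitespace separates tokens
            } else if (c == '(') {
                tokens.add(new Token(Type.LPAREN, "(")); i++;
            } else if (c == ')') {
                tokens.add(new Token(Type.RPAREN, ")")); i++;
            } else if (Character.isDigit(c)) {
                int start = i;                           // scan the whole number
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(Type.NUMBER, input.substring(start, i)));
            } else {
                int start = i;                           // everything else is a symbol
                while (i < input.length() && !Character.isWhitespace(input.charAt(i))
                        && input.charAt(i) != '(' && input.charAt(i) != ')') i++;
                tokens.add(new Token(Type.SYMBOL, input.substring(start, i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example use:
        System.out.println(lex("(+ 1 (* 2 3))"));
        // [Token[type=LPAREN, text=(], Token[type=SYMBOL, text=+],
        //  Token[type=NUMBER, text=1], Token[type=LPAREN, text=(], ...]
    }
}
```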
Once you've written one or two simple lexers like this, you will get a pretty good idea of how this problem decomposes. Then it would be interesting to explore how to use automated tools like lex. The theory behind regular-expression-based matchers is not too difficult, but it does take a while to fully understand. I think writing lexers by hand motivates that study and helps you come to grips with the problem better than diving straight into the theory behind converting regular expressions to finite automata (first NFAs, then NFAs to DFAs), etc. Wading into that theory can be a lot to take in at once, and it is easy to get overwhelmed.
Personally, while the Dragon book is good and very thorough, the coverage might not be the easiest to understand because it aims to be complete, not necessarily accessible. You might want to try some other compiler texts before opening up the Dragon book. Here are a few free books, which have pretty good introductory coverage, IMHO:
http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf
http://www.diku.dk/~torbenm/Basics/
Some articles on the implementation of regular expressions (automated lexical analysis usually uses regular expressions)
http://swtch.com/~rsc/regexp/
I hope that helps. Good luck.
ANTLR 4 will do exactly this with the Java.g4 reference grammar. You have two options, depending on how closely you want the handling of Unicode escape sequences to follow the language specification: pass an ANTLRInputStream to the lexer directly, or wrap the ANTLRInputStream in a JavaUnicodeInputStream, which processes Unicode escape sequences according to the JLS prior to feeding them to the lexer.

Edit: The names of the tokens produced by this grammar differ slightly from your table:
- The Key_Word token is Identifier
- The Object_Accessor token is DOT
- The left_Parenthesis token is LPAREN
- The String_Literal token is StringLiteral
- The right_Parenthesis token is RPAREN
- The statement_separator token is SEMI
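For example, once you have generated the lexer from Java.g4 (the generated class is typically named JavaLexer; adjust if yours differs) and have the ANTLR 4 runtime on the classpath, printing the lexeme/token table looks roughly like this:

```java
import org.antlr.v4.runtime.*;

public class PrintTokens {
    public static void main(String[] args) {
        // CharStreams.fromString is the modern replacement for ANTLRInputStream.
        CharStream input = CharStreams.fromString("System.out.println (\"Hello World\");");
        JavaLexer lexer = new JavaLexer(input); // generated from Java.g4
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            // Skip whitespace/comments if the grammar routes them to the hidden channel.
            if (t.getChannel() != Token.DEFAULT_CHANNEL) continue;
            String name = lexer.getVocabulary().getSymbolicName(t.getType());
            System.out.println(t.getText() + " [" + name + "]");
        }
    }
}
```

This prints System [Identifier], . [DOT], out [Identifier], and so on, matching the mapping above.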