Currently I have the following regex tokens (school assignment):
Identificator = [a-zA-Z_][a-zA-Z0-9_]*
Integer = [0-9]+
ReservedKeywords = true|false|while|foreach|for|plus
Symbols = \*|/|-|\(|\)|
Blank = \s+
I can't use the Scanner class because there may be no whitespace between certain tokens. Note that the Parser (given that it receives correct tokens) is ready, as are type checking and evaluation of the AST. Only the "simplest" part, the Tokenizer, is missing, and there are not enough complete examples on the internet.
I don't understand the documentation of the java.util.regex.Matcher class; it is very confusing.
Currently the following sequences are legal:
- [ ReservedKeyword| Identificator| Integer ] followed by a symbol
- symbol followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
- [ ReservedKeyword| Identificator| Integer| Symbol ] followed by a blank
- a blank followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
- [ ReservedKeyword| Identificator| Integer| Symbol| Blank ] followed by End of Stream/String
We have to use the Matcher class, so there is no chance to hardcode the tokenizer in any way (that would be too simple: a simple state machine + map lookup, but we are not allowed to do that).
The tokenizer must have 2 methods ("hasNext" and "next").
I need an example showing how to use the Matcher to match a string with delimiters that depend on context. The Scanner class is not suitable because it will "eat" the delimiters, while the delimiters are part of the grammar. See the following example:
(3 plus 5)*(8/3*7)
It should be tokenized to
(.3.plus.5.).*.(.8./.3.*.7.)
I can use "\(|\)|\s+" as the delimiter, but then the scanner will return just
3 plus 5 * 8 / 3 * 7
and due to operator associativity the result will be
3 plus (((5*8)/3)*7)
which is incorrect.
I need to do the following:
Given a set of patterns (Identificator, Integer, ReservedKeywords, Symbols, Blank, whatever), I need to match the first occurrence of any of these patterns. The delimiters are "Symbols | Blank", but the delimiters should not be discarded; instead they should be returned as tokens. This must be done using the Matcher class.
An example showing how to tokenize a string using "Blank | Symbols" as delimiters, returning either the delimited string or the delimiter itself, should be enough.
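For illustration, one possible way to do this with Matcher is to join all the patterns into a single alternation with named groups, then repeatedly call region() and lookingAt() from the current position. This is only a minimal sketch under a few assumptions: the names RegexTokenizer, Token, Kind and tokenize are hypothetical, the Symbols pattern is written as the character class [*/()-], keywords are only recognized when not followed by another identifier character (so "format" stays an Identificator), and blanks are matched but skipped, since the expected tokenization above does not list them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTokenizer {
    // One named group per token class. Keywords are tried before identifiers;
    // the negative lookahead keeps "for" from matching the start of "format".
    private static final Pattern TOKEN = Pattern.compile(
          "(?<KEYWORD>(?:true|false|while|foreach|for|plus)(?![a-zA-Z0-9_]))"
        + "|(?<IDENT>[a-zA-Z_][a-zA-Z0-9_]*)"
        + "|(?<INT>[0-9]+)"
        + "|(?<SYMBOL>[*/()-])"
        + "|(?<BLANK>\\s+)");

    // Enum constant names must match the group names above (BLANK is skipped).
    enum Kind { KEYWORD, IDENT, INT, SYMBOL }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    private final String input;
    private final Matcher matcher;
    private int pos = 0;
    private Token lookahead; // next non-blank token, or null at end of input

    RegexTokenizer(String input) {
        this.input = input;
        this.matcher = TOKEN.matcher(input);
        advance();
    }

    public boolean hasNext() {
        return lookahead != null;
    }

    public Token next() {
        if (lookahead == null) throw new NoSuchElementException();
        Token result = lookahead;
        advance();
        return result;
    }

    // Finds the next non-blank token starting exactly at pos.
    private void advance() {
        lookahead = null;
        while (pos < input.length()) {
            matcher.region(pos, input.length());
            if (!matcher.lookingAt()) { // lookingAt() anchors at the region start
                throw new IllegalStateException(
                    "Unexpected character '" + input.charAt(pos) + "' at index " + pos);
            }
            pos = matcher.end();
            if (matcher.group("BLANK") != null) continue; // assumption: drop blanks
            for (Kind kind : Kind.values()) {
                if (matcher.group(kind.name()) != null) {
                    lookahead = new Token(kind, matcher.group());
                    return;
                }
            }
        }
    }

    // Convenience helper: the lexemes of the whole input, in order.
    static List<String> tokenize(String input) {
        List<String> lexemes = new ArrayList<>();
        RegexTokenizer tokenizer = new RegexTokenizer(input);
        while (tokenizer.hasNext()) lexemes.add(tokenizer.next().text);
        return lexemes;
    }
}
```

With this sketch, RegexTokenizer.tokenize("(3 plus 5)*(8/3*7)") yields the 13 lexemes ( 3 plus 5 ) * ( 8 / 3 * 7 ), parentheses included. Because the token is computed one step ahead, an illegal character is reported when the token before it is consumed; a lazier variant could defer that to hasNext().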
After struggling to figure out how the Matcher works, I was able to create a tokenizer that is a little more sophisticated than the usual Scanner. Since no one answered, here's the relevant part (since this was a school assignment, I can share the code):
As anticipated, this Tokenizer is more useful than the Scanner class. The Scanner class has the downside of discarding delimiters (since a symbol may be a delimiter when parsing a program, I don't want symbols to be discarded).
This Tokenizer uses the Scanner to retrieve blank-delimited strings, then uses additional processing to split those strings around symbols.
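A minimal sketch of the approach just described, not the original assignment code: a Scanner with its default whitespace delimiter supplies blank-delimited chunks, and each chunk is then split at zero-width positions immediately before or after a symbol, so the symbols survive as tokens. The class name SplittingTokenizer and the symbol set [*/()-] are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

final class SplittingTokenizer {
    // Zero-width split points: just before or just after any symbol.
    // Nothing is consumed by the split, so the symbols are kept.
    private static final Pattern AROUND_SYMBOL =
        Pattern.compile("(?=[*/()-])|(?<=[*/()-])");

    private final Scanner scanner; // yields blank-delimited chunks
    private final Deque<String> pending = new ArrayDeque<>();

    SplittingTokenizer(String input) {
        this.scanner = new Scanner(input); // default delimiter is whitespace
    }

    public boolean hasNext() {
        fill();
        return !pending.isEmpty();
    }

    public String next() {
        fill();
        if (pending.isEmpty()) throw new NoSuchElementException();
        return pending.removeFirst();
    }

    // Refills the queue from the next chunk once the previous one is used up.
    private void fill() {
        while (pending.isEmpty() && scanner.hasNext()) {
            for (String piece : AROUND_SYMBOL.split(scanner.next())) {
                if (!piece.isEmpty()) pending.addLast(piece); // drop empty pieces
            }
        }
    }
}
```

On "(3 plus 5)*(8/3*7)" this sketch returns the same 13 tokens as the expected tokenization in the question. It only separates lexemes; it does not classify or validate them, so a chunk such as "3abc" would come back as a single string and would have to be rejected in a later step.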