How to create a Tokenizer with the util.regex.Matc

Posted 2019-07-31 07:37

I have the following regex tokens (school assignment):

Identificator = [a-zA-Z_][a-zA-Z0-9_]*
Integer = [0-9]+
ReservedKeywords = true|false|while|foreach|for|plus
Symbols = \*|/|-|\(|\)
Blank = \s+
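For reference, here is a minimal sketch of how these definitions might be written as compiled Java patterns. The class name TokenPatterns and the constant names are made up for illustration. Two details matter in Java: `*` has to be escaped, and the alternation must not end with an empty alternative, because that alternative matches the empty string.

```java
import java.util.regex.Pattern;

// Hypothetical holder class; the names only mirror the token definitions above.
final class TokenPatterns {
    static final Pattern IDENTIFICATOR = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    static final Pattern RESERVED = Pattern.compile("true|false|while|foreach|for|plus");
    // '*', '(' and ')' are regex metacharacters and need a backslash.
    static final Pattern SYMBOL = Pattern.compile("\\*|/|-|\\(|\\)");
    static final Pattern BLANK = Pattern.compile("\\s+");

    public static void main(String[] args) {
        // matches() requires the whole input to match the pattern.
        System.out.println(SYMBOL.matcher("*").matches());
        System.out.println(RESERVED.matcher("foreach").matches());
        System.out.println(IDENTIFICATOR.matcher("9abc").matches());
    }
}
```

The three calls in main print true, true and false.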

I can't use the Scanner class because there may be no whitespace between certain tokens. The parser is ready (given that it receives correct tokens), and so are type checking and evaluation of the AST. Only the "simplest" part, the tokenizer, is missing, and I could not find a complete enough example on the internet.

I don't understand the documentation of the java.util.regex.Matcher class; it is very confusing.

The following sequences are legal:

  • [ ReservedKeyword| Identificator| Integer ] followed by a symbol
  • symbol followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
  • [ ReservedKeyword| Identificator| Integer| Symbol ] followed by a blank
  • a blank followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
  • [ ReservedKeyword| Identificator| Integer| Symbol| Blank ] followed by End of Stream/String

We have to use the Matcher class, so hardcoding the tokenizer is not an option (that would be too simple: a small state machine plus a map lookup, but we are not allowed to do that).

The tokenizer must have 2 methods ("hasNext" and "next").

I need an example showing how to use the Matcher to split a string on delimiters that depend on context. The Scanner class is not suitable because it "eats" the delimiters, while here the delimiters are part of the grammar. Consider the following example:

(3 plus 5)*(8/3*7)

It should be tokenized to

(.3.plus.5.).*.(.8./.3.*.7.)

I can use "\(|\)|\s+" as the delimiter, but then the scanner returns just

3 plus 5 * 8 / 3 * 7

and, because of operator associativity, the result is

3 plus (((5*8)/3)*7)

which is incorrect.
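The effect can be reproduced with a few lines. This is only a sketch: the class name ScannerDropsDelimiters and the helper scan are made up, and the delimiter is the one discussed above with the parentheses escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class: collects whatever the Scanner hands back.
final class ScannerDropsDelimiters {
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // Parentheses and runs of whitespace act as delimiters.
            sc.useDelimiter("\\(|\\)|\\s+");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The parentheses are consumed as delimiters and never show up as tokens.
        System.out.println(scan("(3 plus 5)*(8/3*7)"));
    }
}
```

Whatever the Scanner returns, the grouping information carried by the parentheses is gone by the time the parser sees the tokens.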

I need to do the following:

Given a set of patterns (Identificator, Integer, ReservedKeywords, Symbols, Blank, and so on), I need to match the first occurrence of any of them. The delimiters are "Symbols | Blank", but they must not be discarded; they should be returned as tokens instead. This must be done using the Matcher class.

An example showing how to tokenize a string with "Blank | Symbols" as delimiters, returning either the delimited string or the delimiter itself, should be enough.
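One possible way to get this behaviour with nothing but the Matcher is sketched below. It is not the assignment's reference solution. The class name RegexTokenizer, the group names and the kind() accessor are assumptions made for illustration, and blanks are skipped rather than returned. The idea is to put every token class into one alternation with named groups, restrict the matcher to the unread part of the input with region(), and use lookingAt() so that a match must start exactly at the current position.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer: hasNext()/next() driven only by java.util.regex.Matcher.
final class RegexTokenizer implements Iterator<String> {
    private static final String[] KINDS = {"KEYWORD", "IDENT", "INT", "SYMBOL"};
    private static final Pattern BLANK = Pattern.compile("\\s+");
    // Keywords come first; the lookahead keeps "format" from being read as "for".
    private static final Pattern TOKEN = Pattern.compile(
          "(?<KEYWORD>(?:true|false|while|foreach|for|plus)(?![a-zA-Z0-9_]))"
        + "|(?<IDENT>[a-zA-Z_][a-zA-Z0-9_]*)"
        + "|(?<INT>[0-9]+)"
        + "|(?<SYMBOL>\\*|/|-|\\(|\\))");

    private final Matcher matcher;
    private final int length;
    private int pos = 0;   // index of the first unread character
    private String kind;   // token class of the lexeme returned by the last next()

    RegexTokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
        this.length = input.length();
    }

    @Override
    public boolean hasNext() {
        // Skip one run of blanks, if present, without producing a token.
        matcher.usePattern(BLANK).region(pos, length);
        if (matcher.lookingAt()) {
            pos = matcher.end();
        }
        return pos < length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // lookingAt() anchors the match at the start of the region.
        matcher.usePattern(TOKEN).region(pos, length);
        if (!matcher.lookingAt()) {
            throw new IllegalStateException("Unexpected character at index " + pos);
        }
        for (String k : KINDS) {
            if (matcher.group(k) != null) {
                kind = k;
                break;
            }
        }
        pos = matcher.end();
        return matcher.group();
    }

    String kind() {
        return kind;
    }

    public static void main(String[] args) {
        RegexTokenizer t = new RegexTokenizer("(3 plus 5)*(8/3*7)");
        StringJoiner joined = new StringJoiner(".");
        while (t.hasNext()) {
            joined.add(t.next());
        }
        System.out.println(joined); // (.3.plus.5.).*.(.8./.3.*.7.)
    }
}
```

This sketch does not check the "legal sequences" listed earlier (for example, it accepts an integer immediately followed by a keyword); that check would have to be added separately or left to the parser.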

1 answer
Rolldiameter
Reply #2 · 2019-07-31 08:24

After a hard time figuring out how the Matcher works, I was able to create a tokenizer that is a little more sophisticated than the usual Scanner. Since no one answered, here is the relevant part (this was a school assignment, so I can share the code):

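import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Fields of the tokenizer class (class declaration and constructor omitted)
private Scanner scanner; // assigned in Init(), so it cannot be final
// No trailing "|" here: an empty alternative would match the empty string at every position
private final String delimiter = "\\*|/|-|\\(|\\)";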

private final Pattern delim  = Pattern.compile(delimiter);
private Matcher delim_matcher;

private String  region;
private int     regionStart;
private int     regionEnd;
private int     start;
private int     end;

// Called by the constructor (omitted here because it is trivial)
private void Init(){
    scanner = new Scanner(System.in);
    // Start with an empty region when there is no input, so that next() reports EOF
    region = scanner.hasNext() ? scanner.next() : "";
    delim_matcher = delim.matcher(region);
    regionStart = 0;
    regionEnd = region.length();
}

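// Looks for the next symbol in the current region. If there is none, start and end
// both point at the end of the region, so any remaining text counts as a prefix.
private boolean nextDelimiter(){
    boolean found = delim_matcher.find();
    start = found ? delim_matcher.start() : delim_matcher.regionEnd();
    end = found ? delim_matcher.end() : delim_matcher.regionEnd();
    return found;
}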

private boolean hasPrefix(){
    return start > regionStart;
}

public TokenType next() throws NoSuchElementException{
    // Find the next delimiter (symbol). decodePrefix/decodeSymbol (not shown) set the tokenType field.
    boolean found = nextDelimiter();

    if(hasPrefix()){

        // there is text before the delimiter (keyword, identificator, integer)
        decodePrefix( region.substring(regionStart,start) );

        if(found)
            delim_matcher.region(start,regionEnd); // rewind so the symbol is matched again on the next call

        regionStart = start; // from now on hasPrefix() is false
        return tokenType;   
    }
    else if(found){ // no prefix, so the symbol itself is the next token

        decodeSymbol( region.substring(start,end) );
        delim_matcher.region(end,regionEnd); // continue after the symbol just returned
        regionStart = end;
        return tokenType;
    }
    else{

        if(scanner.hasNext()){ // current string is exhausted; the Scanner has already skipped the blanks
            region = scanner.next(); 
            delim_matcher = delim.matcher(region);
            regionStart = 0;
            regionEnd = region.length();
            return next();
        }else
            return tokenType = EOF;
    }
}

public boolean hasNext() {
    return tokenType != EOF; //EOF is a value of the enum "TokenType"
}

As anticipated, this tokenizer is more useful than the Scanner class for this task. The Scanner has the downside of discarding delimiters; since a symbol may act as a delimiter when parsing a program, I don't want symbols to be discarded.

This tokenizer uses the Scanner to retrieve blank-delimited strings, then does additional processing to split those strings around symbols.
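The second step, splitting one blank-free string around symbols while keeping the symbols, can also be shown in isolation. The following is a simplified sketch, not the code above: the class name SplitKeepingSymbols and the method split are made up, and it returns plain strings instead of token types. It relies on Matcher.find() together with start() and end() to locate each symbol and the text in front of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a blank-free chunk around symbols, keeping the symbols.
final class SplitKeepingSymbols {
    private static final Pattern SYMBOL = Pattern.compile("\\*|/|-|\\(|\\)");

    static List<String> split(String chunk) {
        List<String> parts = new ArrayList<>();
        Matcher m = SYMBOL.matcher(chunk);
        int last = 0; // end index of the previously found symbol
        while (m.find()) {
            if (m.start() > last) {
                // text between the previous symbol and this one
                parts.add(chunk.substring(last, m.start()));
            }
            parts.add(m.group()); // the symbol itself is a token too
            last = m.end();
        }
        if (last < chunk.length()) {
            parts.add(chunk.substring(last)); // text after the last symbol
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("5)*(8/3*7)")); // [5, ), *, (, 8, /, 3, *, 7, )]
    }
}
```

Feeding each string returned by Scanner.next() through such a method gives the same token sequence as the stateful version, at the cost of building a list per string.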
