How to create a Tokenizer with the util.regex.Matc

I have the following regex token definitions (a school assignment):

Identificator = [a-zA-Z_][a-zA-Z0-9_]*
Integer = [0-9]+
ReservedKeywords = true|false|while|foreach|for|plus
Symbols = \*|/|-|\(|\)
Blank = \s+

I can't use the Scanner class because there may be no whitespace between certain tokens. Note that the parser is ready (given that it receives correct tokens), and so are type checking and evaluation of the AST. Only the "simplest" part, the tokenizer, is missing, and I could not find sufficiently complete examples on the internet.

I don't understand the documentation of the java.util.regex.Matcher class; it is very confusing.
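For tokenizing, only a handful of Matcher calls tend to matter: find(), lookingAt(), region(), start(), end() and group(). The following is a minimal sketch of how they behave on a small input. The class and method names (MatcherBasics, findIntegers, integerAt) are made up for illustration and are not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherBasics {
    private static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // find() scans forward for the next match anywhere in the input.
    // Each result is recorded as "text@start-end" (end is exclusive).
    static List<String> findIntegers(String input) {
        Matcher m = INTEGER.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    // region() restricts matching to [from, end) and lookingAt() only
    // succeeds if a match begins exactly at the start of that region.
    static String integerAt(String input, int from) {
        Matcher m = INTEGER.matcher(input);
        m.region(from, input.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findIntegers("(3 plus 15)")); // [3@1-2, 15@8-10]
        System.out.println(integerAt("(3 plus 15)", 1)); // 3
        System.out.println(integerAt("(3 plus 15)", 0)); // null
    }
}
```

find() suits "search for the next delimiter", while region() plus lookingAt() suits "what token starts exactly here".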

The following sequences are legal:

[ ReservedKeyword| Identificator| Integer ] followed by a symbol
symbol followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
[ ReservedKeyword| Identificator| Integer| Symbol ] followed by a blank
a blank followed by [ ReservedKeyword| Identificator| Integer| Symbol ]
[ ReservedKeyword| Identificator| Integer| Symbol| Blank ] followed by End of Stream/String

We have to use the Matcher class, so hand-coding the tokenizer is not an option (that would be too simple: a simple state machine plus a map lookup, but we are not allowed to do that).

The tokenizer must have 2 methods ("hasNext" and "next").

I need an example showing how to use the Matcher to tokenize a string whose delimiters depend on context. The Scanner class is not suitable because it "eats" the delimiters, while here the delimiters are part of the grammar. See the following example:

(3 plus 5)*(8/3*7)

It should be tokenized to

(.3.plus.5.).*.(.8./.3.*.7.)

I can use "\(|\)|\s+" as the delimiter, but then the Scanner will return just

3 plus 5 * 8 / 3 * 7

and due to operator associativity the result will be

3 plus (((5*8)/3)*7)

which is incorrect.
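The loss of the parentheses can be reproduced with a few lines. This is only a sketch: the class name ScannerDropsDelimiters is hypothetical, and the delimiter is the escaped form of the one quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDropsDelimiters {
    // Splits the input with Scanner, treating parentheses and blanks as delimiters.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter("\\(|\\)|\\s+");
            while (sc.hasNext()) {
                tokens.add(sc.next()); // delimiters are consumed, never returned
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // None of the printed tokens is "(" or ")", so the grouping is lost.
        System.out.println(scan("(3 plus 5)*(8/3*7)"));
    }
}
```

Whatever delimiter pattern is chosen, Scanner never hands the delimiter text back, so it cannot serve on its own when delimiters are also tokens.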

I need to do the following:

Given a set of patterns (Identificator, Integer, ReservedKeywords, Symbols, Blank, and so on), I need to match the first occurrence of any of these patterns. The delimiters are "Symbols | Blank", but they should not be discarded; instead they should be returned as tokens. This must be done using the Matcher class.

An example showing how to tokenize a string using "Blank | Symbols" as delimiters, returning either the delimited string or the delimiter itself, should be enough.
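One possible shape for such a tokenizer is sketched below. It is not the assignment's reference solution. It assumes that all token classes are combined into one pattern with named groups, that matching is anchored at the current position with region() and lookingAt(), and that blanks are skipped rather than returned (as in the expected output above). The class name RegexTokenizer, the group names and the "KIND:text" string format are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // One alternative per token class. Earlier alternatives win, so keywords
    // come before identifiers; the trailing \b keeps "format" from being
    // split into the keyword "for" plus "mat".
    private static final Pattern TOKEN = Pattern.compile(
          "(?<BLANK>\\s+)"
        + "|(?<INTEGER>[0-9]+)"
        + "|(?<KEYWORD>(?:true|false|while|foreach|for|plus)\\b)"
        + "|(?<IDENT>[a-zA-Z_][a-zA-Z0-9_]*)"
        + "|(?<SYMBOL>[*/()-])");
    private static final String[] KINDS = {"INTEGER", "KEYWORD", "IDENT", "SYMBOL"};

    private final String input;
    private final Matcher matcher;
    private int pos = 0; // offset of the next unread character

    RegexTokenizer(String input) {
        this.input = input;
        this.matcher = TOKEN.matcher(input);
        skipBlanks();
    }

    // Advances pos past a run of whitespace, if one starts at pos.
    private void skipBlanks() {
        matcher.region(pos, input.length());
        if (matcher.lookingAt() && matcher.group("BLANK") != null) {
            pos = matcher.end();
        }
    }

    public boolean hasNext() {
        return pos < input.length();
    }

    // Returns the next token as "KIND:text".
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        matcher.region(pos, input.length());
        if (matcher.lookingAt()) {
            for (String kind : KINDS) {
                String text = matcher.group(kind);
                if (text != null) {
                    pos = matcher.end();
                    skipBlanks(); // so hasNext() is false on trailing blanks
                    return kind + ":" + text;
                }
            }
        }
        throw new IllegalStateException("Unexpected character at offset " + pos);
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        RegexTokenizer t = new RegexTokenizer(input);
        while (t.hasNext()) out.add(t.next());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(3 plus 5)*(8/3*7)"));
    }
}
```

Because every token is matched at the current offset, symbols need no surrounding whitespace and are returned like any other token.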

Tags: java regex matcher

1 answer

Rolldiameter

#2 · 2019-07-31 08:24

After spending quite some time figuring out how the Matcher works, I was able to create a tokenizer that is a little more sophisticated than the usual Scanner. Since no one answered, here is the relevant part (since this was a school assignment, I can share the code):

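// Imports needed at the top of the file:
// import java.util.NoSuchElementException;
// import java.util.Scanner;
// import java.util.regex.Matcher;
// import java.util.regex.Pattern;

private Scanner scanner; // not final: it is assigned in Init(), not directly in the constructor
// Symbols only. No trailing '|': an empty alternative would match the empty string everywhere.
private final String delimiter = "\\*|/|-|\\(|\\)";

private final Pattern delim  = Pattern.compile(delimiter);
private Matcher delim_matcher;

// Type of the last token; presumably set by decodePrefix/decodeSymbol (not shown)
private TokenType tokenType;

private String  region;      // current blank-delimited chunk
private int     regionStart; // start of the unread part of the chunk
private int     regionEnd;   // end of the chunk
private int     start;       // start of the last delimiter found
private int     end;         // end of the last delimiter found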

// Called by the constructor (omitted here because it is trivial)
private void Init(){
    scanner = new Scanner(System.in);
    region = scanner.next();
    delim_matcher = delim.matcher(region);
    regionStart = 0;
    regionEnd = region.length();
}

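// Looks for the next symbol in the current matcher region.
// If none is found, start and end both point to the end of the region.
private boolean nextDelimiter(){
    boolean found = delim_matcher.find();
    start = found ? delim_matcher.start() : delim_matcher.regionEnd();
    end = found ? delim_matcher.end() : delim_matcher.regionEnd();
    return found;
}

// True if there is unread text before the delimiter (or before the end of the chunk)
private boolean hasPrefix(){
    return start > regionStart;
}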

public TokenType next() throws NoSuchElementException{
    // Find the next delimiter (symbol) in the current chunk
    boolean found = nextDelimiter();

    if(hasPrefix()){

        // There is something before the delimiter (keyword, identificator etc.)
        decodePrefix( region.substring(regionStart,start) );

        if(found)
            delim_matcher.region(start,regionEnd); // rewind so the symbol is matched on the next call

        regionStart = start; // hasPrefix() is now false
        return tokenType;
    }
    else if(found){

        // The symbol sits right at regionStart: return it as a token
        decodeSymbol( region.substring(start,end) );
        delim_matcher.region(end,regionEnd); // skip the symbol just returned
        regionStart = end;
        return tokenType;
    }
    else{

        // The current chunk is exhausted: fetch the next blank-delimited chunk, if any
        if(scanner.hasNext()){ // the Scanner has already skipped the blanks
            region = scanner.next();
            delim_matcher = delim.matcher(region);
            regionStart = 0;
            regionEnd = region.length();
            return next();
        }else
            return tokenType = EOF;
    }
}

public boolean hasNext() {
    return tokenType != EOF; //EOF is a value of the enum "TokenType"
}

As anticipated, this Tokenizer is more useful than the Scanner class for this task. The Scanner class has the downside of discarding delimiters; since a symbol may act as a delimiter when parsing a program, I don't want symbols to be discarded.

This Tokenizer uses the Scanner to retrieve blank-delimited strings, then applies additional processing to split those strings around symbols.
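The "split around symbols" step on its own can be sketched as a loop over find(), emitting the text before each symbol and then the symbol itself. This is a simplified, stateless illustration rather than the code above; the class name SplitKeepingDelimiters and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitKeepingDelimiters {
    private static final Pattern SYMBOL = Pattern.compile("\\*|/|-|\\(|\\)");

    // Splits one blank-free chunk around symbols, keeping the symbols.
    static List<String> split(String chunk) {
        List<String> parts = new ArrayList<>();
        Matcher m = SYMBOL.matcher(chunk);
        int last = 0; // end of the previous symbol
        while (m.find()) {
            if (m.start() > last) {
                parts.add(chunk.substring(last, m.start())); // text before the symbol
            }
            parts.add(m.group()); // the symbol itself
            last = m.end();
        }
        if (last < chunk.length()) {
            parts.add(chunk.substring(last)); // text after the last symbol
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(8/3*7)")); // [(, 8, /, 3, *, 7, )]
        System.out.println(split("plus"));    // [plus]
    }
}
```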
