I would like to tokenize a string that consists of integers, floats, operators, functions, variables and parentheses. The following example should illustrate the problem:
Current state:
String infix = 4*x+5.2024*(Log(x,y)^z)-300.12
Desired state:
String tokBuf[0]=4
String tokBuf[1]=*
String tokBuf[2]=x
String tokBuf[3]=+
String tokBuf[4]=5.2024
String tokBuf[5]=*
String tokBuf[6]=(
String tokBuf[7]=Log
String tokBuf[8]=(
String tokBuf[9]=x
String tokBuf[10]=,
String tokBuf[11]=y
String tokBuf[12]=)
String tokBuf[13]=^
String tokBuf[14]=z
String tokBuf[15]=)
String tokBuf[16]=-
String tokBuf[17]=300.12
All tips and solutions would be appreciated.
Use Java's StreamTokenizer. The interface is a bit strange, but one gets used to it:
http://docs.oracle.com/javase/7/docs/api/java/io/StreamTokenizer.html
Example code that parses to the requested String list (you probably want to use the tokenizer directly or at least use an Object list so you can store numbers directly as Double):
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public static List<String> tokenize(String s) throws IOException {
    StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
    tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
    tokenizer.ordinaryChar('/'); // Don't treat slash as a comment start.
    List<String> tokBuf = new ArrayList<String>();
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
        switch (tokenizer.ttype) {
            case StreamTokenizer.TT_NUMBER: // nval is always a double, so 4 becomes "4.0"
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD: // variable or function name
                tokBuf.add(tokenizer.sval);
                break;
            default: // operator, parenthesis or comma
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
        }
    }
    return tokBuf;
}
Test run:
System.out.println(tokenize("4*x+5.2024*(Log(x,y)^z)-300.12"));
[4.0, *, x, +, 5.2024, *, (, Log, (, x, ,, y, ), ^, z, ), -, 300.12]
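As suggested above, an Object list lets you keep numbers as Double instead of converting them back to text. The following is only a sketch of that variant; the class name ObjectTokenizer is made up for illustration, and storing operators as Character is an assumption, not something the original code does.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the JDK.
class ObjectTokenizer {
    // Returns Double for numbers, String for names, Character for everything else.
    static List<Object> tokenize(String s) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/'); // Don't treat slash as a comment start.
        List<Object> tokens = new ArrayList<>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            switch (tokenizer.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    tokens.add(tokenizer.nval); // autoboxed to Double
                    break;
                case StreamTokenizer.TT_WORD:
                    tokens.add(tokenizer.sval); // variable or function name
                    break;
                default:
                    tokens.add((char) tokenizer.ttype); // autoboxed to Character
            }
        }
        return tokens;
    }
}
```

A later stage can then tell the token kinds apart with instanceof checks rather than re-parsing strings.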
http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools
Example of algorithm:
step#1: read '4' => numeric token => keep reading chars until a non-numeric symbol is reached (here '*'). The chars just read form tokBuf[0], a numeric token.
step#2: read '*' => the token represents a binary operator.
step#3: read 'x'. It is presumably not a function symbol => mark this token as a var-token.
And so on.
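The steps above could be sketched as a hand-written scanner roughly like this. It is a minimal sketch under some assumptions: the class name HandScanner is made up, a number is any run of digits and dots (so malformed input such as "1.2.3" is not rejected), and names are runs of letters and digits starting with a letter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a sketch, not a validated lexer.
class HandScanner {
    static List<String> tokenize(String infix) {
        List<String> tokBuf = new ArrayList<>();
        int i = 0;
        while (i < infix.length()) {
            char c = infix.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                i++; // skip blanks between tokens
            } else if (Character.isDigit(c) || c == '.') {
                // numeric token: read until a non-numeric symbol is reached
                while (i < infix.length()
                        && (Character.isDigit(infix.charAt(i)) || infix.charAt(i) == '.')) {
                    i++;
                }
                tokBuf.add(infix.substring(start, i));
            } else if (Character.isLetter(c)) {
                // variable or function name
                while (i < infix.length() && Character.isLetterOrDigit(infix.charAt(i))) {
                    i++;
                }
                tokBuf.add(infix.substring(start, i));
            } else {
                // operator, parenthesis or comma: a single-char token
                tokBuf.add(String.valueOf(c));
                i++;
            }
        }
        return tokBuf;
    }
}
```

Unlike the StreamTokenizer version, this keeps the number text as written, so "4" stays "4". To separate functions from variables, one option is to treat a name that is directly followed by "(" as a function.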
The next step is evaluation, I guess? Reverse Polish notation or syntax trees will help...
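To give an idea of why Reverse Polish notation helps, here is a rough sketch of evaluating a token list that is already in postfix order, using a stack. The class name RpnEvaluator is made up, and the sketch assumes only numeric operands and the binary operators + - * / ^; variables, functions and the infix-to-postfix conversion itself are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical class name; handles numbers and binary operators only.
class RpnEvaluator {
    static double evaluate(List<String> postfix) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : postfix) {
            switch (tok) {
                case "+": case "-": case "*": case "/": case "^": {
                    double right = stack.pop(); // top of stack is the right operand
                    double left = stack.pop();
                    stack.push(apply(tok, left, right));
                    break;
                }
                default:
                    stack.push(Double.parseDouble(tok)); // operand
            }
        }
        return stack.pop();
    }

    static double apply(String op, double a, double b) {
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            case "/": return a / b;
            default:  return Math.pow(a, b); // "^"
        }
    }
}
```

For instance, the infix expression 3+4*2 corresponds to the postfix list 3, 4, 2, *, + and evaluates to 11.0.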