I have read about StringTokenizer, StreamTokenizer, Scanner, and Pattern and Matcher from the java.util.regex package. I have also read opinions on them and I am really confused: which one is the best to use?
What I need to do is write an assembler: that is, parse a file containing assembly language and transform it into machine code.
For example if I have the assembly code:
MOV R15,R12
This should translate to the hex numbers corresponding to each instruction and register.
Let's just say that the translation is as follows:
MOV becomes 10 F3
R15 becomes 11 F2
R12 becomes 20 1E
Thus, my output file should be:
10 F3 11 F2 20 1E
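As a rough sketch of the lookup described above, the translation could be table-driven. The byte values are just the made-up ones from this example, and the names OpcodeTable and translate are hypothetical, not from any library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class OpcodeTable {
    // Illustrative encodings copied from the example above; not real opcodes.
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("MOV", "10 F3");
        CODES.put("R15", "11 F2");
        CODES.put("R12", "20 1E");
    }

    // Translates one assembly line into space-separated hex bytes.
    static String translate(String line) {
        List<String> out = new ArrayList<>();
        // Split on any run of whitespace and/or commas.
        for (String token : line.trim().split("[\\s,]+")) {
            String code = CODES.get(token.toUpperCase(Locale.ROOT));
            if (code == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            out.add(code);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(translate("MOV R15,R12")); // prints "10 F3 11 F2 20 1E"
    }
}
```

This only covers the fixed-table case; it does nothing about operands such as immediate values, which is where the tokenizing question comes in.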
Now I need to parse the Assembly file to identify each instruction and what comes after it.
For those who know microcontrollers, there are many ways for an instruction to appear. My question is:
Using Java, what is the best way to split each word of my file into tokens (using any of the aforementioned classes), so that I can find the matching code and write it to a file?
ldi R13,0x31
I need to have ldi in one token, R13 in another and 31 in another.
If your goal is to do a good job parsing, you need to develop a proper BNF and use a real parser/lexer pair. Just hacking around with StringTokenizer or String.split or regex is not going to hack it.
As @trigooner says, you need a proper lexer/parser pair to handle context, although most assembler code doesn't have much of it. But if you are saying "assembler code" as shorthand and are really reading input for a proper macro assembler, then there is context, and when you have context you need a proper implementation.
Most x86 assembler languages are pretty simple. If you are reading code for older systems, say a PDP-10, then you have to handle the complexity.
Well, everything you mentioned is pretty good for simply tokenizing a string or file. StringTokenizer is a legacy class whose use in new code is discouraged (it is not formally deprecated, but its Javadoc recommends String.split() or the java.util.regex package instead), and Scanner is another option. However, I don't think this is what you want. You seem to need a lexer, or at least a lexer-parser, because you want to make sense of the tokens, not just split them on some separator. So either you write your own - if you're on drugs - or you use one of the very good existing tools out there, like ANTLR (http://www.antlr.org/). It's free, but it may be a little hard to use. There's also JavaCC. Good luck!
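To give a feel for what a lexer adds over plain splitting, here is a minimal hand-rolled sketch built on Pattern and Matcher with named groups. It is not ANTLR or JavaCC output, and the token kinds, the AsmLexer name and the register syntax (R followed by digits) are assumptions made for the ldi R13,0x31 example, not a real instruction set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsmLexer {
    // WORD is a mnemonic or label; deciding which is the parser's job.
    enum Kind { WORD, REGISTER, HEX, DECIMAL, COMMA }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // One alternative per token kind, tried in order; leading whitespace is skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<reg>[Rr]\\d+\\b)"
            + "|(?<hex>0[xX][0-9A-Fa-f]+\\b)"
            + "|(?<dec>\\d+\\b)"
            + "|(?<word>[A-Za-z_]\\w*)"
            + "|(?<comma>,))");

    static List<Token> lex(String line) {
        String s = line.trim();
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input at column " + pos + ": " + s);
            }
            if (m.group("reg") != null) {
                tokens.add(new Token(Kind.REGISTER, m.group("reg")));
            } else if (m.group("hex") != null) {
                // Drop the 0x prefix so the token text is just the digits.
                tokens.add(new Token(Kind.HEX, m.group("hex").substring(2)));
            } else if (m.group("dec") != null) {
                tokens.add(new Token(Kind.DECIMAL, m.group("dec")));
            } else if (m.group("word") != null) {
                tokens.add(new Token(Kind.WORD, m.group("word")));
            } else {
                tokens.add(new Token(Kind.COMMA, ","));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints "[WORD(ldi), REGISTER(R13), COMMA(,), HEX(31)]"
        System.out.println(lex("ldi R13,0x31"));
    }
}
```

Unlike split(), this tags each token with a kind and rejects input it does not recognise, which is roughly the part a generated lexer would take over once the grammar grows (comments, labels, macros).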