Which is the best way to parse a file containing A

Posted 2019-09-03 08:27

I have read about StringTokenizer, StreamTokenizer, Scanner, and Pattern and Matcher from the java.util.regex package. I have also read opinions on them, and I am really confused: which one is the best to use?

What I need to do is write an assembler: parse a file containing assembly language and translate it into machine code.

For example if I have the assembly code:

MOV R15,R12

This should translate to the hexadecimal numbers corresponding to each instruction and register.

Let's just say that the translation is as follows:

  • MOV becomes 10 F3
  • R15 becomes 11 F2
  • R12 becomes 20 1E

Thus, my output file should be:

10 F3 11 F2 20 1E
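To make the mapping above concrete, here is a minimal sketch of the lookup step. The class name TableTranslator and its translate method are hypothetical, and the byte values are just the made-up ones from this example, not a real instruction set. It assumes tokens are separated only by whitespace and commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup-table translator for the example encodings above. */
final class TableTranslator {
    // Illustrative encodings only, copied from the example; not a real ISA.
    private static final Map<String, String> ENCODING = Map.of(
            "MOV", "10 F3",
            "R15", "11 F2",
            "R12", "20 1E");

    /** Translates one line such as "MOV R15,R12" into space-separated hex bytes. */
    static String translate(String line) {
        List<String> out = new ArrayList<>();
        // Split on runs of whitespace and/or commas (assumed to be the only separators).
        for (String token : line.trim().split("[\\s,]+")) {
            String code = ENCODING.get(token.toUpperCase(Locale.ROOT));
            if (code == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            out.add(code);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(translate("MOV R15,R12")); // prints "10 F3 11 F2 20 1E"
    }
}
```

A flat table like this only works while every token translates independently of its neighbours; real instruction encodings usually pack the operands into the opcode bytes.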

Now I need to parse the Assembly file to identify each instruction and what comes after it.

For those who know microcontrollers, there are many forms an instruction can take. My question is:

Using Java, which is the best method to split each word from my file into tokens (using any of the aforementioned classes), so that I can find the matching code and write it to a file?

ldi R13,0x31

Here I need ldi in one token, R13 in another, and 31 in another.
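As a rough illustration of the tokenization being asked for, the sketch below uses Scanner with a custom delimiter. The class name LineTokenizer is hypothetical, and dropping the "0x" prefix is an assumption based on wanting 31 rather than 0x31 as the last token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper that splits one assembly line into plain string tokens. */
final class LineTokenizer {
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            // Treat any run of whitespace and/or commas as a separator.
            scanner.useDelimiter("[\\s,]+");
            while (scanner.hasNext()) {
                String token = scanner.next();
                // Assumption: a leading "0x" only marks a hex literal and can be dropped.
                if (token.regionMatches(true, 0, "0x", 0, 2)) {
                    token = token.substring(2);
                }
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ldi R13,0x31")); // prints "[ldi, R13, 31]"
    }
}
```

This only separates the words; it does not say which token is a mnemonic, a register, or a number.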

2 answers
戒情不戒烟
#2 · 2019-09-03 08:57

If your goal is to do a good job of parsing, you need to write a proper BNF grammar and use a real lexer/parser pair. Hacking around with StringTokenizer, String.split, or regexes is not going to cut it.

As @trigooner says, you need a proper lexer/parser to handle context sensitivity, although most assembler code doesn't have much context. But if you are using "assembler code" as shorthand and may really be reading input for a full macro assembler, then there is context to deal with. When you have context, you need a proper implementation.

Most x86 assembler languages are pretty simple. If you are reading code for older systems, say a PDP-10, then you have to handle the complexity.

迷人小祖宗
#3 · 2019-09-03 09:00

Well, everything you mentioned is pretty good for simply tokenizing a string or file. StringTokenizer is a legacy class (not formally deprecated, but its Javadoc discourages its use in new code), and more flexible alternatives like Scanner and even String.split() exist. However, I don't think this is what you want. You seem to need a lexer, or at least a lexer-parser, because you want to make sense of the tokens, not just split them on some separator. So either you write your own, if you're on drugs, or just use one of the very good existing tools out there, like ANTLR (http://www.antlr.org/). It's free too, but may be a little hard to use. There's also JavaCC. Good luck!
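To show the difference between splitting and lexing, here is a small hand-written sketch. It is not ANTLR or JavaCC, just java.util.regex with named groups, and the class name AsmLexer, the token kinds, and the token syntax (registers as R followed by digits, numbers as decimal or 0x hex) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based lexer that labels each token with a kind. */
final class AsmLexer {
    enum Kind { REGISTER, NUMBER, MNEMONIC, COMMA }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // Alternatives are tried left to right, so REGISTER and NUMBER come before MNEMONIC.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<REGISTER>[Rr]\\d+\\b)"
            + "|(?<NUMBER>0[xX][0-9a-fA-F]+|\\d+)"
            + "|(?<MNEMONIC>[A-Za-z]\\w*)"
            + "|(?<COMMA>,))");

    static List<Token> lex(String line) {
        String s = line.trim(); // trailing whitespace would otherwise fail to match
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                        "Unexpected input at or after column " + pos + ": " + s);
            }
            // Exactly one named group is non-null for a successful match.
            for (Kind kind : Kind.values()) {
                String text = m.group(kind.name());
                if (text != null) {
                    tokens.add(new Token(kind, text));
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints "[MNEMONIC(ldi), REGISTER(R13), COMMA(,), NUMBER(0x31)]"
        System.out.println(lex("ldi R13,0x31"));
    }
}
```

Something like this may be enough for a flat instruction list; once labels, macros, or expressions show up, a generated lexer/parser is likely the easier route.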
