I need to read instructions one-by-one from a small code segment in memory and I have to find out the size of the instructions which I have in memory.
The following is just an example of raw disassembled code to explain my problem:
(gdb) disas /r 0x400281,+8
Dump of assembler code from 0x400281 to 0x400289:
0x0000000000400281: 48 89 c7 movq %rax, %rdi
0x0000000000400284: b0 00 movb $0, %al
0x0000000000400286: e8 f2 48 00 00 callq 0x10001f30a
End of assembler dump.
I know the memory address of the first instruction (p = 0x0000000000400281 in this case) and I can read every memory address from p. The problem is that I cannot know if the value of *(p + offset) is the opcode or not and I know that the size information for every opcode is not fixed.
So, can I get the size of every assembly instruction? Or can I know if the value that I read is opcode or information?
There's a small disassembly library called udis86: http://udis86.sourceforge.net/.
It's small and has decent documentation. If you set the translator to NULL via ud_set_syntax, then the function ud_disassemble should only decode the instruction and return the number of bytes.
There is the XED library from Intel for working with x86/x86_64 instructions: https://github.com/intelxed/xed, and it is the only correct way to work with Intel machine code in both x86 and x86_64 modes. It is used by Intel (and was part of their Pin): https://software.intel.com/en-us/articles/xed-x86-encoder-decoder-software-library
XED User Guide (2014): https://software.intel.com/sites/landingpage/pintool/docs/67254/Xed/html/main.html
XED2 User Guide (2011): https://software.intel.com/sites/landingpage/pintool/docs/56759/Xed/html/main.html
The xed_decode function will provide you all information about an instruction: https://intelxed.github.io/ref-manual/group__DEC.html https://intelxed.github.io/ref-manual/group__DEC.html#ga9a27c2bb97caf98a6024567b261d0652
And xed_ild_decode will only decode an instruction for its length: https://intelxed.github.io/ref-manual/group__DEC.html#ga4bef6152f61997a47c4e0fe4327a3254
To get the length from the xedd struct filled by xed_ild_decode, use xed_decoded_inst_get_length: https://intelxed.github.io/ref-manual/group__DEC.html#gad1051f7b86c94d5670f684a6ea79fcdf
Example code ("Apache License, Version 2.0", by Intel 2016): https://github.com/intelxed/xed/blob/master/examples/xed-ex-ild.c
Any other solution, like manual prefix/opcode parsing or using a third-party disassembler, may give you wrong results in some rare cases. We don't know which library is used inside Intel to verify their hardware instruction decoders, but XED is the library used by their software decoders in various binary tools. The ild decoder of XED has more than 1600 lines of code: https://github.com/intelxed/xed/blob/master/src/dec/xed-ild.c, and should be more precise than any other library.
Decoding instructions is not that complicated in principle. However, because the Intel family of processors is CISC, the task is rather daunting.
First of all, you should not write it in assembler, because it's going to take you a year or two, but maybe you have the time to do that. Since you only need to scan the code, not print out the results, you can do the work much faster than an actual disassembler would. That being said, you'll bump into the same main problems.
First of all, the manuals are there:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?iid=tech_vt_tech+64-32_manuals
I suggest this one:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
Then, all you have to do is read one byte and understand it. You have a table on page 770 that shows you the encoding from the op-code to the instruction.
So for example, 0x33 represents an XOR with Gv,Ev as parameters. G means a general register selected by the reg field of the ModR/M byte that follows. The v is the operand size (word, doubleword or quadword, depending on the operand-size attribute; a b would mean byte, as in 0x32, XOR Gb,Eb). The E means that the operand is described by that same ModR/M byte (one byte serves both G and E). So you will have to read that one byte to determine the addressing mode, and from that you can determine the register (can be ignored) and the size of what follows. The operand (Ev) may be another register (then no extra byte) or a memory address, in which case a SIB byte and a displacement (1, 2 or 4 bytes) may follow. Immediate data (1, 2, 4 or 8 bytes), for the instructions that take it, comes after that. Pretty simple, right? Note that ALL instructions that have a ModR/M byte use the exact same format, so you have to implement that just once. Also the order in which bytes are added after the instruction code is always exactly the same.
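As a rough sketch of that ModR/M step in C: the function name modrm_length is made up for illustration, and the code assumes 32- or 64-bit addressing (no 0x67 prefix selecting 16-bit addressing). It only counts the ModR/M byte, an optional SIB byte and the displacement; any immediate that follows is the caller's business.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes occupied by the ModR/M byte, the
 * optional SIB byte and the displacement. p points at the ModR/M byte.
 * Assumes 32- or 64-bit addressing. Returns 0 if the ModR/M or SIB byte
 * itself is not available; the caller must still check that the whole
 * returned length fits in the buffer. */
size_t modrm_length(const uint8_t *p, size_t avail)
{
    if (avail < 1)
        return 0;

    uint8_t mod = p[0] >> 6;   /* top two bits: addressing mode */
    uint8_t rm  = p[0] & 7;    /* low three bits: register or memory form */
    size_t len  = 1;           /* the ModR/M byte itself */

    if (mod == 3)              /* register operand: nothing follows */
        return len;

    if (rm == 4) {             /* rm == 4 means a SIB byte follows */
        if (avail < 2)
            return 0;
        len += 1;
        if (mod == 0 && (p[1] & 7) == 5)
            len += 4;          /* SIB with no base register: disp32 */
    } else if (mod == 0 && rm == 5) {
        len += 4;              /* disp32 only (RIP-relative in 64-bit mode) */
    }

    if (mod == 1)
        len += 1;              /* disp8 */
    else if (mod == 2)
        len += 4;              /* disp32 */

    return len;
}
```

The register number in the reg field is never looked at, since it does not change the length.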
For 64-bit instructions there is also the REX prefix, which comes immediately before the opcode, after any other prefixes. That one selects the 64-bit operand size and adds support for the extended registers. All of that is described in detail in the document I mentioned earlier.
More or less, you need your parser to understand the ModR/M, SIB, prefixes, and voilà. It's not that complicated. Then the first byte tells you the instruction (first 2 bytes if the first byte is 0x0F...)
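To make that concrete, here is a deliberately tiny sketch in C that is just enough to walk the three instructions from the question in 64-bit mode. The names insn_length and modrm_bytes are made up, the opcode list is an arbitrary handful picked for the example, legacy prefixes are not handled, and there is no bounds checking, so treat it as an illustration of the shape of the parser and not as a usable decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes used by ModR/M + optional SIB + displacement, assuming 32/64-bit
 * addressing. p points at the ModR/M byte. */
static size_t modrm_bytes(const uint8_t *p)
{
    uint8_t mod = p[0] >> 6, rm = p[0] & 7;
    size_t n = 1;

    if (mod == 3)
        return n;                       /* register form */
    if (rm == 4) {
        n++;                            /* SIB byte */
        if (mod == 0 && (p[1] & 7) == 5)
            n += 4;                     /* no base register: disp32 */
    } else if (mod == 0 && rm == 5) {
        n += 4;                         /* disp32 (RIP-relative) */
    }
    if (mod == 1)
        n += 1;                         /* disp8 */
    if (mod == 2)
        n += 4;                         /* disp32 */
    return n;
}

/* Hypothetical mini decoder: length of one instruction in 64-bit mode,
 * or 0 for anything outside the small subset handled here. */
size_t insn_length(const uint8_t *p)
{
    size_t n = 0;

    if ((p[n] & 0xF0) == 0x40)          /* optional REX prefix (0x40..0x4F) */
        n++;

    uint8_t op = p[n++];

    if (op == 0x89 || op == 0x8B || op == 0x31 || op == 0x33)
        return n + modrm_bytes(p + n);  /* mov/xor with a ModR/M operand */
    if (op >= 0xB0 && op <= 0xB7)
        return n + 1;                   /* mov r8, imm8 */
    if (op == 0xE8 || op == 0xE9)
        return n + 4;                   /* call/jmp rel32 */
    if (op == 0x90 || op == 0xC3)
        return n;                       /* nop, ret */
    return 0;                           /* unknown to this sketch */
}
```

Called repeatedly on the bytes 48 89 c7 b0 00 e8 f2 48 00 00 from the question, advancing the pointer by the returned length each time, it yields 3, 2 and 5, which matches the gdb dump. A real decoder replaces the if chain with a table covering the whole opcode map.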
Some instructions also support prefixes to tweak the size of the operands and other similar things. As far as I know, only the 0x66 (op size) and 0x67 (addr size) have an effect on the size of the address and immediate data. The other prefixes will not affect the number of bytes used by the instruction so you can simply ignore them (well count them, but no need to know what follows).
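A minimal sketch of that prefix step in C, with made-up names (prefix_info, scan_prefixes): it skips the legacy prefix bytes, counts them, and only remembers whether 0x66 or 0x67 was seen. The REX prefix of 64-bit mode is not a legacy prefix and is left for the next stage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the prefix scan. */
struct prefix_info {
    size_t count;    /* number of legacy prefix bytes skipped */
    bool opsize;     /* 0x66 (operand-size override) was seen */
    bool addrsize;   /* 0x67 (address-size override) was seen */
};

static bool is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:              /* lock, repne, rep */
    case 0x26: case 0x2E: case 0x36: case 0x3E:   /* es, cs, ss, ds overrides */
    case 0x64: case 0x65:                         /* fs, gs overrides */
    case 0x66: case 0x67:                         /* operand/address size */
        return true;
    default:
        return false;
    }
}

/* Skip legacy prefixes at p (at most avail bytes) and report what was found.
 * The opcode, or a REX prefix in 64-bit mode, starts at p + count. */
struct prefix_info scan_prefixes(const uint8_t *p, size_t avail)
{
    struct prefix_info info = {0, false, false};

    while (info.count < avail && is_legacy_prefix(p[info.count])) {
        if (p[info.count] == 0x66)
            info.opsize = true;
        if (p[info.count] == 0x67)
            info.addrsize = true;
        info.count++;
    }
    return info;
}
```

The opsize and addrsize flags are then what the later stages consult when choosing the immediate and displacement sizes; everything else is only counted.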
All of that said, using the LLVM library (as someone mentioned in the comments) is probably a better/easier option, although it may be much bigger than what you need if your use case is limited.
@AlexisWilke's response is right: this is messy. He provides the right insights and references to do the work, too.
I have done this work in C. The code follows; this is used in production contexts.
Caveats: It handles a good part of the traditional x86 instruction set, but not all of it; in particular, it handles none of the instructions involving the vector register sets. And it contains decoding for a few "virtual" instructions that we happen to use in our code. I don't think extending this to x86-64 would be difficult, but it would get messier. Lastly, this is lifted directly from our code, and I make no guarantees that it will compile out of the box.