is there a way to read given amount of instructions from a binary executable file on x86 architecture programmatically?
If I had a binary of a simple C program hello.c
:
#include <stdio.h>
int main(){
printf("Hello world\n");
return 0;
}
Where after compilation using gcc
, the disassembled function main
looks like this:
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 6e4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts@plt>
64a: b8 00 00 00 00 mov $0x0,%eax
64f: 5d pop %rbp
650: c3 retq
651: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
658: 00 00 00
65b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Is there an easy way in C to read for example first three instructions (meaning the bytes 55, 48, 89, e5, 48, 8d, 3d, 9f, 00, 00, 00
) from main
? It is not guaranteed that the function looks like this - the first instructions may have all different opcodes and sizes.
this prints the 10 first bytes of the
main
function by taking the address of the function and converting to a pointer ofunsigned char
, print in hex.This small snippet doesn't count the instructions. For this you would need an instruction size table (not very difficult, just tedious unless you find the table already done, What is the size of each asm instruction?) to be able to predict the size of each instruction given the first byte.
(unless of course, the processor you're targetting has a fixed instruction size, which makes the problem trivial to solve)
Debuggers have to decode operands as well, but in some cases like step or trace, I suspect they have a table handy to compute the next breakpoint address.
output:
seems to match the disassembly :)
intentionally wrong architecture, architecture doesnt matter file format matters. built this into an elf file format, which is very popular, and is simply a file format which is what I understood your question to be, to read a file, not modify the binary to read the program runtime from memory.
it is very much popular and there are tools that do it which you appear to know how to run.
if I hexdump the file though
can google the file format and find a lot of info at wikipedia, with a smidge more at one of the links
useful header information
so if I look at the beginning of the sections there are shnum number of them
0x1260 strtab type offset 0x1130 which is broken into null terminated strings until you hit a double null
main is at address 0x115F in the file which is offset 0x2F in the strtab.
0x1238 symtab starts at 0x1030, 0x10 or 16 bytes per entry
000010d0 2f 00 00 00 has the 0x2f offset in the symbol table so this is main, from this entry the address 08 10 00 00 or 0x1008 in the processors memory, unfortunately due to the values I chose it happens to also be the file offset, dont get that confused.
this section is type 00000001 PROGBITS
here is the program, the machine code.
so starting at memory offset 0x1008 which is 8 bytes after the entry point (unfortunately I picked a bad address to use) we need to go 0x8 bytes offset into this data
this is all very file dependent, the cpu could care less about labels, main only means something to the humans, not the cpu.
If I convert the elf into other formats which are perfectly executable:
motorola s record:
raw binary image
The instruction bytes of interest are of course there, but the symbol information isnt. It depends on the file format you are interested in as to 1) if you can find "main" and then 2) print out the first few bytes at that address.
Hmm, a bit disturbing, but if you link for 0x2000 gnu ld burns some disk space and puts the offset at 0x2000, but choose 0x20000000 and it burns more disk space but not as much
shows the file offset is 0x010010 but the address in target space is 0x20000008
just to demonstrate/enforce the file offset and the target memory space address are two different things.
this is a very nice format for what you are wanting to do