Is there a way to programmatically read a given number of instructions from a binary executable file on the x86 architecture?
If I had a binary of a simple C program hello.c:
#include <stdio.h>
int main(){
    printf("Hello world\n");
    return 0;
}
After compilation with gcc, the disassembled function main looks like this:
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 6e4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts@plt>
64a: b8 00 00 00 00 mov $0x0,%eax
64f: 5d pop %rbp
650: c3 retq
651: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
658: 00 00 00
65b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Is there an easy way in C to read, for example, the first three instructions (meaning the bytes 55, 48, 89, e5, 48, 8d, 3d, 9f, 00, 00, 00) from main? It is not guaranteed that the function looks like this: the first instructions may have entirely different opcodes and sizes.
The program below prints the first 10 bytes of the main function by taking the address of the function, converting it to a pointer to unsigned char, and printing each byte in hex.
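This small snippet doesn't count the instructions. For that you would need an instruction size table (not very difficult, just tedious unless you find the table already done, see "What is the size of each asm instruction?") to be able to predict the size of each instruction from its leading bytes. On x86 the first byte alone is not enough: in the question's listing, 48 89 e5 and 48 8d 3d 9f 00 00 00 both start with 48 but have different lengths.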
(unless, of course, the processor you're targeting has a fixed instruction size, which makes the problem trivial to solve)
Debuggers have to decode operands as well, but in some cases like step or trace, I suspect they have a table handy to compute the next breakpoint address.
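#include <stdio.h>

int main(void)
{
    printf("Hello world\n");
    /* Converting a function pointer to an object pointer is not guaranteed
       by ISO C, but gcc on x86 accepts it. */
    const unsigned char *start = (const unsigned char *)&main;
    int i;

    /* Dump the first 10 bytes of main, one per line, in hex. */
    for (i = 0; i < 10; i++)
    {
        printf("%x\n", start[i]);
    }
    return 0;
}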
output:
Hello world
55
89
e5
83
e4
f0
83
ec
20
e8
seems to match the disassembly :)
00401630 <_main>:
401630: 55 push %ebp
401631: 89 e5 mov %esp,%ebp
401633: 83 e4 f0 and $0xfffffff0,%esp
401636: 83 ec 20 sub $0x20,%esp
401639: e8 a2 01 00 00 call 4017e0 <___main>
.globl _start
_start:
bl main
b .
.globl main
main:
add r1,#1
add r2,#1
add r3,#1
add r4,#1
b main
This is intentionally the wrong architecture (ARM rather than x86): the architecture doesn't matter, the file format does. I built this into an ELF file, which is a very popular file format, because I understood your question to be about reading a file, not about modifying the program so that it reads itself from memory at runtime.
There are tools that parse this format, which you appear to know how to run.
Disassembly of section .text:
00001000 <_start>:
1000: eb000000 bl 1008 <main>
1004: eafffffe b 1004 <_start+0x4>
00001008 <main>:
1008: e2811001 add r1, r1, #1
100c: e2822001 add r2, r2, #1
1010: e2833001 add r3, r3, #1
1014: e2844001 add r4, r4, #1
1018: eafffffa b 1008 <main>
If I hexdump the file, though:
00000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 28 00 01 00 00 00 00 10 00 00 34 00 00 00 |..(.........4...|
00000020 c0 11 00 00 00 02 00 05 34 00 20 00 01 00 28 00 |........4. ...(.|
00000030 06 00 05 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
00000040 00 00 00 00 1c 10 00 00 1c 10 00 00 05 00 00 00 |................|
00000050 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 00 00 00 eb fe ff ff ea 01 10 81 e2 01 20 82 e2 |............. ..|
00001010 01 30 83 e2 01 40 84 e2 fa ff ff ea 41 11 00 00 |.0...@......A...|
00001020 00 61 65 61 62 69 00 01 07 00 00 00 08 01 00 00 |.aeabi..........|
00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00001040 00 00 00 00 00 10 00 00 00 00 00 00 03 00 01 00 |................|
00001050 00 00 00 00 00 00 00 00 00 00 00 00 03 00 02 00 |................|
00001060 01 00 00 00 00 00 00 00 00 00 00 00 04 00 f1 ff |................|
00001070 06 00 00 00 00 10 00 00 00 00 00 00 00 00 01 00 |................|
00001080 18 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
00001090 09 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
000010a0 17 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
000010b0 55 00 00 00 00 10 00 00 00 00 00 00 10 00 01 00 |U...............|
000010c0 23 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |#...............|
000010d0 2f 00 00 00 08 10 00 00 00 00 00 00 10 00 01 00 |/...............|
000010e0 34 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |4...............|
000010f0 3c 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |<...............|
00001100 43 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |C...............|
00001110 48 00 00 00 00 00 08 00 00 00 00 00 10 00 01 00 |H...............|
00001120 4f 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |O...............|
00001130 00 73 6f 2e 6f 00 24 61 00 5f 5f 62 73 73 5f 73 |.so.o.$a.__bss_s|
00001140 74 61 72 74 5f 5f 00 5f 5f 62 73 73 5f 65 6e 64 |tart__.__bss_end|
00001150 5f 5f 00 5f 5f 62 73 73 5f 73 74 61 72 74 00 6d |__.__bss_start.m|
00001160 61 69 6e 00 5f 5f 65 6e 64 5f 5f 00 5f 65 64 61 |ain.__end__._eda|
00001170 74 61 00 5f 65 6e 64 00 5f 73 74 61 63 6b 00 5f |ta._end._stack._|
00001180 5f 64 61 74 61 5f 73 74 61 72 74 00 00 2e 73 79 |_data_start...sy|
00001190 6d 74 61 62 00 2e 73 74 72 74 61 62 00 2e 73 68 |mtab..strtab..sh|
000011a0 73 74 72 74 61 62 00 2e 74 65 78 74 00 2e 41 52 |strtab..text..AR|
000011b0 4d 2e 61 74 74 72 69 62 75 74 65 73 00 00 00 00 |M.attributes....|
000011c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000011e0 00 00 00 00 00 00 00 00 1b 00 00 00 01 00 00 00 |................|
000011f0 06 00 00 00 00 10 00 00 00 10 00 00 1c 00 00 00 |................|
00001200 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 |................|
00001210 21 00 00 00 03 00 00 70 00 00 00 00 00 00 00 00 |!......p........|
00001220 1c 10 00 00 12 00 00 00 00 00 00 00 00 00 00 00 |................|
00001230 01 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00 |................|
00001240 00 00 00 00 00 00 00 00 30 10 00 00 00 01 00 00 |........0.......|
00001250 04 00 00 00 05 00 00 00 04 00 00 00 10 00 00 00 |................|
00001260 09 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 |................|
00001270 30 11 00 00 5c 00 00 00 00 00 00 00 00 00 00 00 |0...\...........|
00001280 01 00 00 00 00 00 00 00 11 00 00 00 03 00 00 00 |................|
00001290 00 00 00 00 00 00 00 00 8c 11 00 00 31 00 00 00 |............1...|
000012a0 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
000012b0
You can google the file format and find a lot of information on Wikipedia, with a smidge more at one of the links there.
Useful header information:
00 10 00 00 entry
34 00 00 00 phoff
c0 11 00 00 shoff
00 02 00 05 flags
34 00 ehsize
20 00 phentsize
01 00 phnum
28 00 shentsize
06 00 shnum
05 00 shstrndx
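The header fields listed above sit at fixed offsets in a 32-bit little-endian ELF header, so they can be pulled out of the first 0x34 bytes of the file directly. The following is only a sketch under that assumption; `struct elf32_info` and `parse_elf32_header` are names made up for this illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values byte by byte so that the host's endianness
   and alignment rules do not matter. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical struct: only the fields used in this walkthrough. */
struct elf32_info {
    uint32_t entry, phoff, shoff, flags;
    uint16_t ehsize, phentsize, phnum, shentsize, shnum, shstrndx;
};

/* Returns 0 on success, -1 if buf is not a 32-bit little-endian ELF header. */
static int parse_elf32_header(const uint8_t *buf, size_t len,
                              struct elf32_info *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 0x34)
        return -1;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;
    if (buf[4] != 1 || buf[5] != 1)   /* ELFCLASS32, ELFDATA2LSB */
        return -1;
    out->entry     = le32(buf + 0x18);
    out->phoff     = le32(buf + 0x1c);
    out->shoff     = le32(buf + 0x20);
    out->flags     = le32(buf + 0x24);
    out->ehsize    = le16(buf + 0x28);
    out->phentsize = le16(buf + 0x2a);
    out->phnum     = le16(buf + 0x2c);
    out->shentsize = le16(buf + 0x2e);
    out->shnum     = le16(buf + 0x30);
    out->shstrndx  = le16(buf + 0x32);
    return 0;
}
```

Fed the first 0x34 bytes of the hexdump, this should give shoff 0x11c0, shentsize 0x28 and shnum 6, which are the three values needed to walk the section headers.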
So if I look at the section headers, which start at shoff (0x11C0) and are shentsize (0x28) bytes each, there are shnum (6) of them. The first 20 bytes of each are:
0x11C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
0x1210 21 00 00 00 03 00 00 70 00 00 00 00 00 00 00 00 1c 10 00 00
0x1238 01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 30 10 00 00
0x1260 09 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 30 11 00 00
0x1288 11 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 8c 11 00 00
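Each of those rows is the start of a 40-byte ELF32 section header made of ten 32-bit little-endian fields. A minimal sketch of locating and decoding one, assuming that layout; `struct shdr32`, `shdr_file_pos` and `parse_shdr32` are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical mirror of the ten 32-bit fields of an ELF32 section header. */
struct shdr32 {
    uint32_t name, type, flags, addr, offset, size;
    uint32_t link, info, addralign, entsize;
};

/* File position of section header number `index`. */
static uint32_t shdr_file_pos(uint32_t shoff, uint16_t shentsize, uint16_t index)
{
    return shoff + (uint32_t)shentsize * index;
}

/* Decode one 40-byte section header starting at p. */
static void parse_shdr32(const uint8_t *p, struct shdr32 *s)
{
    assert(p != NULL && s != NULL);
    s->name      = le32(p + 0);    /* offset of the name in .shstrtab */
    s->type      = le32(p + 4);    /* 1 PROGBITS, 2 SYMTAB, 3 STRTAB */
    s->flags     = le32(p + 8);
    s->addr      = le32(p + 12);   /* address in target memory */
    s->offset    = le32(p + 16);   /* position in the file */
    s->size      = le32(p + 20);
    s->link      = le32(p + 24);
    s->info      = le32(p + 28);
    s->addralign = le32(p + 32);
    s->entsize   = le32(p + 36);   /* per-entry size for table sections */
}
```

With shoff 0x11C0 and shentsize 0x28, header 1 lands at 0x11E8 and header 3 at 0x1238, matching the rows above.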
The header at 0x1260 is of type strtab (3) with file offset 0x1130. The table is broken into null-terminated strings until you hit a double null (its size field, 0x5c, gives the exact end):
[0] 00
[1] 73 6f 2e 6f 00 so.o
[2] 24 61 00 $a
[3] 5f 5f 62 73 73 5f 73 74 61 72 74 5f 5f 00 __bss_start__
[4] 5f 5f 62 73 73 5f 65 6e 64 5f 5f 00 __bss_end__
[5] 5f 5f 62 73 73 5f 73 74 61 72 74 00 __bss_start
[6] 6d 61 69 6e 00 main
...
main is at offset 0x115F in the file, which is offset 0x2F into the strtab.
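The string table walk above can be sketched as a small search that steps from one null-terminated string to the next and reports the offset where a name starts. `strtab_find` is a hypothetical helper written for this example, and it only matches whole strings as they are laid out here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of `name` inside the string table `tab` of `size`
   bytes, or -1 if no string in the table equals `name`. */
static long strtab_find(const char *tab, size_t size, const char *name)
{
    assert(tab != NULL && name != NULL);
    size_t off = 0;
    while (off < size) {
        const char *end = memchr(tab + off, '\0', size - off);
        if (end == NULL)
            break;                          /* unterminated tail: stop */
        if (strcmp(tab + off, name) == 0)
            return (long)off;
        off = (size_t)(end - tab) + 1;      /* step past the terminator */
    }
    return -1;
}
```

Real tools usually go the other way around: they take the name offset stored in each symbol table entry and compare the string found there with the wanted name.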
The header at 0x1238 is the symtab (type 2). It starts at file offset 0x1030, with 0x10 (16) bytes per entry:
00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00001040 00 00 00 00 00 10 00 00 00 00 00 00 03 00 01 00 |................|
00001050 00 00 00 00 00 00 00 00 00 00 00 00 03 00 02 00 |................|
00001060 01 00 00 00 00 00 00 00 00 00 00 00 04 00 f1 ff |................|
00001070 06 00 00 00 00 10 00 00 00 00 00 00 00 00 01 00 |................|
00001080 18 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
00001090 09 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
000010a0 17 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |................|
000010b0 55 00 00 00 00 10 00 00 00 00 00 00 10 00 01 00 |U...............|
000010c0 23 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |#...............|
000010d0 2f 00 00 00 08 10 00 00 00 00 00 00 10 00 01 00 |/...............|
000010e0 34 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |4...............|
000010f0 3c 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |<...............|
00001100 43 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |C...............|
00001110 48 00 00 00 00 00 08 00 00 00 00 00 10 00 01 00 |H...............|
00001120 4f 00 00 00 1c 10 01 00 00 00 00 00 10 00 01 00 |O...............|
The entry at 0x10d0 starts with 2f 00 00 00, the 0x2f name offset into the strtab, so this is main. From this entry the address is 08 10 00 00, or 0x1008 in the processor's memory. Unfortunately, due to the values I chose, it happens to also be the file offset; don't get the two confused.
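A 16-byte ELF32 symbol table entry has a fixed layout: name offset, value, size, then one byte each of info and other, and a 16-bit section index. A hedged sketch of decoding the entry for main; `struct sym32`, `sym_file_pos` and `parse_sym32` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical mirror of a 16-byte ELF32 symbol table entry. */
struct sym32 {
    uint32_t name;    /* offset of the name in the strtab */
    uint32_t value;   /* address in target memory */
    uint32_t size;
    uint8_t  info;    /* binding in the high nibble, type in the low */
    uint8_t  other;
    uint16_t shndx;   /* index of the section that holds the symbol */
};

/* File position of symbol number `index` in a table at `symtab_off`. */
static uint32_t sym_file_pos(uint32_t symtab_off, uint32_t entsize, uint32_t index)
{
    return symtab_off + entsize * index;
}

/* Decode one 16-byte entry starting at p. */
static void parse_sym32(const uint8_t *p, struct sym32 *s)
{
    assert(p != NULL && s != NULL);
    s->name  = le32(p + 0);
    s->value = le32(p + 4);
    s->size  = le32(p + 8);
    s->info  = p[12];
    s->other = p[13];
    s->shndx = (uint16_t)(p[14] | (p[15] << 8));
}
```

Scanning the table then amounts to decoding every entry and keeping the one whose name offset is 0x2f, which here is entry number 10 at file offset 0x10d0.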
The entry ends with 01 00, the index of the section that holds main. Section 1 is the header at 0x11E8, type 00000001 (PROGBITS):
0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
It sits at offset 0x1000 in the file and is 0x1C bytes long.
Here is the program, the machine code:
00001000 00 00 00 eb fe ff ff ea 01 10 81 e2 01 20 82 e2
00001010 01 30 83 e2 01 40 84 e2 fa ff ff ea 41 11
So starting at memory address 0x1008, which is 8 bytes after the section's address 0x1000 (also the entry point here; unfortunately I picked a bad address to use), we need to go 0x8 bytes into this data:
01 10 81 e2 01 20 82 e2
00001008 <main>:
1008: e2811001 add r1, r1, #1
100c: e2822001 add r2, r2, #1
1010: e2833001 add r3, r3, #1
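Because this ARM code uses fixed 4-byte instructions, "the first N instructions" is simply the first 4*N bytes, each group read as a little-endian 32-bit word. A minimal sketch under that assumption (it would not work for variable-length x86 code); `read_fixed_width` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy up to `count` fixed-width 32-bit little-endian instructions from
   `code` (of `len` bytes) into `out`. Returns how many complete
   instructions were actually read. */
static size_t read_fixed_width(const uint8_t *code, size_t len,
                               size_t count, uint32_t *out)
{
    assert(code != NULL && out != NULL);
    size_t n = 0;
    while (n < count && (n + 1) * 4 <= len) {
        const uint8_t *p = code + n * 4;
        out[n] = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        n++;
    }
    return n;
}
```

Called on the section data starting 8 bytes in, with a count of 3, this should yield the words e2811001, e2822001 and e2833001 from the disassembly.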
This is all very file-format dependent. The CPU couldn't care less about labels; main only means something to humans, not to the CPU.
If I convert the elf into other formats which are perfectly executable:
Motorola S-record:
S00A0000736F2E7372656338
S1131000000000EBFEFFFFEA011081E2012082E212
S10F1010013083E2014084E2FAFFFFEAB1
S9031000EC
raw binary image
hexdump -C so.bin
00000000 00 00 00 eb fe ff ff ea 01 10 81 e2 01 20 82 e2 |............. ..|
00000010 01 30 83 e2 01 40 84 e2 fa ff ff ea |.0...@......|
0000001c
The instruction bytes of interest are of course there, but the symbol information isn't. Whether you can 1) find "main" and then 2) print out the first few bytes at that address depends on the file format you are interested in.
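For the S-record form, each S1/S2/S3 line carries a byte count, a load address, the data bytes and a checksum (the ones' complement of the low byte of the sum of count, address and data). Here is a rough sketch of pulling the data out of one line; `srec_data` is a made-up name and the parser ignores every other record type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Two hex digits to a byte value, or -1 on a non-hex character. */
static int hexbyte(const char *s)
{
    int hi = hexval(s[0]);
    if (hi < 0) return -1;
    int lo = hexval(s[1]);
    if (lo < 0) return -1;
    return hi * 16 + lo;
}

/* Decode one S1/S2/S3 data record. Stores the load address in *addr and
   the payload in data[0..cap-1]. Returns the number of data bytes, or -1
   for other record types, malformed lines, or a bad checksum. */
static int srec_data(const char *line, uint32_t *addr, uint8_t *data, size_t cap)
{
    assert(line != NULL && addr != NULL && data != NULL);
    if (line[0] != 'S' || line[1] < '1' || line[1] > '3')
        return -1;
    int alen = line[1] - '1' + 2;            /* S1: 2, S2: 3, S3: 4 address bytes */
    int count = hexbyte(line + 2);           /* address + data + checksum bytes */
    if (count < alen + 1)
        return -1;
    if (strlen(line) < (size_t)(4 + 2 * count))
        return -1;

    unsigned sum = (unsigned)count;
    int n = 0;
    *addr = 0;
    for (int i = 0; i < count; i++) {
        int b = hexbyte(line + 4 + 2 * i);
        if (b < 0)
            return -1;
        if (i == count - 1)                  /* last byte is the checksum */
            return ((sum + (unsigned)b) & 0xff) == 0xff ? n : -1;
        sum += (unsigned)b;
        if (i < alen) {
            *addr = (*addr << 8) | (uint32_t)b;
        } else {
            if ((size_t)n >= cap)
                return -1;
            data[n++] = (uint8_t)b;
        }
    }
    return -1;
}
```

The first S1 line above should decode to 16 bytes loaded at 0x1000, so main's code at 0x1008 starts at index 8 of that payload.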
Hmm, a bit disturbing: if you link for address 0x2000, GNU ld burns some disk space and puts the code at file offset 0x2000 as well. Choose 0x20000000 and it burns more disk space, but nowhere near that much:
000100d0 2f 00 00 00 08 00 00 20 00 00 00 00 10 00 01 00
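The symbol entry shows that main's address in target space is 0x20000008, but in the file the code sits around 0x10010 (the dump line below starts at file offset 0x10010 with the third instruction of main):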
00010010 01 30 83 e2 01 40 84 e2 fa ff ff ea 41 11 00 00
00010020 00 61 65 61 62 69 00 01 07 00 00 00 08 01
This is just to demonstrate and reinforce that the file offset and the target memory address are two different things.
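The translation between the two is the symbol's address minus the section's address, added to the section's file offset. A small sketch of that arithmetic; `vaddr_to_file_offset` is a hypothetical helper, and the 0x10000 section offset used for the second link is inferred from the dump line at 0x10010 rather than read from a header shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Translate a target-memory address into a file offset using the addr,
   offset and size fields of the section that contains it.
   Returns 0 on success, -1 if vaddr is outside the section. */
static int vaddr_to_file_offset(uint32_t vaddr, uint32_t sh_addr,
                                uint32_t sh_offset, uint32_t sh_size,
                                uint32_t *file_off)
{
    assert(file_off != NULL);
    if (vaddr < sh_addr || vaddr - sh_addr >= sh_size)
        return -1;
    *file_off = sh_offset + (vaddr - sh_addr);
    return 0;
}
```

For the first link the address and the file offset both come out as 0x1008; for the 0x20000000 link the address 0x20000008 maps to file offset 0x10008 under the assumption above.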
This is a very nice format for what you want to do:
arm-none-eabi-objcopy -O symbolsrec so.elf so.srec
cat so.srec
$$ so.srec
$a $20000000
_bss_end__ $2001001c
__bss_start__ $2001001c
__bss_end__ $2001001c
_start $20000000
__bss_start $2001001c
main $20000008
__end__ $2001001c
_edata $2001001c
_end $2001001c
_stack $80000
__data_start $2001001c
$$
S0090000736F2E686578A1
S31520000000000000EBFEFFFFEA011081E2012082E200
S31120000010013083E2014084E2FAFFFFEA9F
S70520000000DA
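Since this format puts the names in plain text ahead of the S-records, finding main can be a simple line scan. The sketch below assumes every symbol line has the shape `name $hexaddress`, as in the listing above; `symbolsrec_lookup` is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Search symbolsrec text for a line of the form "name $hexaddress".
   Returns 0 and stores the address on success, -1 if the name is absent. */
static int symbolsrec_lookup(const char *text, const char *name,
                             unsigned long *addr)
{
    assert(text != NULL && name != NULL && addr != NULL);
    while (*text != '\0') {
        size_t len = strcspn(text, "\n");   /* length of the current line */
        char line[128], sym[64];
        unsigned long value;

        if (len < sizeof line) {            /* overlong lines are skipped */
            memcpy(line, text, len);
            line[len] = '\0';
            /* Header, "$$" and S-record lines fail this pattern. */
            if (sscanf(line, " %63s $%lx", sym, &value) == 2 &&
                strcmp(sym, name) == 0) {
                *addr = value;
                return 0;
            }
        }
        text += len;
        if (*text == '\n')
            text++;
    }
    return -1;
}
```

Once the address of main is known (0x20000008 here), the S3 records that follow can be searched for the record whose load address range covers it.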