How to read binary executable by instructions?

Posted 2019-07-27 00:09

Question:

Is there a way to programmatically read a given number of instructions from a binary executable file on the x86 architecture?

If I had a binary of a simple C program hello.c:

#include <stdio.h>

int main(){
    printf("Hello world\n");
    return 0;
}

After compiling it with gcc, the disassembled function main looks like this:

000000000000063a <main>:
 63a:   55                      push   %rbp
 63b:   48 89 e5                mov    %rsp,%rbp
 63e:   48 8d 3d 9f 00 00 00    lea    0x9f(%rip),%rdi        # 6e4 <_IO_stdin_used+0x4>
 645:   e8 c6 fe ff ff          callq  510 <puts@plt>
 64a:   b8 00 00 00 00          mov    $0x0,%eax
 64f:   5d                      pop    %rbp
 650:   c3                      retq   
 651:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 658:   00 00 00 
 65b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Is there an easy way in C to read, for example, the first three instructions of main (meaning the bytes 55, 48, 89, e5, 48, 8d, 3d, 9f, 00, 00, 00)? The function is not guaranteed to look like this: the first instructions may have entirely different opcodes and sizes.

Answer 1:

The code below prints the first 10 bytes of the main function: it takes the address of the function, converts it to a pointer to unsigned char, and prints each byte in hex.

This small snippet doesn't count instructions. For that you would need an instruction size table (not very difficult, just tedious unless you find one already made; see "What is the size of each asm instruction?") to work out the size of each instruction from its leading bytes. On x86 the first byte alone is not always enough, because prefixes and the ModRM/SIB bytes also change the length.

(Unless, of course, the processor you're targeting has a fixed instruction size, which makes the problem trivial to solve.)

Debuggers have to decode operands as well, but for operations like step or trace, I suspect they keep such a table handy to compute the next breakpoint address.

#include <stdio.h>

int main(){
    printf("Hello world\n");
    const unsigned char *start = (const unsigned char *)&main;
    int i;
    for (i=0;i<10;i++)
    {
       printf("%x\n",start[i]);
    }    
    return 0;
}

output:

Hello world
55
89
e5
83
e4
f0
83
ec
20
e8

seems to match the disassembly :)

00401630 <_main>:
  401630:   55                      push   %ebp
  401631:   89 e5                   mov    %esp,%ebp
  401633:   83 e4 f0                and    $0xfffffff0,%esp
  401636:   83 ec 20                sub    $0x20,%esp
  401639:   e8 a2 01 00 00          call   4017e0 <___main>
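
The instruction size table mentioned above can be sketched for a handful of opcodes. This is a toy, not a real decoder: it only knows the opcodes that appear in the two disassemblies in this thread, and it assumes 64-bit mode (bytes 0x40 to 0x4f are treated as REX prefixes, which is wrong for 32-bit code that uses them as inc/dec). The names insn_len, modrm_len and first_insns_len are made up for this example; for anything serious a disassembler library is the better choice.

```c
#include <stddef.h>

/* Bytes taken by the ModRM byte plus any SIB byte and displacement. */
static size_t modrm_len(const unsigned char *p)
{
    unsigned mod = p[0] >> 6, rm = p[0] & 7;
    size_t n = 1;                          /* the ModRM byte itself */

    if (mod == 3)
        return n;                          /* register operand, nothing follows */
    if (rm == 4) {                         /* a SIB byte follows */
        n += 1;
        if (mod == 0 && (p[1] & 7) == 5)
            n += 4;                        /* SIB without base: disp32 */
    } else if (mod == 0 && rm == 5) {
        n += 4;                            /* disp32 (RIP-relative in 64-bit mode) */
    }
    if (mod == 1)
        n += 1;                            /* disp8 */
    if (mod == 2)
        n += 4;                            /* disp32 */
    return n;
}

/* Length of the instruction at p, or 0 if the opcode is not in this toy table. */
size_t insn_len(const unsigned char *p)
{
    size_t n = 0;
    int rex_w = 0;

    if ((p[0] & 0xf0) == 0x40) {           /* REX prefix (assumption: 64-bit mode) */
        rex_w = (p[0] >> 3) & 1;
        n = 1;
    }
    unsigned char op = p[n++];

    if (op >= 0x50 && op <= 0x5f) return n;                    /* push/pop reg */
    if (op == 0xc3) return n;                                  /* ret */
    if (op == 0xe8) return n + 4;                              /* call rel32 */
    if (op >= 0xb8 && op <= 0xbf) return n + (rex_w ? 8 : 4);  /* mov reg, imm */
    if (op == 0x89 || op == 0x8b || op == 0x8d)                /* mov / lea with ModRM */
        return n + modrm_len(p + n);
    if (op == 0x83)                                            /* ALU r/m, imm8 */
        return n + modrm_len(p + n) + 1;
    return 0;
}

/* Total size of the first `count` instructions, or 0 on an unknown opcode. */
size_t first_insns_len(const unsigned char *p, size_t count)
{
    size_t total = 0;

    while (count--) {
        size_t n = insn_len(p + total);
        if (n == 0)
            return 0;
        total += n;
    }
    return total;
}
```

Calling first_insns_len(start, 3) on the bytes from the question gives the number of bytes to copy for the first three instructions; a return value of 0 means the table needs another entry.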


Answer 2:

.globl _start
_start:
    bl main
    b .

.globl main
main:
    add r1,#1
    add r2,#1
    add r3,#1
    add r4,#1
    b main

This is intentionally the wrong architecture (ARM rather than x86): the architecture doesn't matter, the file format does. I built this into an ELF file, which is a very popular format and is simply a file format. That is what I understood your question to be about: reading a file, not modifying the program so that it reads itself from memory at runtime.

Because the format is so popular, there are tools that read it, and you appear to know how to run them.

Disassembly of section .text:

00001000 <_start>:
    1000:   eb000000    bl  1008 <main>
    1004:   eafffffe    b   1004 <_start+0x4>

00001008 <main>:
    1008:   e2811001    add r1, r1, #1
    100c:   e2822001    add r2, r2, #1
    1010:   e2833001    add r3, r3, #1
    1014:   e2844001    add r4, r4, #1
    1018:   eafffffa    b   1008 <main>

If I hexdump the file, though:

00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 28 00 01 00 00 00  00 10 00 00 34 00 00 00  |..(.........4...|
00000020  c0 11 00 00 00 02 00 05  34 00 20 00 01 00 28 00  |........4. ...(.|
00000030  06 00 05 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 1c 10 00 00  1c 10 00 00 05 00 00 00  |................|
00000050  00 00 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2  |............. ..|
00001010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11 00 00  |.0...@......A...|
00001020  00 61 65 61 62 69 00 01  07 00 00 00 08 01 00 00  |.aeabi..........|
00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001040  00 00 00 00 00 10 00 00  00 00 00 00 03 00 01 00  |................|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 03 00 02 00  |................|
00001060  01 00 00 00 00 00 00 00  00 00 00 00 04 00 f1 ff  |................|
00001070  06 00 00 00 00 10 00 00  00 00 00 00 00 00 01 00  |................|
00001080  18 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
00001090  09 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010a0  17 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010b0  55 00 00 00 00 10 00 00  00 00 00 00 10 00 01 00  |U...............|
000010c0  23 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |#...............|
000010d0  2f 00 00 00 08 10 00 00  00 00 00 00 10 00 01 00  |/...............|
000010e0  34 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |4...............|
000010f0  3c 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |<...............|
00001100  43 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |C...............|
00001110  48 00 00 00 00 00 08 00  00 00 00 00 10 00 01 00  |H...............|
00001120  4f 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |O...............|
00001130  00 73 6f 2e 6f 00 24 61  00 5f 5f 62 73 73 5f 73  |.so.o.$a.__bss_s|
00001140  74 61 72 74 5f 5f 00 5f  5f 62 73 73 5f 65 6e 64  |tart__.__bss_end|
00001150  5f 5f 00 5f 5f 62 73 73  5f 73 74 61 72 74 00 6d  |__.__bss_start.m|
00001160  61 69 6e 00 5f 5f 65 6e  64 5f 5f 00 5f 65 64 61  |ain.__end__._eda|
00001170  74 61 00 5f 65 6e 64 00  5f 73 74 61 63 6b 00 5f  |ta._end._stack._|
00001180  5f 64 61 74 61 5f 73 74  61 72 74 00 00 2e 73 79  |_data_start...sy|
00001190  6d 74 61 62 00 2e 73 74  72 74 61 62 00 2e 73 68  |mtab..strtab..sh|
000011a0  73 74 72 74 61 62 00 2e  74 65 78 74 00 2e 41 52  |strtab..text..AR|
000011b0  4d 2e 61 74 74 72 69 62  75 74 65 73 00 00 00 00  |M.attributes....|
000011c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000011e0  00 00 00 00 00 00 00 00  1b 00 00 00 01 00 00 00  |................|
000011f0  06 00 00 00 00 10 00 00  00 10 00 00 1c 00 00 00  |................|
00001200  00 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00001210  21 00 00 00 03 00 00 70  00 00 00 00 00 00 00 00  |!......p........|
00001220  1c 10 00 00 12 00 00 00  00 00 00 00 00 00 00 00  |................|
00001230  01 00 00 00 00 00 00 00  01 00 00 00 02 00 00 00  |................|
00001240  00 00 00 00 00 00 00 00  30 10 00 00 00 01 00 00  |........0.......|
00001250  04 00 00 00 05 00 00 00  04 00 00 00 10 00 00 00  |................|
00001260  09 00 00 00 03 00 00 00  00 00 00 00 00 00 00 00  |................|
00001270  30 11 00 00 5c 00 00 00  00 00 00 00 00 00 00 00  |0...\...........|
00001280  01 00 00 00 00 00 00 00  11 00 00 00 03 00 00 00  |................|
00001290  00 00 00 00 00 00 00 00  8c 11 00 00 31 00 00 00  |............1...|
000012a0  00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
000012b0

You can google the file format and find a lot of information on Wikipedia, with a bit more at one of the links there.

useful header information

00 10 00 00 entry
34 00 00 00 phoff
c0 11 00 00 shoff
00 02 00 05 flags
34 00 ehsize
20 00 phentsize
01 00 phnum
28 00 shentsize
06 00 shnum
05 00 shstrndx
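
Reading those fields in C could look like the sketch below. It decodes the bytes by hand at the fixed offsets of a 32-bit little-endian ELF header, so it does not depend on the host's elf.h or byte order. The struct and function names (elf32_info, parse_elf32_header) are invented for this example, and 64-bit or big-endian files are simply rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Little-endian reads from a byte buffer. */
static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* The header fields listed above (hypothetical struct for this sketch). */
struct elf32_info {
    uint32_t entry, phoff, shoff, flags;
    uint16_t ehsize, phentsize, phnum, shentsize, shnum, shstrndx;
};

/* Returns 0 on success, -1 if buf is not a 32-bit little-endian ELF header. */
int parse_elf32_header(const unsigned char *buf, size_t len,
                       struct elf32_info *out)
{
    if (len < 52 || memcmp(buf, "\x7f" "ELF", 4) != 0)
        return -1;
    if (buf[4] != 1 || buf[5] != 1)     /* ELFCLASS32, ELFDATA2LSB */
        return -1;

    out->entry     = rd32(buf + 24);
    out->phoff     = rd32(buf + 28);
    out->shoff     = rd32(buf + 32);
    out->flags     = rd32(buf + 36);
    out->ehsize    = rd16(buf + 40);
    out->phentsize = rd16(buf + 42);
    out->phnum     = rd16(buf + 44);
    out->shentsize = rd16(buf + 46);
    out->shnum     = rd16(buf + 48);
    out->shstrndx  = rd16(buf + 50);
    return 0;
}
```

Fed the first 52 bytes of the hexdump, this yields shoff 0x11c0, shentsize 0x28 and shnum 6, which is all that is needed to find the section headers.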

So if I look at the section headers, there are shnum of them, each shentsize (0x28) bytes long, starting at shoff (0x11C0). Only the first 20 bytes of each are shown:

0x11C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
0x1210 21 00 00 00 03 00 00 70 00 00 00 00 00 00 00 00 1c 10 00 00
0x1238 01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 30 10 00 00
0x1260 09 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 30 11 00 00
0x1288 11 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 8c 11 00 00
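
A minimal sketch of pulling one section header out of the file image, assuming the 32-bit little-endian layout where each entry starts with six 32-bit words: name, type, flags, addr, offset, size. The names shdr32 and read_shdr32 are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* First six words of a 32-bit section header (hypothetical struct). */
struct shdr32 {
    uint32_t name, type, flags, addr, offset, size;
};

/* Read section header number `index` from a whole-file buffer.
 * shoff, shentsize and shnum come from the ELF header.
 * Returns 0 on success, -1 if the index or the offsets are out of range. */
int read_shdr32(const unsigned char *file, size_t len, uint32_t shoff,
                uint16_t shentsize, uint16_t shnum, uint16_t index,
                struct shdr32 *out)
{
    if (index >= shnum || shentsize < 40)
        return -1;

    size_t pos = (size_t)shoff + (size_t)index * shentsize;
    if (pos > len || len - pos < 40)
        return -1;

    const unsigned char *p = file + pos;
    out->name   = rd32(p);
    out->type   = rd32(p + 4);
    out->flags  = rd32(p + 8);
    out->addr   = rd32(p + 12);
    out->offset = rd32(p + 16);
    out->size   = rd32(p + 20);
    return 0;
}
```

Looping index from 0 to shnum - 1 and checking type (1 for PROGBITS, 2 for SYMTAB, 3 for STRTAB) finds the three sections used in the rest of this answer.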

The entry at 0x1260 is of type strtab (3) with file offset 0x1130 and size 0x5c. Its contents are broken into null-terminated strings until you hit a double null:

[0] 00
[1] 73 6f 2e 6f 00 so.o
[2] 24 61 00 $a
[3] 5f 5f 62 73 73 5f 73 74 61 72 74 5f 5f 00 __bss_start__
[4] 5f 5f 62 73 73 5f 65 6e 64 5f 5f 00 __bss_end__
[5] 5f 5f 62 73 73  5f 73 74 61 72 74 00 __bss_start
[6] 6d 61 69 6e 00 main
...

main is at address 0x115F in the file which is offset 0x2F in the strtab.
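
Finding that offset programmatically is a walk over the null-terminated strings. The sketch below uses an invented function name, strtab_find, and only matches names that begin at the start of a string; a linker may also point a symbol into the middle of a longer string, which this does not handle.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of `name` inside the string table, or -1 if it is not
 * found at the start of any string. */
long strtab_find(const unsigned char *strtab, size_t size, const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    while (off < size) {
        const unsigned char *end = memchr(strtab + off, 0, size - off);
        if (end == NULL)
            break;                          /* unterminated tail, stop */
        size_t n = (size_t)(end - (strtab + off));
        if (n == want && memcmp(strtab + off, name, n) == 0)
            return (long)off;
        off += n + 1;                       /* skip the string and its NUL */
    }
    return -1;
}
```

With the string table above, looking up "main" returns 0x2f, which is the value to search for in the symbol table.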

The entry at 0x1238 is the symtab. It starts at file offset 0x1030, with 0x10 (16) bytes per entry:

00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001040  00 00 00 00 00 10 00 00  00 00 00 00 03 00 01 00  |................|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 03 00 02 00  |................|
00001060  01 00 00 00 00 00 00 00  00 00 00 00 04 00 f1 ff  |................|
00001070  06 00 00 00 00 10 00 00  00 00 00 00 00 00 01 00  |................|
00001080  18 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
00001090  09 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010a0  17 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010b0  55 00 00 00 00 10 00 00  00 00 00 00 10 00 01 00  |U...............|
000010c0  23 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |#...............|
000010d0  2f 00 00 00 08 10 00 00  00 00 00 00 10 00 01 00  |/...............|
000010e0  34 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |4...............|
000010f0  3c 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |<...............|
00001100  43 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |C...............|
00001110  48 00 00 00 00 00 08 00  00 00 00 00 10 00 01 00  |H...............|
00001120  4f 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |O...............|

The entry at 000010d0 begins with 2f 00 00 00, the 0x2f offset into the string table, so this is main. The next word of the entry is the address, 08 10 00 00 or 0x1008 in the processor's memory. Unfortunately, due to the values I chose, it happens to also be the file offset; don't get the two confused.
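
The same lookup in C, as a sketch: scan the 16-byte entries for the one whose first word equals the string table offset, then take the second word as the address. The function name symtab_value is made up, and the layout assumed is the 32-bit little-endian one shown in the dump.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Find the symbol whose name offset is name_off and store its value
 * (the target address) in *value. Returns 0 if found, -1 otherwise. */
int symtab_value(const unsigned char *symtab, size_t size,
                 uint32_t name_off, uint32_t *value)
{
    for (size_t off = 0; off + 16 <= size; off += 16) {
        if (rd32(symtab + off) == name_off) {   /* word 0: name offset */
            *value = rd32(symtab + off + 4);    /* word 1: address */
            return 0;
        }
    }
    return -1;
}
```

Passing the symbol table bytes and 0x2f gives 0x1008 for this file.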

The section that holds the code is type 00000001, PROGBITS:

0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
It sits at offset 0x1000 in the file and is 0x1C bytes long.

Here is the program, the machine code:

00001000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2
00001010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11

So, starting at memory address 0x1008, which is 8 bytes after the start of the section at 0x1000 (also the entry point here; unfortunately I picked a bad address to use), we need to go 0x8 bytes into this data:

01 10 81 e2 01 20 82 e2

00001008 <main>:
    1008:   e2811001    add r1, r1, #1
    100c:   e2822001    add r2, r2, #1
    1010:   e2833001    add r3, r3, #1

This is all very file-format dependent. The CPU couldn't care less about labels; main only means something to the humans, not the CPU.

If I convert the elf into other formats which are perfectly executable:

Motorola S-record:

S00A0000736F2E7372656338
S1131000000000EBFEFFFFEA011081E2012082E212
S10F1010013083E2014084E2FAFFFFEAB1
S9031000EC
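
For the S-record form, each S1 line carries a byte count, a 16-bit address, the data bytes and a checksum, so the bytes at a given address can be recovered without any symbol information. The sketch below decodes a single S1 line; the function names are invented for this example and the other record types (S0, S2, S3, S7 to S9) are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

static int hexbyte(const char *s)
{
    int hi = hexval(s[0]);
    int lo = (hi < 0) ? -1 : hexval(s[1]);
    return (lo < 0) ? -1 : hi * 16 + lo;
}

/* Decode one S1 record. Stores the load address in *addr and the data bytes
 * in data. Returns the number of data bytes, or -1 on any error. */
int srec_s1_decode(const char *line, uint32_t *addr,
                   unsigned char *data, size_t cap)
{
    size_t len = strlen(line);
    if (len < 4 || line[0] != 'S' || line[1] != '1')
        return -1;

    int count = hexbyte(line + 2);          /* address + data + checksum */
    if (count < 3 || len < 4 + 2 * (size_t)count)
        return -1;

    unsigned char raw[255];
    unsigned sum = (unsigned)count;
    for (int i = 0; i < count; i++) {
        int b = hexbyte(line + 4 + 2 * i);
        if (b < 0)
            return -1;
        raw[i] = (unsigned char)b;
        sum += (unsigned)b;
    }
    if ((sum & 0xff) != 0xff)               /* checksum makes the low byte 0xff */
        return -1;

    int n = count - 3;                      /* drop 2 address bytes and checksum */
    if ((size_t)n > cap)
        return -1;
    *addr = ((uint32_t)raw[0] << 8) | raw[1];
    memcpy(data, raw + 2, (size_t)n);
    return n;
}
```

Decoding the first S1 line above gives 16 bytes at address 0x1000, and the bytes for address 0x1008 start at index 8 of that data.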

raw binary image

hexdump -C so.bin
00000000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2  |............. ..|
00000010  01 30 83 e2 01 40 84 e2  fa ff ff ea              |.0...@......|
0000001c

The instruction bytes of interest are of course there, but the symbol information isn't. Whether you can 1) find "main" and then 2) print out the first few bytes at that address depends on the file format you are interested in.

Hmm, a bit disturbing: if you link for 0x2000, GNU ld burns some disk space and puts the code at file offset 0x2000. Choose 0x20000000 and it burns more disk space, but nowhere near that much:

000100d0  2f 00 00 00 08 00 00 20  00 00 00 00 10 00 01 00 

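The symbol entry gives the address of main in target space as 0x20000008, but in the file the code sits near offset 0x10000: the line at file offset 0x010010 below holds the instructions from target address 0x20000010 onward.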

00010010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11 00 00
00010020  00 61 65 61 62 69 00 01  07 00 00 00 08 01

This is just to demonstrate and reinforce that the file offset and the target memory address are two different things.
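
The translation between the two is one subtraction and one addition, using the addr, offset and size fields of the section that contains the address. A minimal sketch follows, with an invented function name; note that the 0x10000 section offset used for the second layout in the usage note is inferred from the dump line above, not stated outright.

```c
#include <stdint.h>

/* Convert a target address to a file offset using the containing section's
 * addr, offset and size fields. Returns 0 on success, -1 if the address is
 * outside the section. */
int addr_to_file_offset(uint32_t addr, uint32_t sh_addr, uint32_t sh_offset,
                        uint32_t sh_size, uint32_t *file_off)
{
    if (addr < sh_addr || addr - sh_addr >= sh_size)
        return -1;
    *file_off = sh_offset + (addr - sh_addr);
    return 0;
}
```

For the first link, address 0x1008 with section address 0x1000 and section offset 0x1000 maps to file offset 0x1008. For the second, address 0x20000008 with section address 0x20000000 and section offset 0x10000 maps to file offset 0x10008.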

This is a very nice format for what you want to do:

arm-none-eabi-objcopy -O symbolsrec so.elf so.srec
cat so.srec
$$ so.srec
  $a $20000000
  _bss_end__ $2001001c
  __bss_start__ $2001001c
  __bss_end__ $2001001c
  _start $20000000
  __bss_start $2001001c
  main $20000008
  __end__ $2001001c
  _edata $2001001c
  _end $2001001c
  _stack $80000
  __data_start $2001001c
$$ 
S0090000736F2E686578A1
S31520000000000000EBFEFFFFEA011081E2012082E200
S31120000010013083E2014084E2FAFFFFEA9F
S70520000000DA
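
Looking up main in that output is plain text processing. The sketch below assumes the layout seen above, one "name $hexaddress" pair per line, which is read off this output rather than taken from a format specification; symbolsrec_lookup is a made-up name.

```c
#include <stdio.h>
#include <string.h>

/* Search symbolsrec text for `name` and store its address in *addr.
 * Returns 0 if found, -1 otherwise. */
int symbolsrec_lookup(const char *text, const char *name, unsigned long *addr)
{
    const char *p = text;

    while (*p) {
        char sym[128];
        unsigned long val;

        /* A symbol line looks like "  main $20000008". */
        if (sscanf(p, " %127s $%lx", sym, &val) == 2 && strcmp(sym, name) == 0) {
            *addr = val;
            return 0;
        }
        p = strchr(p, '\n');                /* move on to the next line */
        if (p == NULL)
            break;
        p++;
    }
    return -1;
}
```

With the address in hand (0x20000008 for main here), the S3 records that follow supply the bytes: each one carries a 32-bit load address followed by its data.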