I'm working on an x86 asm obfuscator that takes Intel-syntax code as a string and outputs an equivalent, obfuscated sequence of opcodes.
Here's an example:
mov eax, 0x5523
or eax, [ebx]
push eax
call someAPI
Becomes something like:
mov eax, 0xFFFFFFFF ; mov eax, 0x5523
and eax, 0x5523 ;
push [ebx] ; or eax, [ebx]
or [esp], eax ;
pop eax ;
push 12345h ; push eax
mov [esp], eax ;
call getEIP ; call someAPI
getEIP: ;
add [esp], 9 ;
jmp someAPI ;
This is just an example; I haven't checked whether it screws up the flags (it probably does).
Right now I have an XML document that lists instruction templates (e.g. push e*x) and a list of replacement instructions that can be used.
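A template table like that could be loaded and applied roughly as in the sketch below. The XML layout (`rule`, `template`, `replacement`, `line` elements), the reading of `*` as "any single character", and the `{0}` placeholder for the matched operand are all assumptions made for illustration, not the actual schema described above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: '*' in a template matches one character and
# '{0}' in a replacement line is filled with the matched operand.
RULES_XML = """
<rules>
  <rule>
    <template>push e*x</template>
    <replacement>
      <line>push 12345h</line>
      <line>mov [esp], {0}</line>
    </replacement>
  </rule>
</rules>
"""

def template_to_regex(template):
    """Turn a template such as 'push e*x' into a regex capturing the operand."""
    mnemonic, operand = template.split(None, 1)
    operand_re = re.escape(operand).replace(r"\*", r"\w")
    return re.compile(rf"^\s*{re.escape(mnemonic)}\s+({operand_re})\s*$",
                      re.IGNORECASE)

def load_rules(xml_text):
    """Return a list of (compiled pattern, replacement lines) pairs."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        pattern = template_to_regex(rule.findtext("template").strip())
        lines = [line.text.strip() for line in rule.iter("line")]
        rules.append((pattern, lines))
    return rules

def rewrite(instruction, rules):
    """Replace one instruction by the first matching rule, or keep it as-is."""
    for pattern, lines in rules:
        match = pattern.match(instruction)
        if match:
            return [line.format(match.group(1)) for line in lines]
    return [instruction]
```

Calling `rewrite("push eax", load_rules(RULES_XML))` would expand the push into the two-instruction form from the example, while an instruction that matches no template is returned unchanged.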
What I'm looking for is a way to automatically generate opcode sequences that produce the same result as a given input. I don't mind doing an educated brute force, but I'm not sure how I'd approach this.
What you need is an algebraic description of what the opcodes do, and a set of algebraic laws that allow you to determine equivalent operations.
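Then, for each instruction, you look up its algebraic description. For the sake of an example, take an XOR of some register (written reg here) with mem[ecx], whose algebraic equivalent is
reg exclusive_or mem[ecx]
You then enumerate algebraic equivalences using those laws, such as:
a exclusive_or b ==> (a and not b) or (b and not a)
to generate an equivalent algebraic statement for your XOR instruction:
(reg and not mem[ecx]) or (mem[ecx] and not reg)
You may apply more algebraic laws to this, for instance De Morgan's theorem:
a or b ==> not (not a and not b)
to get
not (not (reg and not mem[ecx]) and not (mem[ecx] and not reg))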
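The XOR expansion and the De Morgan step can be tried out mechanically. The sketch below is one possible encoding, not part of any existing tool: expressions are nested tuples, `reg` is a placeholder name for whichever register the XOR targets, and the function names are invented for illustration. Because every operator involved acts on each bit independently, checking the four one-bit input combinations is enough to confirm equivalence.

```python
from itertools import product

# Expressions: ("xor", a, b), ("and", a, b), ("or", a, b), ("not", a),
# or a string naming an operand.

def expand_xor(e):
    """a xor b  ==>  (a and not b) or (b and not a)"""
    if isinstance(e, tuple) and e[0] == "xor":
        _, a, b = e
        return ("or", ("and", a, ("not", b)), ("and", b, ("not", a)))
    return e

def de_morgan(e):
    """a or b  ==>  not (not a and not b)"""
    if isinstance(e, tuple) and e[0] == "or":
        _, a, b = e
        return ("not", ("and", ("not", a), ("not", b)))
    return e

def evaluate(e, env):
    """Evaluate an expression on single-bit operand values (0 or 1)."""
    if isinstance(e, str):
        return env[e]
    op, *args = e
    vals = [evaluate(a, env) for a in args]
    if op == "not":
        return 1 - vals[0]
    if op == "and":
        return vals[0] & vals[1]
    if op == "or":
        return vals[0] | vals[1]
    if op == "xor":
        return vals[0] ^ vals[1]
    raise ValueError(f"unknown operator: {op}")

def equivalent(e1, e2, names=("reg", "mem[ecx]")):
    """Compare two expressions on every combination of one-bit inputs."""
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, bits))
        if evaluate(e1, env) != evaluate(e2, env):
            return False
    return True

original = ("xor", "reg", "mem[ecx]")
step1 = expand_xor(original)   # (reg and not m) or (m and not reg)
step2 = de_morgan(step1)       # not (not (...) and not (...))
```

With more laws in the table, the same `equivalent` check can serve as a safety net for a brute-force search: generate candidates by applying laws at random and discard any that fail the check.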
At this point you have a specification of an algebraic computation that will do the same thing as the original. There's your brute force.
Now you have to "compile" this to machine instructions by matching what the available instructions do against what this expression says. Like any compiler, you likely want to optimize the generated code (there is no point in fetching mem[ecx] twice). (All of this is hard... it's a code generator!) The resulting code sequence would load mem[ecx] into a scratch register once and then build the result using only NOT, AND, and OR instructions in place of the XOR.
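A candidate sequence of this kind can be checked by emulating it. The sketch below assumes, purely for illustration, that the XOR targets eax and that ebx and edx are free to use as scratch registers; the tiny emulator handles only the four instructions it needs and models memory as a dictionary holding one 32-bit value per address.

```python
import random

MASK = 0xFFFFFFFF

# Assumed replacement for "xor eax, [ecx]"; ebx and edx are scratch registers.
SEQUENCE = [
    ("mov", "ebx", "[ecx]"),  # fetch the memory operand once
    ("mov", "edx", "ebx"),
    ("not", "edx"),           # edx = ~m
    ("and", "edx", "eax"),    # edx = eax & ~m
    ("not", "eax"),           # eax = ~eax
    ("and", "eax", "ebx"),    # eax = ~eax & m
    ("or", "eax", "edx"),     # eax = (~eax & m) | (eax & ~m)
]

def read(operand, regs, mem):
    """Read a register, or the memory cell addressed by '[register]'."""
    if operand.startswith("["):
        return mem[regs[operand[1:-1]]]
    return regs[operand]

def run(sequence, regs, mem):
    """Execute the sequence on a copy of the register file and return it."""
    regs = dict(regs)
    for op, dst, *src in sequence:
        if op == "mov":
            regs[dst] = read(src[0], regs, mem)
        elif op == "not":
            regs[dst] = ~regs[dst] & MASK
        elif op == "and":
            regs[dst] &= read(src[0], regs, mem)
        elif op == "or":
            regs[dst] |= read(src[0], regs, mem)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs

def matches_xor(trials=1000, seed=0):
    """Compare the sequence against a real XOR on random 32-bit inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = rng.getrandbits(32)
        regs = {"eax": rng.getrandbits(32), "ebx": 0, "ecx": 0x1000, "edx": 0}
        mem = {0x1000: value}
        if run(SEQUENCE, regs, mem)["eax"] != regs["eax"] ^ value:
            return False
    return True
```

The emulator ignores flags. In this particular sequence the last instruction is an OR, which clears CF and OF and sets SF, ZF, and PF from the result just as XOR does, but a sequence ending in a different instruction would need that checked separately. The clobbered scratch registers also have to be dead at that point in the real program.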
This is a lot of machinery to build manually.
Another way to do this is to take advantage of a program transformation system that allows you to apply source-to-source transformations to code. Then you can encode "equivalences" as rewrites directly on the code.
One of these tools is our DMS Software Reengineering Toolkit.
DMS takes a language definition (essentially as an EBNF) and automatically implements a parser, AST builder, and prettyprinter (an anti-parser, turning the AST back into valid source text). [DMS doesn't presently have an EBNF for ASM86, but dozens of EBNFs for various complex languages have been built for DMS, including several for miscellaneous non-x86 assemblers. So you'd have to define the ASM86 EBNF to DMS. This is pretty straightforward; DMS has a really strong parser generator.]
Using that, DMS will let you write source transformations directly on the code. You could write transformations that implement the XOR equivalent and De Morgan's law directly as rewrite rules, with some additional magic in a meta-procedure called "free_register" that determines which registers are free at that point (of the AST match) in the code. (If you don't want to do that, use the top of the stack as a temporary everywhere, as you did in your example.)
You'd need a bunch of rewrites to cover all the cases that you want to obfuscate, with their combinatorics with registers and memory operands.
Then the transformation engine can be asked to apply these transformations randomly once or more than once at each point in the code to scramble it.
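The idea of applying rewrites at random can be sketched in plain Python, independent of DMS. Everything here is hypothetical: instructions are tuples of strings, `free_register` is a crude stand-in that treats any register mentioned anywhere in the program as busy (a real implementation would need liveness analysis), and `rewrite_xor` is one hand-written rule using a single temporary.

```python
import random
import re

REGISTERS = ("eax", "ebx", "ecx", "edx", "esi", "edi")

def registers_in(program):
    """Registers mentioned anywhere in the program (a conservative 'busy' set)."""
    text = " ".join(" ".join(ins) for ins in program)
    return {r for r in REGISTERS if re.search(rf"\b{r}\b", text)}

def free_register(program, rng):
    """Hypothetical helper: pick a register the program never mentions."""
    busy = registers_in(program)
    candidates = [r for r in REGISTERS if r not in busy]
    return rng.choice(candidates) if candidates else None

def rewrite_xor(ins, program, rng):
    """xor r, m  ==>  an equivalent NOT/AND/OR sequence using one temporary."""
    if len(ins) != 3 or ins[0] != "xor":
        return None
    _, r, m = ins
    # Only handle a register destination that the source operand does not use.
    if r not in REGISTERS or r in m:
        return None
    t = free_register(program, rng)
    if t is None:
        return None
    return [("mov", t, m), ("not", t), ("and", t, r),
            ("not", r), ("and", r, m), ("or", r, t)]

def obfuscate(program, rules, rng, probability=0.5, passes=1):
    """At each instruction, maybe apply a randomly chosen rule; repeat per pass."""
    for _ in range(passes):
        out = []
        for ins in program:
            replacement = None
            if rng.random() < probability:
                replacement = rng.choice(rules)(ins, program, rng)
            out.extend(replacement if replacement is not None else [ins])
        program = out
    return program
```

For example, `obfuscate(program, [rewrite_xor], random.Random(), passes=2)` makes two passes over a list of instruction tuples. With more rules in the list, later passes can rewrite the output of earlier ones, which is where most of the scrambling comes from.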
You can see a fully worked example of some algebraic transforms being applied with DMS.