“Finding all the code in a given binary is equival

2020-03-01 16:30发布

问题:

Was just reading the highly voted question regarding Emulators and the statement

It's been proven that finding all the code in a given binary is equivalent to the Halting problem.

Really stuck out at me.

Surely that can't be true? Isn't it just a large dependency graph?

Would be really grateful for some further insight into this statement.

回答1:

I believe what is meant is "finding all code that is ever executed", i.e. finding coverage, possibly in combination with dynamically generated code. That can indeed be reduced to the halting problem.

Say that you have a perfect coverage tool that will find every piece of code in a program that may ever be executed (so the rest is dead code). Given a program P, this tool would also be able to decide whether the extended program (P ; halt) ever executes the halt instruction, or whether the halt part is dead code. So, it would solve the halting problem, which we know is undecidable.



回答2:

I disagree with larsman.

The halting problem says that no program P exists that can take any program and decide whether that program executes the halt instruction. Let me quote wikipedia:

Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist. We say that the halting problem is undecidable over Turing machines.

On the other hand we're not trying to make such program/algorithm, but we're trying to find all the code in this/these specific program(s). If we reverse-engineer the program and see that it immediately calls exit() (very optimistic example situation) we have proven that it will call halt, while it was impossible?!

If we we're trying to build an emulator that can run any program we would fail since then you can (easily) reduce that to the Halting problem. But usually you are building an emulator for something like a Game Boy which supports a finite amount of game cartridges (programs) and thus it is possible.



回答3:

The halting problem for finite state machines is solvable (although it might take many lifetimes.....of the universe), and any machine you might be emulating is a finite state machine. Just run the program, and the number of steps is bounded by the number of possible states; if that number is exceeded without halting then the program will never halt, since it must be in a loop.

Practically speaking, finding all code is a much easier problem unless the code can use computed gotos. Rather than running the code, you simply take all branches exactly once at each possible branch point.



回答4:

I agree with Larsman, and I believe the argument can be made precise. First, I have to agree that "finding all the code in a given binary" means, in this context, having a single computable function that determines which bytes within an input binary correspond to instructions that are executed.

EDIT: If anyone is not clear on why there is a problem here, think about obfuscated machine code. Disassembly of such code is a non-trivial problem. If you begin disassembly in the middle of a multi-byte instruction, you get the wrong disassembly. This breaks not only the instruction in question, but typically a few instructions beyond that. etc. look into it. (for example, google "obfuscation and disassembly").

Sketch of strategy to make this precise:

First, define a theoretical computer / machine model in which a program is encoded in multi-bit binary instructions, much like machine code on "real" computers, but made precise (and such that it is turing complete). Assume that the binary encodes the program and all input. This must all be made precise, but assume that the language has a (binary encoded) halt instruction, and that a program halts if and only if it executes a 'halt' instruction.

First, let us assume that the machine is not able to alter the program code, but has a memory in which to work. (assumed infinite in interesting cases).

Then any given bit will either be at the start of an instruction that is run during execution of the program, or it will not. (e.g. it may be in the middle of an instruction, or in data or "junk code" -- that is, code that will never run. Note that I have not claimed these are mutually exclusive, as, for example, a jump instruction can have a target that is in the middle of a particular instruction, thereby creating another instruction that, "on first pass" did not look like an executed instruction.).

Define ins(i, j) to be the function that returns 1 if the binary i has an instruction beginning at bit-position j, that executes in a run of the program encoded by i. Define ins(i,j) to be 0 otherwise. suppose ins(i,j) is computable.

Let halt_candidate(i) be the following program:

for j = 1 to length(i)
  if(i(j..j+len('halt')-1) == 'halt')
    if(ins(i,j) == 1)
      return 1
return 0

Since we disallow self-modifying code above, they only way for a program to halt is for there to be a 'halt' instruction at some position j that gets executed. By design, halt_candidate(i) computes the halting function.

This provides a very rough sketch of one direction of the proof. i.e. if we assume we can test whether program i has an instruction at location j for all j, then we can code the halting function.

For the other direction, one must encode an emulator of the formal machine, within the formal machine. Then, create a program plus inputs i and j which emulates program i and when an instruction at bit position j is executed, it returns 1 and halts. When any other instruction is executed it continues to run, and if the simulation ever simulates a 'halt' function in i, the emulator jumps to an infinite loop. Then ins(i,j) is equivalent to halt(emulator(i,j)), and so the halting problem implies the find code problem.

Of course one must assume a theoretical computer for this to be equivalent to the famously unsolvable halting problem. Otherwise, for a "real" computer, the halting problem is solvable but intractable.

For a system that allows self-modifying code the argument is more complex, but I expect not that different.

EDIT: I believe a proof in the self-modifying case would be to implement an emulator of the system-plus-self-modifying-code in the static code plus data system above. Then halting with self-modifying code allowed for program i with data x, is the same as halting in the static code plus data system above with the binary containing the emulator, i, and x as code plus data.