We all know that on the x86 architecture, data and code are mixed together in memory and on disk. But how can you tell them apart?
I need the method for a paper, so I don't expect 100% accuracy. Around 80% would be fine; even just some ideas would help :)
Statistically determine which instructions are common in executables.
E.g. some of the common ones may be add, subtract, etc.
Treat the unknown byte sequence as machine code and look at the frequency of the various instructions used (here you can probably assume that decoding starts at a correct instruction boundary).
If an invalid instruction turns up, it is obviously not machine code.
Otherwise, see whether the percentage frequency of commands used matches what would be usual.
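The frequency comparison described above could be sketched roughly as follows. This is only an illustration, not a tuned classifier: it assumes the mnemonics have already been extracted by some disassembler, the function names `mnemonic_profile` and `profile_similarity` are made up for this example, and cosine similarity is just one possible way to measure a loose fit. The reference profile is built from a sample you already know to be code, and the two lists at the bottom are toy inputs.

```python
from collections import Counter
from math import sqrt


def mnemonic_profile(mnemonics):
    """Return {mnemonic: relative frequency} for a list of decoded mnemonics."""
    counts = Counter(m.lower() for m in mnemonics)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def profile_similarity(p, q):
    """Cosine similarity between two frequency profiles (0.0 to 1.0)."""
    dot = sum(p[m] * q.get(m, 0.0) for m in p)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Toy inputs (assumption): in practice both lists come from a disassembler,
# one run over known code and one over the unknown bytes.
known_code = ["mov", "mov", "push", "call", "add", "mov", "ret"]
candidate = ["mov", "push", "mov", "call", "ret", "sub"]

score = profile_similarity(mnemonic_profile(known_code),
                           mnemonic_profile(candidate))
print(f"similarity: {score:.2f}")
```

A score close to 1 means the candidate uses instructions in roughly the same proportions as the known code; where to put the cut-off is something you would have to calibrate on your own samples.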
Also, when an instruction takes operands that refer to locations (e.g. registers or memory/data addresses), record them. Then check whether the same locations are accessed again nearby.
This can be done by sorting the locations by frequency of use, descending, and seeing whether the shape of the decreasing frequency curve roughly matches what is usual for code.
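A minimal sketch of the locality check, assuming the operand locations have already been pulled out of the disassembly as a list of integers in instruction order. The names `nearby_reuse_ratio` and `rank_frequency` and the default window of 16 accesses are hypothetical choices for this example, not established values.

```python
from collections import Counter, deque


def nearby_reuse_ratio(addresses, window=16):
    """Fraction of accesses whose location was also used within the
    previous `window` accesses."""
    recent = deque(maxlen=window)  # sliding window of the latest locations
    hits = 0
    for addr in addresses:
        if addr in recent:
            hits += 1
        recent.append(addr)
    return hits / len(addresses) if addresses else 0.0


def rank_frequency(addresses):
    """Usage counts sorted in descending order (the 'shape' of the curve)."""
    return [count for _, count in Counter(addresses).most_common()]


# Toy input (assumption): a few locations that are touched repeatedly.
accesses = [0x1000, 0x1004, 0x1000, 0x1000, 0x2000, 0x1004]
print(nearby_reuse_ratio(accesses))
print(rank_frequency(accesses))
```

Real code tends to come back to a handful of locations, so one would expect a high reuse ratio and a steeply falling rank/frequency list, whereas operands decoded from arbitrary data should look much flatter.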
Data (non-machine code) is unlikely to match these statistical tests.
Do note that when I say "fit", a very loose fit is enough. Even if it is quite far from what is normal, it is probably still code, unless there is almost no statistical correlation at all.
See the question "Is all data valid x86 16-bit machine code?".
Disassemble the file with ndisasm (program here stands for whatever binary you want to check):
ndisasm -b 32 program > program.dump
(use 16, 32 or 64 when applicable, of course)
cut -b29- < program.dump > program.dump2
grep -v '^$' < program.dump2 > program.asm
grep -l '^db' < program.asm > /dev/null; echo $?
If you get 0, it is not all instructions (grep found something). If you don't, it is :)
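Since the question only asks for around 80% accuracy, the all-or-nothing check could be softened into a ratio. The following is a rough sketch that works on the text of an ndisasm listing (offset, hex bytes, instruction) and reports what share of the lines are `db` lines. The function name `db_fraction` is made up for this example, and the sample listing is hand-written in ndisasm's layout rather than real tool output.

```python
def db_fraction(listing):
    """Share of lines in an ndisasm-style listing that are 'db' lines."""
    total = 0
    undecoded = 0
    for line in listing.splitlines():
        # Expected fields: offset, hex bytes, instruction text.
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # blank line or hex continuation line
        total += 1
        if fields[2].split()[0] == "db":
            undecoded += 1
    return undecoded / total if total else 0.0


# Hand-written sample in ndisasm's column layout (assumption, not real output).
sample = """\
00000000  55                push bp
00000001  89E5              mov bp,sp
00000003  0F                db 0x0f
"""
print(f"{db_fraction(sample):.2f}")
```

A low fraction suggests the bytes mostly decode as instructions; what threshold separates code from data is left for you to calibrate.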