So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
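For most CPUs with variable-length instructions - and the Z80 falls in this category, as does the Game Boy's Z80-like CPU - the length of an instruction is implicit: nothing in the byte stream marks where one instruction ends and the next begins.
That is, you must decode the opcode in order to figure out how many operand bytes follow it, and therefore where the next instruction starts.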
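To make that concrete, here is a minimal sketch in Python. The table and the `split_instructions` helper are my own illustration, not anything from a real emulator, and the table only covers a handful of Game Boy opcodes; a real one needs all 256 plus the CB page.

```python
# Partial table: opcode -> (mnemonic, total length in bytes).
# Only a few Game Boy opcodes are listed here, purely for illustration.
OPCODES = {
    0x00: ("NOP", 1),
    0x06: ("LD B,d8", 2),
    0x3E: ("LD A,d8", 2),
    0xAF: ("XOR A", 1),
    0xC3: ("JP a16", 3),
    0xC9: ("RET", 1),
    0xCD: ("CALL a16", 3),
}


def split_instructions(code, start=0):
    """Walk the bytes linearly from a known instruction boundary.

    Yields (offset, mnemonic, raw_bytes). The only way to find the next
    instruction is to look up the length of the current one.
    """
    pc = start
    while pc < len(code):
        opcode = code[pc]
        if opcode == 0xCB:
            # On the Game Boy every CB-prefixed instruction is two bytes.
            name, length = "CB-prefixed op", 2
        elif opcode in OPCODES:
            name, length = OPCODES[opcode]
        else:
            raise ValueError(f"opcode {opcode:#04x} not in this partial table")
        yield pc, name, bytes(code[pc:pc + length])
        pc += length


# 0x00 and 0xC3 appear here as operands first, and only later as an opcode.
rom = bytes([0x3E, 0x00, 0x06, 0xC3, 0xC3, 0x00, 0x01, 0xC9])
for offset, name, raw in split_instructions(rom):
    hex_bytes = " ".join(f"{b:02x}" for b in raw)
    print(f"{offset:04x}  {hex_bytes:<9} {name}")
```

Note that the `0x00` at offset 1 and the `0xC3` at offset 3 are never treated as opcodes, simply because the preceding opcode claimed them as operands.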
If you're writing an emulator you don't really ever need to be able to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB page opcode or whatever and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive descent parser.
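In emulator form that usually looks something like the sketch below. `TinyCPU` and its method names are hypothetical, and only four opcodes are handled; the point is just that each opcode handler pulls its own operands through the program counter, so instruction boundaries fall out for free.

```python
class TinyCPU:
    """Hypothetical CPU core that understands only four Game Boy opcodes."""

    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.a = 0
        self.b = 0

    def fetch8(self):
        # Read the byte at PC and advance PC past it.
        value = self.memory[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return value

    def fetch16(self):
        # 16-bit operands are stored little-endian: low byte first.
        low = self.fetch8()
        high = self.fetch8()
        return (high << 8) | low

    def step(self):
        # We always arrive here expecting a fresh opcode.
        opcode = self.fetch8()
        if opcode == 0x00:      # NOP
            pass
        elif opcode == 0x3E:    # LD A,d8
            self.a = self.fetch8()
        elif opcode == 0x06:    # LD B,d8
            self.b = self.fetch8()
        elif opcode == 0xC3:    # JP a16
            self.pc = self.fetch16()
        else:
            raise NotImplementedError(f"opcode {opcode:#04x}")


# LD A,0x12 ; JP 0x0006 ; (one skipped byte) ; LD B,0x34
cpu = TinyCPU(bytes([0x3E, 0x12, 0xC3, 0x06, 0x00, 0x00, 0x06, 0x34]))
for _ in range(3):
    cpu.step()
print(hex(cpu.a), hex(cpu.b), cpu.pc)
```

The byte at offset 5 is never looked at, because the jump moved the program counter past it. The emulator never has to decide whether it was code or data.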
To get to a full disassembler, most people implement some mild simulation, recursively tracking control flow from a known entry point. Instructions are found that way; whatever is never reached is then assumed, by deduction, to be data.
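A rough sketch of that flow-tracking idea, again in Python with a partial opcode table of my own. `trace_code` is a hypothetical name, and a real tool would also have to cope with conditional branches, computed jumps, and bank switching, none of which are handled here.

```python
# opcode -> (mnemonic, length in bytes, what happens to control flow).
# A partial, illustrative table only.
OPS = {
    0x00: ("NOP", 1, "next"),
    0x3E: ("LD A,d8", 2, "next"),
    0xC3: ("JP a16", 3, "jump"),
    0xCD: ("CALL a16", 3, "call"),
    0xC9: ("RET", 1, "stop"),
}


def trace_code(rom, entry=0):
    """Return {offset: mnemonic} for every instruction reachable from entry."""
    found = {}
    pending = [entry]  # addresses we know are instruction starts
    while pending:
        pc = pending.pop()
        while 0 <= pc < len(rom) and pc not in found and rom[pc] in OPS:
            name, length, flow = OPS[rom[pc]]
            if pc + length > len(rom):
                break  # truncated instruction at the end of the image
            found[pc] = name
            if flow in ("jump", "call"):
                # The target is also code; queue it for later.
                pending.append(rom[pc + 1] | (rom[pc + 2] << 8))
            if flow in ("jump", "stop"):
                break  # execution does not fall through
            pc += length
    return found


rom = bytes([
    0xC3, 0x05, 0x00,  # 0000: JP 0x0005
    0x00, 0xC9,        # 0003: data that happens to look like NOP, RET
    0x3E, 0x01,        # 0005: LD A,0x01
    0xCD, 0x0B, 0x00,  # 0007: CALL 0x000B
    0xC9,              # 000A: RET
    0x00,              # 000B: NOP
    0xC9,              # 000C: RET
])
code = trace_code(rom)
covered = {pc + i for pc in code for i in range(OPS[rom[pc]][1])}
data = [i for i in range(len(rom)) if i not in covered]
print(sorted(code), data)
```

Offsets 3 and 4 are never reached by any path, so they end up classified as data even though they would decode as perfectly valid instructions.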
Not so much on the GB where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space or to effect disassembly-proof code by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same thing might be achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even re-used some of the loader text — regular ASCII — as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it and a disassembler therefore usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.
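The branch-into-the-middle trick is easy to see with a toy example. This is only an illustration using Game Boy opcodes and a hypothetical `decode_from` helper, not a reconstruction of any real protection scheme: the same four bytes give two different instruction streams depending on where decoding starts.

```python
# opcode -> (mnemonic, length in bytes); a tiny illustrative subset.
LENGTHS = {
    0x00: ("NOP", 1),
    0x06: ("LD B,d8", 2),
    0x3E: ("LD A,d8", 2),
    0xC9: ("RET", 1),
}


def decode_from(code, start):
    """Decode linearly from start, returning [(offset, mnemonic), ...]."""
    out = []
    pc = start
    while pc < len(code):
        name, length = LENGTHS[code[pc]]
        out.append((pc, name))
        pc += length
    return out


blob = bytes([0x06, 0x3E, 0x00, 0xC9])
print(decode_from(blob, 0))  # entering at the first byte
print(decode_from(blob, 1))  # entering one byte later, inside the first opcode
```

Entering at offset 0 gives `LD B,d8`, `NOP`, `RET`; entering at offset 1 gives `LD A,d8`, `RET`. A disassembler has to pick one listing, while an emulator simply executes whichever stream the program counter lands in.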