Loading program from RAM in 8086

The 8086 uses 16-bit instructions, but each RAM address holds only 8 bits. How does the CPU load programs from RAM, then? Does it load one address and then check whether the instruction needs 1, 2, or 3 bytes (e.g. moving an 8- or 16-bit immediate to a register) and then execute the operation, or am I getting it wrong, and one RAM 'space' is actually 16 bits big?

Many instructions are multi-byte, and yes, that means they span two or more addresses.

IIRC, the 8086's data bus is 16 bits wide, so it can load 16 bits (two adjacent addresses) in a single operation. You're confusing byte-addressable memory with the bus width.
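To make the distinction concrete, here is a minimal Python sketch of byte-addressable memory with a 16-bit read on top of it. The names `memory`, `read_byte`, and `read_word` are made up for illustration. It only models addressing and little-endian byte order, not real bus timing.

```python
# Byte-addressable memory: every address holds exactly one byte.
memory = bytearray(16)

# mov ax, 0x1234 is encoded as B8 34 12: the opcode byte, then the
# 16-bit immediate with its low byte first (little-endian).
memory[0:3] = bytes([0xB8, 0x34, 0x12])

def read_byte(addr):
    """Return the single byte stored at addr."""
    return memory[addr]

def read_word(addr):
    """Return the 16-bit little-endian value stored at addr and addr + 1."""
    return memory[addr] | (memory[addr + 1] << 8)

print(hex(read_byte(0)))  # 0xb8   (the opcode)
print(hex(read_word(0)))  # 0x34b8 (opcode plus low byte of the immediate)
print(hex(read_word(1)))  # 0x1234 (the immediate itself)
```

The instruction occupies three addresses, yet a single 16-bit read can pick up two of those bytes at once. That is all "16-bit bus" means here; each address still names one byte.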

Does it load one address and then check whether the instruction needs 1, 2, or 3 bytes (e.g. moving an 8- or 16-bit immediate to a register)

It continually fetches instruction bytes into a 6-byte buffer (2 bytes at a time, because it's a 16-bit CPU with a 16-bit data bus). The buffer is large enough to hold the longest allowed 8086 instruction (excluding prefixes, which might be decoded separately, IDK). When it's done executing the previous instruction, it looks at the buffer. See the link below for a better description, but it probably tries to decode the buffer contents as a whole instruction. If it hits the end of the fetched bytes before finding the end of the instruction, it waits until the next fetch cycle has completed and tries again.

See also: 8086 CPU architecture, which was the first hit for "8086 code fetch". It confirms that fetch and execute do overlap, so it's pipelined in the most basic way.

TL;DR: It fetches into a buffer until it has a whole instruction to decode. Then it shifts any extra bytes to the front of the buffer, because they're part of the next instruction.
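The fetch-then-decode loop described above can be sketched as a toy model in Python. This is not how the real hardware is wired: it assumes a tiny subset of opcodes whose length is known from the first byte alone (`push r16`, `nop`, `mov r8, imm8`, `mov r16, imm16`), it ignores ModRM bytes and prefixes, and it does not overlap fetch with execution. The names `instruction_length`, `decode_stream`, `QUEUE_SIZE`, and `FETCH_WIDTH` are invented for this sketch.

```python
QUEUE_SIZE = 6    # the 8086's buffer holds up to 6 bytes
FETCH_WIDTH = 2   # one 16-bit fetch brings in 2 bytes

def instruction_length(opcode):
    """Length in bytes for a small subset of 8086 opcodes (an assumption
    of this sketch: the first byte alone determines the length)."""
    if 0x50 <= opcode <= 0x57:   # push r16
        return 1
    if opcode == 0x90:           # nop
        return 1
    if 0xB0 <= opcode <= 0xB7:   # mov r8, imm8
        return 2
    if 0xB8 <= opcode <= 0xBF:   # mov r16, imm16
        return 3
    raise ValueError(f"opcode {opcode:#04x} not covered by this sketch")

def decode_stream(code):
    """Split code into instructions by fetching 2 bytes at a time."""
    queue = bytearray()
    fetch_addr = 0
    decoded = []
    while True:
        # If the queue already holds a complete instruction, consume it.
        if queue:
            need = instruction_length(queue[0])
            if len(queue) >= need:
                decoded.append(bytes(queue[:need]))
                del queue[:need]  # leftover bytes move to the front
                continue
        # Otherwise fetch more bytes, or stop when the code runs out.
        if fetch_addr >= len(code):
            break
        room = QUEUE_SIZE - len(queue)
        chunk = code[fetch_addr:fetch_addr + min(FETCH_WIDTH, room)]
        queue += chunk
        fetch_addr += len(chunk)
    return decoded

# push ax; mov al, 0x12; mov ax, 0x1234; nop
program = bytes([0x50, 0xB0, 0x12, 0xB8, 0x34, 0x12, 0x90])
print([insn.hex() for insn in decode_stream(program)])
# ['50', 'b012', 'b83412', '90']
```

Note how the first fetch brings in `50 B0`: the `push` is complete after one byte, and the leftover `B0` stays at the front of the queue until another fetch supplies its immediate.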

I've read that instruction fetch is usually the bottleneck on the 8086, so optimizing for code size outweighs pretty much everything else.

A more deeply pipelined CPU wouldn't have to wait for execution of the previous instruction to finish before it starts decoding the next one. Modern CPUs also have much higher code-fetch bandwidth, so they have a queue of decoded instructions ready to go (except when branches mess this up). See http://agner.org/optimize/, and other links in the x86 tag wiki.

Also, some very common instructions are a single byte, like push r16.
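As a small illustration of how compact that is: `push r16` is encoded as the single byte `0x50` plus the register number. A sketch in Python, where the helper name `encode_push` and the list `REG16` are made up for this example:

```python
# 16-bit register numbers as used in 8086 opcode encodings.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def encode_push(reg):
    """Return the one-byte encoding of push r16 (0x50 + register number)."""
    return bytes([0x50 + REG16.index(reg)])

print(encode_push("ax").hex())  # 50
print(encode_push("bx").hex())  # 53
```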