The Unicode standard defines code points up to U+10FFFF, which is more than fits in 2 bytes, and the fixed-width UTF-32 encoding spends 4 bytes on every one of them. Yet the UTF-8 encoding somehow squeezes most text into much less space by using something called "variable-width encoding".
In fact, it manages to represent all 128 characters of US-ASCII in just one byte each, and that byte looks exactly like real ASCII, so you can interpret any pure ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it work?
I'm going to ask and answer my own question here because I just did a bit of reading to figure it out and I thought it might save somebody else some time. Plus maybe somebody can correct me if I've got some of it wrong.
Each byte starts with a few bits that tell you whether it's a single-byte code point, the start of a multi-byte code point, or a continuation of a multi-byte code point. Like this:
0xxx xxxx A single-byte US-ASCII code (one of the 128 code points U+0000 to U+007F)
The multi-byte code-points each start with a few bits that essentially say "hey, you need to also read the next byte (or two, or three) to figure out what I am." They are:
110x xxxx One more byte follows
1110 xxxx Two more bytes follow
1111 0xxx Three more bytes follow
Finally, the bytes that follow those start codes all look like this:
10xx xxxx A continuation of one of the multi-byte characters
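To make the bit patterns above concrete, here is a minimal Python sketch that classifies a byte by its leading bits and encodes a code point by hand. The names `classify_byte` and `encode_code_point` are made up for this example, and the encoder skips checks a real one needs (it does not reject surrogates, for instance), so treat it as an illustration and not a reference implementation.

```python
def classify_byte(b):
    """Return the role of one UTF-8 byte, judged only by its leading bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxx xxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xx xxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110x xxxx, one more byte follows
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110 xxxx, two more bytes follow
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 1111 0xxx, three more bytes follow
    return "invalid"           # 1111 1xxx never appears in UTF-8


def encode_code_point(cp):
    """Encode one code point into UTF-8 bytes (sketch: no surrogate check)."""
    if cp < 0x80:                      # fits in 7 payload bits
        return bytes([cp])
    if cp < 0x800:                     # fits in 5 + 6 payload bits
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp < 0x10000:                   # fits in 4 + 6 + 6 payload bits
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # fits in 3 + 6 + 6 + 6 payload bits
        return bytes([0b1111_0000 | (cp >> 18),
                      0b1000_0000 | ((cp >> 12) & 0x3F),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


# Compare the hand-rolled encoder against Python's built-in one.
for ch in ["A", "é", "€", "😀"]:
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```

Checking the result against `str.encode("utf-8")` is an easy way to convince yourself the bit layout is right.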
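Since you can tell what kind of byte you're looking at from its first few bits, a decoder that hits a mangled byte can skip ahead to the next byte that isn't a continuation and resume there, so you lose only the damaged character and not the rest of the sequence.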
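Here is a small sketch of that recovery idea in Python. The helper name `resync` is invented for this example. It assumes the damage is a missing lead byte and simply skips stray continuation bytes until it reaches a byte that can start a character.

```python
def resync(data, pos):
    """Return the index of the first byte at or after pos that is not a
    continuation byte (10xx xxxx), i.e. the next possible character start."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


encoded = "a€b".encode("utf-8")   # bytes: 61 E2 82 AC 62
damaged = encoded[2:]             # lead byte of the euro sign is lost: 82 AC 62

start = resync(damaged, 0)        # skips the two orphaned continuation bytes
recovered = damaged[start:].decode("utf-8")
print(recovered)                  # prints "b"
```

Only the euro sign is lost. Everything after it decodes normally.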
RFC 3629, "UTF-8, a transformation format of ISO 10646", is the final authority here and has all the explanations.
In short, a single character is encoded as a sequence of 1 to 4 bytes, and several bits in each byte indicate whether it's a leading byte or a trailing byte and, for a leading byte, how many bytes follow. The remaining bits contain the payload.
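The split between marker bits and payload bits can be sketched as a tiny decoder. This is an illustration under simplifying assumptions and not what RFC 3629 requires of a conforming decoder: the function name `decode_one` is hypothetical, and it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```python
def decode_one(data, pos=0):
    """Decode one code point starting at pos; return (code_point, next_pos).

    Sketch only: overlong forms, surrogates and out-of-range values
    are not rejected here.
    """
    first = data[pos]
    if first < 0x80:                              # 0xxx xxxx: 7 payload bits
        return first, pos + 1
    if (first & 0b1110_0000) == 0b1100_0000:      # 110x xxxx: 5 payload bits
        extra, cp = 1, first & 0b0001_1111
    elif (first & 0b1111_0000) == 0b1110_0000:    # 1110 xxxx: 4 payload bits
        extra, cp = 2, first & 0b0000_1111
    elif (first & 0b1111_1000) == 0b1111_0000:    # 1111 0xxx: 3 payload bits
        extra, cp = 3, first & 0b0000_0111
    else:
        raise ValueError("not a valid leading byte")

    tail = data[pos + 1 : pos + 1 + extra]
    if len(tail) != extra:
        raise ValueError("truncated sequence")
    for b in tail:
        if (b & 0b1100_0000) != 0b1000_0000:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (b & 0b0011_1111)        # append 6 payload bits
    return cp, pos + 1 + extra


code_point, next_pos = decode_one("€".encode("utf-8"))
print(hex(code_point), next_pos)                  # prints "0x20ac 3"
```

Each trailing byte contributes 6 payload bits, so a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 bits, enough for every code point up to U+10FFFF.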
UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
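Excerpt from "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". Note that the 6-byte upper limit in the excerpt comes from the original UTF-8 design; RFC 3629 restricts UTF-8 to at most 4 bytes per code point.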