What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.
In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTFs. Why would this be the case?
Please explain in simple terms.
This article explains all the details: http://kunststube.net/encoding/
WRITING TO BUFFER
If you write the symbol
あ
to a 4-byte buffer with UTF-8 encoding, your binary will look like this:
00000000 11100011 10000001 10000010
If you write the symbol
あ
to a 4-byte buffer with UTF-16 encoding, your binary will look like this:
00000000 00000000 00110000 01000010
As you can see, depending on what language your content uses, this will affect your memory accordingly.
e.g. For this particular symbol:
あ
the UTF-16 encoding is more efficient, since we have 2 spare bytes to use for the next symbol. But that doesn't mean you must use UTF-16 for the Japanese alphabet.
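If you want to try this yourself, here is a rough Python sketch of those two writes (I use the big-endian form of UTF-16 to match the bit patterns above; the variable names are just illustrative):

    text = "あ"

    utf8_bytes = text.encode("utf-8")       # 3 bytes: 11100011 10000001 10000010
    utf16_bytes = text.encode("utf-16-be")  # 2 bytes: 00110000 01000010

    print(utf8_bytes.hex(" "))    # e3 81 82
    print(utf16_bytes.hex(" "))   # 30 42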
READING FROM BUFFER
Now if you want to read the above bytes back, you have to know which encoding they were written in and decode them correctly.
e.g. If you decode this: 00000000 11100011 10000001 10000010 as UTF-16, you will end up with
臣
not あ
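A rough Python sketch of that mismatch (here the first two UTF-8 bytes of あ are read back as one little-endian UTF-16 code unit, which is one way to end up with 臣):

    utf8_bytes = "あ".encode("utf-8")           # 11100011 10000001 10000010

    print(utf8_bytes.decode("utf-8"))           # あ  -- decoded with the right encoding
    print(utf8_bytes[:2].decode("utf-16-le"))   # 臣 (U+81E3) -- decoded with the wrong one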
Note: Encoding and Unicode are two different things. Unicode is the big table with each symbol mapped to a unique code point. e.g. the
あ
symbol (letter) has the code point 30 42 (hex). Encoding, on the other hand, is an algorithm that converts symbols into a more suitable form for storing on hardware.
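A small Python sketch of that distinction, if it helps: one code point from the Unicode table, several byte layouts depending on which encoding you choose:

    ch = "あ"
    print(hex(ord(ch)))                     # 0x3042 -- the code point from the Unicode table

    # The same code point, stored differently depending on the encoding:
    print(ch.encode("utf-8").hex(" "))      # e3 81 82
    print(ch.encode("utf-16-be").hex(" "))  # 30 42
    print(ch.encode("utf-32-be").hex(" "))  # 00 00 30 42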
UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, in formats not covered by the basic ASCII used earlier; hence, UTF came into existence.
UTF-8 and UTF-16 are both character encodings: UTF-8's code unit is 8 bits, while UTF-16's code unit is 16 bits.
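To see what that means in practice, here is a small Python sketch counting code units for a few characters:

    # UTF-8 counts in 8-bit code units (bytes), UTF-16 in 16-bit code units:
    for s in ("A", "あ", "😀"):
        print(s, len(s.encode("utf-8")), "UTF-8 code units,",
              len(s.encode("utf-16-be")) // 2, "UTF-16 code units")
    # A: 1 and 1,  あ: 3 and 1,  😀: 4 and 2 (a surrogate pair)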
Why Unicode? Because ASCII has just 128 characters. Those from 128 to 255 differ between countries; that's why there are codepages. So they said: let's have code points up to 1114111.

So how do you store the highest codepoint? You'll need 21 bits to store it, so you'll use a DWORD having 32 bits, with 11 bits wasted. If you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches the codepoint exactly. But DWORD arrays are of course larger than WORD arrays, and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16.

But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest codepoint 1114111 fit into a WORD? It cannot! So they put everything higher than 65535 into a DWORD, which they call a surrogate pair. Such a surrogate pair is two WORDs and can be detected by looking at the first 6 bits.

And what about UTF-8? It is a byte array or byte stream, but how can the highest codepoint 1114111 fit into a byte? It cannot! Okay, so they put in a DWORD as well, right? Or possibly a WORD, right? Almost right! They invented UTF-8 sequences, which means that every codepoint higher than 127 must be encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence, and what starts with 11110 is a four-byte sequence. The remaining bits of these so-called "start bytes" belong to the codepoint. Depending on the sequence, a number of following bytes must follow. A following byte starts with 10; its remaining 6 bits are payload bits and belong to the codepoint. Concatenate the payload bits of the start byte and the following byte(s) and you'll have the codepoint. That's all the magic of UTF-8.
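Those sequence rules are easy to see in a few lines of Python. This is just a hand-rolled sketch (the function name is my own invention); any real UTF-8 decoder does the same thing internally:

    def decode_utf8_sequence(data):
        """Decode one UTF-8 sequence from the start of `data`, return its code point."""
        first = data[0]
        if first < 0x80:                  # 0xxxxxxx: plain ASCII, a single byte
            return first
        elif first >> 5 == 0b110:         # 110xxxxx: two-byte sequence, 5 payload bits
            length, codepoint = 2, first & 0b00011111
        elif first >> 4 == 0b1110:        # 1110xxxx: three-byte sequence, 4 payload bits
            length, codepoint = 3, first & 0b00001111
        elif first >> 3 == 0b11110:       # 11110xxx: four-byte sequence, 3 payload bits
            length, codepoint = 4, first & 0b00000111
        else:
            raise ValueError("not a valid start byte")
        for byte in data[1:length]:       # each following byte is 10xxxxxx
            codepoint = (codepoint << 6) | (byte & 0b00111111)  # append its 6 payload bits
        return codepoint

    print(hex(decode_utf8_sequence("あ".encode("utf-8"))))   # 0x3042
    print(hex(decode_utf8_sequence("😀".encode("utf-8"))))   # 0x1f600

    # And the UTF-16 side: a code point above 65535 becomes a surrogate pair (two WORDs):
    print("😀".encode("utf-16-be").hex(" "))                 # d8 3d de 00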