I've been experimenting with a custom string object (struct) which looks like this:
typedef struct
{
    int encoding;
    int length;
    character * array;
} EncodedString;
The idea is that by specifying the encoding, I can make a few functions which use that encoding to print the string correctly, e.g. ASCII, UTF-8, UTF-16, etc. (Excuse my character-encoding ignorance.)
Right now, I'm trying to print out one (Mandarin) Chinese character: 狗 (0x72d7). I thought that printing it byte by byte might work, but obviously it doesn't: it printed just "r?" (0x72 and 0xd7, respectively). So how can I amend this program so that it prints the character?
#include <stdio.h>

typedef unsigned char character;

typedef struct
{
    int encoding;
    int length;
    character * array;
} EncodedString;

void printString(EncodedString str);

int main(void)
{
    character doginmandarin[] = {0x72U, 0xd7U};
    EncodedString mystring = {0, sizeof doginmandarin, doginmandarin};

    printString(mystring);
    printf("\n");

    return 0;
}

void printString(EncodedString str) // <--- where I try to print the character
{
    int i;
    for(i = 0; i < str.length; i++)
    {
        printf("%c", str.array[i]);
    }
}
Ideally, I would prefer the array containing the characters to hold only unsigned chars, which means separating the two bytes that make up the character 狗. Although it serves no purpose yet, the idea is to use the encoding field of the EncodedString struct to determine how many bytes each character occupies.
How can this be implemented with the least amount of hacks?
The number 0x72d7 is the Unicode code point (an abstract number) for the character you want to print. When represented in memory with the two bytes 0x72, 0xd7, it becomes the UCS-2 code for that character, which also happens to be its UTF-16 encoding. But your terminal is probably expecting UTF-8 encoded characters. The correct UTF-8 encoding for the code point 0x72d7 is the three bytes 0xe7, 0x8b, 0x97.
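You can check this by printing those three bytes directly (a minimal sketch, assuming your terminal is set to UTF-8):

#include <stdio.h>

int main(void)
{
    /* UTF-8 encoding of the code point 0x72d7, plus a terminating NUL. */
    const unsigned char dog_utf8[] = {0xe7, 0x8b, 0x97, 0x00};

    /* On a UTF-8 terminal this prints 狗; on other terminals, garbage. */
    printf("%s\n", (const char *) dog_utf8);
    return 0;
}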
You could fix your code to use UTF-8 encoded characters, but this encoding is impractical as an in-memory representation since it produces different numbers of bytes for different characters. This makes simple string operations, like getting the nth character, very complicated. Instead, fixed-length representations are often used; for example, UCS-2 always uses two bytes per character. The conversion to the external representation encoding is then done as late as possible, just before printing the strings.
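For example, your printString could treat the array as big-endian UCS-2 (two bytes per character, matching your doginmandarin initializer) and convert each code point to UTF-8 just before printing. This is only a rough sketch: the encodeUTF8 helper is a name I made up, and it only handles code points below 0x10000 (no surrogate pairs):

#include <stdio.h>

typedef unsigned char character;

typedef struct
{
    int encoding;
    int length;
    character * array;
} EncodedString;

/* Hypothetical helper: writes the UTF-8 bytes for one code point
 * below 0x10000 into out, returning how many bytes were written. */
static int encodeUTF8(unsigned int cp, character out[3])
{
    if (cp < 0x80) {            /* 1 byte: 0xxxxxxx */
        out[0] = (character) cp;
        return 1;
    }
    if (cp < 0x800) {           /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (character) (0xc0 | (cp >> 6));
        out[1] = (character) (0x80 | (cp & 0x3f));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (character) (0xe0 | (cp >> 12));
    out[1] = (character) (0x80 | ((cp >> 6) & 0x3f));
    out[2] = (character) (0x80 | (cp & 0x3f));
    return 3;
}

/* Treat the array as big-endian UCS-2: two bytes per character. */
void printString(EncodedString str)
{
    int i;
    for (i = 0; i + 1 < str.length; i += 2)
    {
        unsigned int cp = ((unsigned int) str.array[i] << 8) | str.array[i + 1];
        character utf8[3];
        int n = encodeUTF8(cp, utf8);
        fwrite(utf8, 1, (size_t) n, stdout);
    }
}

int main(void)
{
    character doginmandarin[] = {0x72U, 0xd7U};
    EncodedString mystring = {0, sizeof doginmandarin, doginmandarin};
    printString(mystring);    /* prints 狗 on a UTF-8 terminal */
    printf("\n");
    return 0;
}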
EDIT (from the comments)
UTF-8 is a tricky encoding. The mapping from code points to UTF-8 bytes is not trivial and involves some bitwise mumbo-jumbo. It's a kind of prefix code: different prefixes in the first byte tell how many bytes the character will occupy, and all the following bytes start with the bits 10 so that malformed UTF-8 can be detected. It's described here: http://en.wikipedia.org/wiki/UTF-8#Description
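As a worked example (my own arithmetic, following that description): the 16 bits of 0x72d7 are 0111 0010 1101 0111, which requires the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx:

byte 1: 0xe0 | (0x72d7 >> 12)          = 0xe0 | 0x07 = 0xe7
byte 2: 0x80 | ((0x72d7 >> 6) & 0x3f)  = 0x80 | 0x0b = 0x8b
byte 3: 0x80 | (0x72d7 & 0x3f)         = 0x80 | 0x17 = 0x97

which matches the bytes given above.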
In order to find the three bytes quickly for my post, I just typed this in a Python console: u"\u72d7".encode('UTF-8')
You should probably look into the C library functions that deal with wide characters (wchar_t) and multibyte strings. The C library implementation on Linux (or Windows, as far as I know) is compatible with Unicode. (If you need this on your microcontroller board, you might be out of luck, though.) Most of the things that deal with UTF-8 encodings and Unicode are already in there, so you do not need to do it yourself.
Here is an example of how you could deal with one character:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main ()
{
    /*
     * Use a UTF-8 compatible locale.
     */
    setlocale (LC_ALL, "en_US.utf8");

    const wchar_t dog = 0x72d7;

    /*
     * wchar_t strings can contain any character. Create one
     * string containing only the dog.
     */
    wchar_t in[2] = { dog, 0 };
    char out[100];

    /*
     * Convert to a multibyte string; returns the number of bytes written.
     */
    size_t len = wcstombs (out, in, sizeof out);

    printf ("the character '%lc' is %zu bytes (string: '%s')\n", dog, len, out);
}
Output:
$ ./a.out
the character '狗' is 3 bytes (string: '狗')