I can't think of a way to remove the leading zeros. My goal was, in a for loop, to create the UTF-8 and UTF-32 versions of each number.
For example, with UTF-8, wouldn't I have to remove the leading zeros? Does anyone have a solution for how to pull this off? Basically what I am asking is: does someone have an easy solution to convert Unicode code points to UTF-8?
for (unsigned int i = 0x0; i < 0xffff; i++) {
    printf("%#x \n", i);
    // convert to UTF-8 here
}
So here is an example of what I am trying to accomplish for each i.
- For example: Unicode value U+0760 (base 16) would convert to UTF-8 as
- in binary: 1101 1101 1010 0000
- in hex: DD A0
Basically, what I am trying to do for every i is convert it to its hex equivalent in UTF-8.
The problem I am running into is that the process for converting Unicode to UTF-8 seems to involve removing the leading 0s from the binary representation, and I am not sure how to do that dynamically.
There are many ways to do this fun exercise of converting a code point to UTF-8.
So as not to give all the coding fun away, the following is pseudo code to get OP started.
Converting to UTF-32 is trivial, it's just the Unicode code point.
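For example, something like this (a sketch; code_point is a placeholder name):

#include <wchar.h>

wint_t code_point = 0x0760;
wint_t utf32 = code_point;   /* UTF-32 is simply the code point value */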
Note that I'm using wint_t, the w for "wide". That's an integer which is guaranteed to be large enough to hold any wchar_t as well as EOF. wchar_t (wide character) is guaranteed to be wide enough to support all system locales.

Converting to UTF-8 is a bit more complicated because of its codepage layout, designed to be compatible with 7-bit ASCII. Some bit shifting is required.
Start with the UTF-8 table.
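For reference, that table (from the Wikipedia UTF-8 page) lays out like this:

Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000  - U+007F       0xxxxxxx
U+0080  - U+07FF       110xxxxx   10xxxxxx
U+0800  - U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF     11110xxx   10xxxxxx   10xxxxxx   10xxxxxx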
Turn that into a big if/else if statement.
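For instance (a sketch; bytes, num_bytes and code_point are placeholder names):

unsigned char bytes[4];
int num_bytes = 0;

if (code_point <= 0x7F) {
    /* one byte */
} else if (code_point <= 0x07FF) {
    /* two bytes */
} else if (code_point <= 0xFFFF) {
    /* three bytes */
} else if (code_point <= 0x10FFFF) {
    /* four bytes */
} else {
    /* not a valid code point */
}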
And start filling in the blanks. The first one is easy, it's just the code point.
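Something like (continuing the sketch above):

if (code_point <= 0x7F) {
    bytes[0] = code_point;   /* 7-bit ASCII passes through unchanged */
    num_bytes = 1;
}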
To do the next one, we need to apply a bit mask and do some bit shifting. C doesn't support binary literals, so I converted the binary into hex using
perl -wle 'printf("%x\n", 0b1100000010000000)'
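That 0xC080 is the two-byte template 110xxxxx 10xxxxxx as a single value. One way to fill in this blank (still a sketch):

} else if (code_point <= 0x07FF) {
    /* 0xC080 == 0b1100000010000000, the two-byte template */
    bytes[0] = 0xC0 | (code_point >> 6);    /* 110xxxxx plus the top 5 bits */
    bytes[1] = 0x80 | (code_point & 0x3F);  /* 10xxxxxx plus the low 6 bits */
    num_bytes = 2;
}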
I'll leave the rest to you.
We can test this with various interesting values that touch each piece of logic.
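For example (expected bytes taken from the Wikipedia examples, covering the one, two, three and four byte encodings):

U+0024  -> 24
U+00A2  -> C2 A2
U+20AC  -> E2 82 AC
U+10348 -> F0 90 8D 88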
This is an interesting exercise, but if you want to do this for real, use a pre-existing library. GLib (from the GNOME project) has Unicode manipulation functions, and a lot of the other missing pieces of C.
As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as one to four bytes.
Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are explicitly unsigned for a reason (negative code points are not considered at all), and that the code assumes an unsigned int code point -- the compiler does not care, and probably because of that this practice seems odd and confusing even to long-term members here, so I give up and stop trying to include such reminders. :( )

You supply it with an unsigned char array, four chars or larger, and the Unicode code point. The function will return how many chars were needed to encode the code point in UTF-8, and were assigned in the array. The function will return 0 (not encoded) for codes above 0x10FFFF, but it does not otherwise check that the Unicode code point is valid. I.e., it is a simple encoder, and all it knows about Unicode is that the code points are from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.

Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.
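A sketch matching that description (the original code is not reproduced here, so treat this as a reconstruction; the name to_utf8 is illustrative):

#include <stddef.h>

/* Encode code point 'code' into 'u' (at least 4 chars).
   Returns the number of chars used, or 0 if code > 0x10FFFF. */
size_t to_utf8(unsigned char *u, unsigned int code)
{
    if (code < 0x80) {
        u[0] = code;                          /* 0xxxxxxx */
        return 1;
    } else if (code < 0x800) {
        u[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
        u[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 2;
    } else if (code < 0x10000) {
        u[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
        u[1] = 0x80 | ((code >> 6) & 0x3F);
        u[2] = 0x80 | (code & 0x3F);
        return 3;
    } else if (code <= 0x10FFFF) {
        u[0] = 0xF0 | (code >> 18);           /* 11110xxx */
        u[1] = 0x80 | ((code >> 12) & 0x3F);
        u[2] = 0x80 | ((code >> 6) & 0x3F);
        u[3] = 0x80 | (code & 0x3F);
        return 4;
    }
    return 0;                                 /* not encoded */
}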
You need to write a function that prints out the 8 least significant bits in each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit chars). Then, use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit function for each unsigned char in the array, in increasing order, for the count of unsigned chars the above conversion function returned for that code point.
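A minimal sketch of that driver, building on the to_utf8() sketch above (print_bits is an illustrative name):

#include <stdio.h>
#include <stddef.h>

/* Print the 8 least significant bits of c, most significant bit first. */
static void print_bits(unsigned char c)
{
    for (int bit = 7; bit >= 0; bit--)
        putchar(((c >> bit) & 1) ? '1' : '0');
    putchar(' ');
}

int main(void)
{
    unsigned char u[4];
    unsigned int code = 0x0760;      /* the example from the question */
    size_t i, n = to_utf8(u, code);

    printf("U+%04X -> ", code);
    for (i = 0; i < n; i++)          /* one group of 8 bits per UTF-8 byte */
        print_bits(u[i]);
    putchar('\n');                   /* prints 11011101 10100000 */
    return 0;
}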