I need to convert some strings encoded in the Latin-9 (ISO-8859-15) charset to UTF-8. I cannot use iconv because it is not included in my embedded system. Do you know of any available code for this?
Code points 1 to 127 are the same in both Latin-9 (ISO-8859-15) and UTF-8.

Code point 164 in Latin-9 is U+20AC, \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, \xc5\xb8 = 197 184 in UTF-8.

Code points 128 .. 191 (except for those listed above) in Latin-9 all map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.
Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.

This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.
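The byte counts above can be turned into a small length calculation. The following is only a minimal sketch; the names latin9_to_codepoint() and latin9_utf8_length() are made up for this illustration and are not taken from any library.

```c
#include <stddef.h>

/* Unicode code point for a single Latin-9 (ISO-8859-15) byte.
   Only the eight positions listed above differ from Latin-1.
   (Hypothetical helper name, for illustration only.) */
static unsigned int latin9_to_codepoint(unsigned char c)
{
    switch (c) {
    case 164: return 0x20AC; /* euro sign */
    case 166: return 0x0160; /* S with caron */
    case 168: return 0x0161; /* s with caron */
    case 180: return 0x017D; /* Z with caron */
    case 184: return 0x017E; /* z with caron */
    case 188: return 0x0152; /* OE ligature */
    case 189: return 0x0153; /* oe ligature */
    case 190: return 0x0178; /* Y with diaeresis */
    default:  return c;      /* same value as in Latin-1 / Unicode */
    }
}

/* Number of UTF-8 bytes needed for a zero-terminated Latin-9 string,
   not counting the terminating zero byte. */
size_t latin9_utf8_length(const char *src)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t len = 0;

    for (; *s; s++) {
        unsigned int cp = latin9_to_codepoint(*s);
        len += (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : 3;
    }
    return len;
}
```

Only the euro sign (164) reaches the three-byte branch, since it is the only Latin-9 character at or above U+0800.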
If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to -- you're working on an embedded system, after all -- you can then do the conversion in-place, by working backwards from the end towards the start.
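A rough sketch of that in-place approach, assuming the caller knows the total capacity of the buffer. The function name latin9_to_utf8_inplace() and its size parameter are assumptions made for this example.

```c
#include <stddef.h>

/* Convert the zero-terminated Latin-9 string in buf to UTF-8 in place.
   size is the total capacity of buf in bytes (an assumed parameter).
   Returns the UTF-8 length, or (size_t)-1 if the result plus its
   terminating zero byte does not fit in buf. */
size_t latin9_to_utf8_inplace(char *buf, size_t size)
{
    unsigned char *b = (unsigned char *)buf;
    size_t in_len = 0, out_len = 0;
    size_t r, w;

    /* Pass 1: scan forwards to find the input and output lengths. */
    while (b[in_len]) {
        unsigned char c = b[in_len++];
        out_len += (c < 128) ? 1 : (c == 164) ? 3 : 2;
    }
    if (out_len + 1 > size)
        return (size_t)-1;

    /* Pass 2: convert backwards. The write index never drops below the
       read index, so unread input is never overwritten. */
    r = in_len;
    w = out_len;
    b[w] = '\0';
    while (r > 0) {
        unsigned int cp = b[--r];

        switch (cp) {            /* the eight Latin-9 specific positions */
        case 164: cp = 0x20AC; break;
        case 166: cp = 0x0160; break;
        case 168: cp = 0x0161; break;
        case 180: cp = 0x017D; break;
        case 184: cp = 0x017E; break;
        case 188: cp = 0x0152; break;
        case 189: cp = 0x0153; break;
        case 190: cp = 0x0178; break;
        default: break;
        }

        if (cp < 0x80) {
            b[--w] = (unsigned char)cp;
        } else if (cp < 0x800) {
            b[--w] = (unsigned char)(0x80 | (cp & 0x3F));
            b[--w] = (unsigned char)(0xC0 | (cp >> 6));
        } else {
            b[--w] = (unsigned char)(0x80 | (cp & 0x3F));
            b[--w] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            b[--w] = (unsigned char)(0xE0 | (cp >> 12));
        }
    }
    return out_len;
}
```

In the worst case (a string made entirely of euro signs) the buffer needs three times the input length plus one byte for the terminator.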
Edit:

Here are two functions you can use for the conversion either way. These return a dynamically allocated copy you need to free() after use. They only return NULL when an error occurs (out of memory, errno == ENOMEM). If given a NULL or empty string to convert from, the functions return an empty dynamically allocated string.

In other words, you should always call free() on the pointer returned by these functions when you are done with them. (free(NULL) is allowed and does nothing.)

The latin9_to_utf8() has been verified to produce the exact same output as iconv if the input contains no zero bytes. The function uses standard C strings, i.e. zero byte indicates end of string.

The utf8_to_latin9() has been verified to produce the exact same output as iconv if the input contains only Unicode code points also in ISO-8859-15, and no zero bytes. When given random UTF-8 strings, the function maps the eight code points in Latin-1 to Latin-9 equivalents, i.e. currency sign to euro; iconv either ignores them or considers those errors.

The utf8_to_latin9() behaviour means that the functions are suitable for both Latin-1 -> UTF-8 -> Latin-1 and Latin-9 -> UTF-8 -> Latin-9 round-trips.

While iconv() is the correct solution for character set conversions in general, the two functions above are certainly useful in an embedded or otherwise constricted environment.

It should be relatively easy to create a conversion table from the 128-255 latin9 codes to UTF-8 sequences of bytes. You can even use iconv to do this. Or you can create a file with the 128-255 latin9 codes and convert it to UTF-8 using an appropriate text editor. Then you can use this data to build the conversion table.
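The table idea could look roughly like the sketch below. It is not the pair of functions described earlier: the name latin9_to_utf8_table() and the table layout are assumptions for illustration, and the table is filled at run time from the eight special cases instead of being generated with iconv or a text editor. It does follow the same contract of returning a malloc()ed copy, and an empty string for NULL input.

```c
#include <stdlib.h>
#include <string.h>

/* One table entry: the UTF-8 sequence for a Latin-9 byte in 128..255. */
struct latin9_entry {
    unsigned char len;       /* number of UTF-8 bytes, 2 or 3 */
    unsigned char bytes[3];
};

static struct latin9_entry latin9_table[128];  /* index: byte - 128 */
static int latin9_table_ready = 0;

/* Fill the table. Lazy initialisation like this is not thread-safe. */
static void latin9_table_init(void)
{
    /* The eight Latin-9 positions that differ from Latin-1. */
    static const struct { unsigned char byte; struct latin9_entry e; } special[] = {
        {164, {3, {0xE2, 0x82, 0xAC}}}, {166, {2, {0xC5, 0xA0, 0}}},
        {168, {2, {0xC5, 0xA1, 0}}},    {180, {2, {0xC5, 0xBD, 0}}},
        {184, {2, {0xC5, 0xBE, 0}}},    {188, {2, {0xC5, 0x92, 0}}},
        {189, {2, {0xC5, 0x93, 0}}},    {190, {2, {0xC5, 0xB8, 0}}},
    };
    size_t i;

    /* Default: same code point as Latin-1, encoded as two UTF-8 bytes. */
    for (i = 0; i < 128; i++) {
        unsigned int c = 128 + (unsigned int)i;
        latin9_table[i].len = 2;
        latin9_table[i].bytes[0] = (unsigned char)(0xC0 | (c >> 6));
        latin9_table[i].bytes[1] = (unsigned char)(0x80 | (c & 0x3F));
    }
    for (i = 0; i < sizeof special / sizeof special[0]; i++)
        latin9_table[special[i].byte - 128] = special[i].e;
    latin9_table_ready = 1;
}

/* Return a malloc()ed UTF-8 copy of a zero-terminated Latin-9 string.
   NULL input is treated as an empty string; NULL is returned only if
   malloc() fails. The caller must free() the result. */
char *latin9_to_utf8_table(const char *src)
{
    const unsigned char *s = (const unsigned char *)(src ? src : "");
    size_t len = 0, i;
    char *out, *p;

    if (!latin9_table_ready)
        latin9_table_init();

    for (i = 0; s[i]; i++)
        len += (s[i] < 128) ? 1 : latin9_table[s[i] - 128].len;

    out = malloc(len + 1);
    if (!out)
        return NULL;

    for (p = out, i = 0; s[i]; i++) {
        if (s[i] < 128) {
            *p++ = (char)s[i];
        } else {
            const struct latin9_entry *e = &latin9_table[s[i] - 128];
            memcpy(p, e->bytes, e->len);
            p += e->len;
        }
    }
    *p = '\0';
    return out;
}
```

The table costs 512 bytes of RAM here. On a very small target, a switch over the eight special bytes may be cheaper than a table, or the table could be precomputed and declared const so that it can be placed in flash.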