Conversion from ISO-8859-15 (Latin-9) to UTF-8?

I need to convert some strings encoded in the Latin-9 charset to UTF-8. I cannot use iconv because it is not included in my embedded system. Do you know of any available code for this?

Answer 1:

Code points 1 to 127 are the same in both Latin-9 (ISO-8859-15) and UTF-8.

Code point 164 in Latin-9 is U+20AC, \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, \xc5\xb8 = 197 184 in UTF-8.

Code points 128 .. 191 (except for those listed above) in Latin-9 all map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.

Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.

This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.
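
The byte counts above translate directly into a length calculation. The following is a minimal sketch; the function name latin9_utf8_length() is made up for illustration and is not part of the code further below.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes needed to hold the UTF-8 form
 * of a NUL-terminated Latin-9 string, not counting the trailing NUL.
 * A NULL pointer is treated like an empty string. */
size_t latin9_utf8_length(const char *latin9)
{
    const unsigned char *s = (const unsigned char *)latin9;
    size_t n = 0;

    if (!s)
        return 0;

    for (; *s; s++) {
        if (*s < 128)
            n += 1;     /* ASCII range: one byte */
        else if (*s == 164)
            n += 3;     /* euro sign, U+20AC: three bytes */
        else
            n += 2;     /* every other code point: two bytes */
    }

    return n;
}
```

Add one to the result for the terminating NUL when sizing a buffer.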

If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to -- you're working on an embedded system, after all -- you can then do the conversion in-place, by working backwards from the end towards the start, provided the buffer holding the string is large enough for the UTF-8 result.
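
The backwards in-place conversion could look roughly like this. It is only a sketch: the name latin9_to_utf8_inplace() and the convention of returning (size_t)-1 on failure are assumptions made for this example. Because every Latin-9 byte expands to at least one UTF-8 byte, the write position never drops below the read position, so no unread input is overwritten.

```c
#include <stddef.h>

/* Hypothetical helper: convert the NUL-terminated Latin-9 string in buf
 * to UTF-8 in place. size is the total capacity of buf in bytes.
 * Returns the UTF-8 length (excluding the NUL), or (size_t)-1 with buf
 * left unmodified if buf is NULL, unterminated, or too small. */
size_t latin9_to_utf8_inplace(char *buf, size_t size)
{
    unsigned char *b = (unsigned char *)buf;
    size_t len = 0, need = 0, i, d;

    if (!buf || size == 0)
        return (size_t)-1;

    /* Forward pass: measure the input and the required output length. */
    while (len < size && b[len]) {
        need += (b[len] < 128) ? 1 : (b[len] == 164) ? 3 : 2;
        len++;
    }
    if (len == size || need + 1 > size)
        return (size_t)-1;

    /* Backward pass: emit each UTF-8 sequence last byte first. */
    d = need;
    b[d] = 0;
    i = len;
    while (i > 0) {
        const unsigned char c = b[--i];

        if (c < 128) {
            b[--d] = c;
        } else if (c >= 192) {
            b[--d] = (unsigned char)(c - 64);
            b[--d] = 195;
        } else switch (c) {
            case 164: b[--d] = 172; b[--d] = 130; b[--d] = 226; break;
            case 166: b[--d] = 160; b[--d] = 197; break;
            case 168: b[--d] = 161; b[--d] = 197; break;
            case 180: b[--d] = 189; b[--d] = 197; break;
            case 184: b[--d] = 190; b[--d] = 197; break;
            case 188: b[--d] = 146; b[--d] = 197; break;
            case 189: b[--d] = 147; b[--d] = 197; break;
            case 190: b[--d] = 184; b[--d] = 197; break;
            default:  b[--d] = c;   b[--d] = 194; break;
        }
    }

    return need;
}
```

In the worst case (a string consisting only of euro signs) the buffer needs three bytes per input character plus one for the NUL.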

Edit:

Here are two functions you can use for the conversion either way. These return a dynamically allocated copy you need to free() after use. They only return NULL when an error occurs (out of memory, errno == ENOMEM). If given a NULL or empty string to convert from, the functions return an empty dynamically allocated string.

In other words, you should always call free() on the pointer returned by these functions when you are done with them. (free(NULL) is allowed and does nothing.)

The latin9_to_utf8() has been verified to produce the exact same output as iconv if the input contains no zero bytes. The function uses standard C strings, i.e. zero byte indicates end of string.

The utf8_to_latin9() has been verified to produce the exact same output as iconv if the input contains only Unicode code points that also exist in ISO-8859-15, and no zero bytes. When given arbitrary UTF-8 strings, the function maps the eight Latin-1 code points that Latin-9 replaced to the Latin-9 characters at the same byte values, e.g. the currency sign (U+00A4) becomes the euro sign; iconv either ignores those code points or treats them as errors.

The utf8_to_latin9() behaviour means that the functions are suitable for both Latin 1->UTF-8->Latin 1 and Latin 9->UTF-8->Latin9 round-trips.

#include <stdlib.h>     /* for malloc(), realloc() and free() */
#include <string.h>     /* for memset() */
#include <errno.h>      /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-15 to UTF-8.
*/
char *latin9_to_utf8(const char *const string)
{
    char   *result;
    size_t  n = 0;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s)
            if (*s < 128) {
                s++;
                n += 1;
            } else
            if (*s == 164) {
                s++;
                n += 3;
            } else {
                s++;
                n += 2;
            }
    }

    /* Allocate n+1 (to n+8) bytes for the converted string. */
    result = malloc((n | 7) + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    /* Clear the tail of the string, setting the trailing NUL. */
    memset(result + (n | 7) - 7, 0, 8);

    if (n) {
        const unsigned char  *s = (const unsigned char *)string;
        unsigned char        *d = (unsigned char *)result;

        while (*s)
            if (*s < 128) {
                *(d++) = *(s++);
            } else
            if (*s < 192) switch (*s) {
                case 164: *(d++) = 226; *(d++) = 130; *(d++) = 172; s++; break;
                case 166: *(d++) = 197; *(d++) = 160; s++; break;
                case 168: *(d++) = 197; *(d++) = 161; s++; break;
                case 180: *(d++) = 197; *(d++) = 189; s++; break;
                case 184: *(d++) = 197; *(d++) = 190; s++; break;
                case 188: *(d++) = 197; *(d++) = 146; s++; break;
                case 189: *(d++) = 197; *(d++) = 147; s++; break;
                case 190: *(d++) = 197; *(d++) = 184; s++; break;
                default:  *(d++) = 194; *(d++) = *(s++); break;
            } else {
                *(d++) = 195;
                *(d++) = *(s++) - 64;
            }
    }

    /* Done. Remember to free() the resulting string when no longer needed. */
    return result;
}

/* Create a dynamically allocated copy of string,
 * changing the encoding from UTF-8 to ISO-8859-15.
 * Unsupported code points are ignored.
*/
char *utf8_to_latin9(const char *const string)
{
    size_t         size = 0;
    size_t         used = 0;
    unsigned char *result = NULL;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s) {

            if (used >= size) {
                void *const old = result;

                size = (used | 255) + 257;
                result = realloc(result, size);
                if (!result) {
                    if (old)
                        free(old);
                    errno = ENOMEM;
                    return NULL;
                }
            }

            if (*s < 128) {
                result[used++] = *(s++);
                continue;

            } else
            if (s[0] == 226 && s[1] == 130 && s[2] == 172) {
                result[used++] = 164;
                s += 3;
                continue;

            } else
            if (s[0] == 194 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1];
                s += 2;
                continue;

            } else
            if (s[0] == 195 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1] + 64;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 160) {
                result[used++] = 166;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 161) {
                result[used++] = 168;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 189) {
                result[used++] = 180;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 190) {
                result[used++] = 184;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 146) {
                result[used++] = 188;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 147) {
                result[used++] = 189;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 184) {
                result[used++] = 190;
                s += 2;
                continue;

            }

            if (s[0] >= 192 && s[0] < 224 &&
                s[1] >= 128 && s[1] < 192) {
                s += 2;
                continue;
            } else
            if (s[0] >= 224 && s[0] < 240 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192) {
                s += 3;
                continue;
            } else
            if (s[0] >= 240 && s[0] < 248 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192) {
                s += 4;
                continue;
            } else
            if (s[0] >= 248 && s[0] < 252 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192) {
                s += 5;
                continue;
            } else
            if (s[0] >= 252 && s[0] < 254 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192 &&
                s[5] >= 128 && s[5] < 192) {
                s += 6;
                continue;
            }

            s++;
        }
    }

    {
        void *const old = result;

        size = (used | 7) + 1;

        result = realloc(result, size);
        if (!result) {
            if (old)
                free(old);
            errno = ENOMEM;
            return NULL;
        }

        memset(result + used, 0, size - used);
    }

    return (char *)result;
}

While iconv() is the correct solution for character set conversions in general, the two functions above are certainly useful in an embedded or otherwise constrained environment.

Answer 2:

It should be relatively easy to create a conversion table that maps the Latin-9 codes 128-255 to their UTF-8 byte sequences. You can even use iconv on another machine to do this. Alternatively, create a file containing the Latin-9 codes 128-255 and convert it to UTF-8 using an appropriate text editor. Then you can use this data to build the conversion table.
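
A table-driven converter along those lines might look like the sketch below. Everything named here (struct utf8_seq, latin9_table, latin9_table_init(), latin9_to_utf8_table()) is hypothetical. To keep the example short, the table is filled at startup from the eight known exceptions instead of being generated with iconv or an editor; on a real embedded target, a precomputed const table kept in ROM would probably be the better choice.

```c
#include <stddef.h>

/* One table entry: the UTF-8 byte sequence for a single Latin-9 byte. */
struct utf8_seq {
    unsigned char len;
    unsigned char bytes[3];
};

/* Entries for Latin-9 bytes 128..255; index is (byte - 128). */
static struct utf8_seq latin9_table[128];

/* Unicode code point for a Latin-9 byte in the range 128..255. */
static unsigned latin9_code_point(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC;
    case 0xA6: return 0x0160;
    case 0xA8: return 0x0161;
    case 0xB4: return 0x017D;
    case 0xB8: return 0x017E;
    case 0xBC: return 0x0152;
    case 0xBD: return 0x0153;
    case 0xBE: return 0x0178;
    default:   return c;        /* same as Latin-1 */
    }
}

/* Fill the table. Must be called once before any conversion. */
void latin9_table_init(void)
{
    unsigned i;

    for (i = 0; i < 128; i++) {
        const unsigned cp = latin9_code_point((unsigned char)(128 + i));
        struct utf8_seq *e = &latin9_table[i];

        if (cp < 0x800) {
            e->len = 2;
            e->bytes[0] = (unsigned char)(0xC0 | (cp >> 6));
            e->bytes[1] = (unsigned char)(0x80 | (cp & 0x3F));
            e->bytes[2] = 0;
        } else {
            e->len = 3;
            e->bytes[0] = (unsigned char)(0xE0 | (cp >> 12));
            e->bytes[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            e->bytes[2] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
}

/* Convert src (Latin-9) into dst (UTF-8) using the table.
 * Returns the number of bytes written, excluding the NUL, or (size_t)-1
 * if an argument is invalid or dst is too small. On failure dst may
 * hold a partial, unterminated result. */
size_t latin9_to_utf8_table(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t used = 0;

    if (!src || !dst || dst_size == 0)
        return (size_t)-1;

    for (; *s; s++) {
        if (*s < 128) {
            if (used + 1 >= dst_size)
                return (size_t)-1;
            dst[used++] = (char)*s;
        } else {
            const struct utf8_seq *e = &latin9_table[*s - 128];
            unsigned char k;

            if (used + e->len >= dst_size)
                return (size_t)-1;
            for (k = 0; k < e->len; k++)
                dst[used++] = (char)e->bytes[k];
        }
    }

    dst[used] = '\0';
    return used;
}
```

Unlike the functions in the first answer, this variant writes into a caller-provided buffer, which avoids malloc() entirely.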