My software receives some strings in UTF-8 that I need to convert to ISO-8859-1. I know that the UTF-8 domain is larger than that of ISO-8859-1, but the UTF-8 data was previously upconverted from ISO-8859-1, so I should not lose anything.
I would like to know if there is an easy / direct way to convert from UTF-8 to ISO-8859-1.
Thanks
Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including EURO, which ISO-8859-1 does not have), but also works correctly for the UTF-8 -> ISO-8859-1 conversion part of an ISO-8859-1 -> UTF-8 -> ISO-8859-1 round-trip.
The function ignores invalid code points, similar to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it won't turn U+006E U+0303 into U+00F1. I don't bother recomposing because iconv does not either.
The function is very careful about the string access. It will never scan beyond the buffer. The output buffer must be one byte longer than length, because it always appends the end-of-string NUL byte. The function returns the number of characters (bytes) in output, not including the end-of-string NUL byte.
Note that you can add custom transliteration for specific code points in the to_latin9() function, but you are limited to one-character replacements.
As it is currently written, the function can do in-place conversion safely: input and output pointers can be the same. The output string will never be longer than the input string. If your input string has room for an extra byte (for example, it has the NUL terminating the string), you can safely use this function to convert it from UTF-8 to ISO-8859-1/15. I deliberately wrote it this way, because it should save you some effort in an embedded environment, although this approach is a bit limited with respect to customization and extension.
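For reference, here is a minimal sketch of how a function with the behaviour described above could look. It is written from the description only, so treat the details as assumptions: the parameter order (output, input, length), the use of 256 as the "no mapping" value in to_latin9(), and the rejection of overlong encodings are choices made for this sketch.

```c
#include <stddef.h>

/* Map a Unicode code point to an ISO-8859-15 byte value.
   Returns 256 when there is no single-byte equivalent (sketch convention). */
static unsigned int to_latin9(unsigned int code)
{
    if (code < 256U)
        return code;            /* passed through, so Latin-1 round-trips work */
    switch (code) {
    case 0x0152U: return 0xBCU; /* OE ligature */
    case 0x0153U: return 0xBDU; /* oe ligature */
    case 0x0160U: return 0xA6U; /* S with caron */
    case 0x0161U: return 0xA8U; /* s with caron */
    case 0x0178U: return 0xBEU; /* Y with diaeresis */
    case 0x017DU: return 0xB4U; /* Z with caron */
    case 0x017EU: return 0xB8U; /* z with caron */
    case 0x20ACU: return 0xA4U; /* euro sign */
    default:      return 256U;  /* add custom one-byte transliterations above */
    }
}

/* Convert `length` bytes of UTF-8 at `input` to Latin-9 at `output`.
   `output` needs room for length + 1 bytes and may be the same as `input`.
   Returns the number of bytes written, not counting the final NUL. */
size_t utf8_to_latin9(char *output, const char *input, size_t length)
{
    static const unsigned int min_cp[4] = { 0U, 0x80U, 0x800U, 0x10000U };
    unsigned char *out = (unsigned char *)output;
    const unsigned char *in = (const unsigned char *)input;
    const unsigned char *end = in + length;

    while (in < end) {
        unsigned int c = *in;
        size_t extra;           /* continuation bytes expected after the lead byte */
        size_t i;

        if (c < 0x80U)      { extra = 0; }
        else if (c < 0xC0U) { in++; continue; }          /* stray continuation byte */
        else if (c < 0xE0U) { extra = 1; c &= 0x1FU; }
        else if (c < 0xF0U) { extra = 2; c &= 0x0FU; }
        else if (c < 0xF8U) { extra = 3; c &= 0x07U; }
        else                { in++; continue; }          /* never valid in UTF-8 */

        if ((size_t)(end - in) <= extra)
            break;              /* sequence truncated by the end of the buffer */
        in++;
        for (i = 0; i < extra && (in[i] & 0xC0U) == 0x80U; i++)
            c = (c << 6) | (in[i] & 0x3FU);
        in += i;
        if (i < extra || c < min_cp[extra])
            continue;           /* malformed or overlong sequence: ignore it */

        c = to_latin9(c);
        if (c < 256U)
            *out++ = (unsigned char)c;   /* unmappable code points are dropped */
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)output);
}
```

For an in-place conversion, pass the same buffer as both output and input, for example utf8_to_latin9(buf, buf, strlen(buf)). This is safe in the sketch because every byte is read before the (never longer) output overwrites it.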
Edit:
I included a pair of conversion functions in an edit to this answer for both Latin-1/9 to/from UTF-8 conversion (ISO-8859-1 or -15 to/from UTF-8); the main difference is that those functions return a dynamically allocated copy, and keep the original string intact.
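To illustrate the dynamically allocating style, here is a rough sketch of the upconversion direction (ISO-8859-1 to UTF-8). The name latin1_to_utf8_dup() is hypothetical and this is not the code from the edit mentioned above; it only shows the idea of returning a fresh copy and leaving the original string untouched.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc()ed UTF-8 copy of a NUL-terminated
   ISO-8859-1 string, or NULL if allocation fails. The caller frees it. */
char *latin1_to_utf8_dup(const char *src)
{
    size_t len = strlen(src);
    /* Each Latin-1 byte becomes at most two UTF-8 bytes, plus the NUL. */
    unsigned char *dst = malloc(2 * len + 1);
    unsigned char *out = dst;
    const unsigned char *in;

    if (dst == NULL)
        return NULL;

    for (in = (const unsigned char *)src; *in != '\0'; in++) {
        if (*in < 0x80U) {
            *out++ = *in;                                   /* ASCII: copy as-is */
        } else {
            *out++ = (unsigned char)(0xC0U | (*in >> 6));   /* lead byte 110000xx */
            *out++ = (unsigned char)(0x80U | (*in & 0x3FU)); /* continuation 10xxxxxx */
        }
    }
    *out = '\0';
    return (char *)dst;
}
```

Because ISO-8859-1 byte values equal their Unicode code points, no lookup table is needed in this direction. An ISO-8859-15 variant would need to remap the eight positions that differ before encoding.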
tocode is "ISO_8859-1" and fromcode is "UTF-8".
Working example:
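The original listing is not reproduced here, so the following is a minimal sketch of the iconv approach using the tocode and fromcode values given above. The wrapper name utf8_to_latin1_iconv() and its buffer-based interface are assumptions made for illustration; iconv_open(), iconv() and iconv_close() are the standard POSIX calls.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert inlen bytes of UTF-8 into out (outsize bytes).
   Returns the number of bytes written (excluding the NUL), or (size_t)-1
   on any error, including input that ISO-8859-1 cannot represent. */
size_t utf8_to_latin1_iconv(const char *in, size_t inlen,
                            char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes char **, it does not modify the input */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;      /* reserve one byte for the terminating NUL */

    cd = iconv_open("ISO_8859-1", "UTF-8");   /* arguments are (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;      /* errno is EILSEQ, EINVAL or E2BIG */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

Note the argument order of iconv_open(): the target encoding comes first. On systems where iconv lives in a separate library, linking with -liconv may be needed.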