I have a file which is described under Unix as:
$file xxx.csv
xxx.csv: UTF-8 Unicode text, with very long lines
Viewing it in less/vi will render some special chars (ßÄ°...) unreadable (├╝); Windows will also not display it properly; importing it directly into a DB just changes the special characters to some other special characters (+ä, +ñ, ...).
I now wanted to convert it to a "default readable" encoding with iconv. When I try to convert it:
$iconv -f UTF-8 -t ISO-8859-1 xxx.csv > yyy.csv
iconv: illegal input sequence at position 1234
Using UNICODE as the input encoding and UTF-8 as the output returns the same message.
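That is, something like the following fails with the same error:
$iconv -f UNICODE -t UTF-8 xxx.csv > yyy.csv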
I am guessing the file is encoded in some other format which I do not know. How can I find out which format it is, so that I can convert it to something "universally" readable?
If you are not sure about the encoding of the file you are dealing with, you can find it as follows:
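For example, with the file name from the question:
$file xxx.csv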
The above command will tell you the file's encoding, and iconv can then be used accordingly. For example, if the file format is UTF-16 and you want to convert it to UTF-8, the following can be used.
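Reusing the file names from the question:
$iconv -f UTF-16 -t UTF-8 xxx.csv > yyy.csv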
Hope this gives some additional insight into what you are looking for.
The problem was that Windows could not interpret the file as UTF-8 on its own. It reads it as ASCII, and then ä becomes a two-character interpretation, Ã¤ (bytes 195 164).
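In a UTF-8 terminal you can see those two bytes directly:
$printf 'ä' | od -An -tu1
195 164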
Trying to convert it, I found a solution that works for me:
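Something along these lines (the exact target encoding is an assumption here; //TRANSLIT approximates characters the target cannot represent, as the answer below explains):
$iconv -f UTF-8 -t ISO-8859-1//TRANSLIT xxx.csv > yyy.csv  # target encoding is an assumption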
Now I can view the special chars correctly in editors.
For SQL Server compatibility, converting UTF-8 to UTF-16 works even better; the file size just grows quite a bit.
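Again with the file names from the question:
$iconv -f UTF-8 -t UTF-16 xxx.csv > yyy.csv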
Converting from UTF-8 to ISO-8859-1 only works if your UTF-8 text only has characters that can be represented in ISO-8859-1. If this is not the case, you should specify what needs to happen to these characters, either ignoring (//IGNORE) or approximating (//TRANSLIT) them. Try one of these two:
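Based on the command from the question:
$iconv -f UTF-8 -t ISO-8859-1//IGNORE xxx.csv > yyy.csv
$iconv -f UTF-8 -t ISO-8859-1//TRANSLIT xxx.csv > yyy.csv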
In most cases, I guess approximation is the best solution, mapping e.g. accented characters to their unaccented counterparts, the euro sign to EUR, etc...