How can I read a UTF-8 encoded text file in Mathematica?
This is what I'm doing now:
text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];
but it tells me that
$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"
and so on. I am not sure why, since I believe the file is valid UTF-8.
Here's the file I'm trying to read:
http://dl.dropbox.com/u/38623/charData.txt
Short version: Mathematica's UTF-8 support does not work for characters whose codes need more than 16 bits (that is, code points above U+FFFF). Use UTF-16 encoding instead, if possible, but be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...
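As a side note on why the error message mentions the byte 240: in UTF-8, any code point above U+FFFF is encoded as four bytes, and the lead byte of such a sequence is always in the range 240 to 244. The sketch below works this out by hand for U+20B9B, used here only as an example of such a code point. It uses plain bit arithmetic rather than Mathematica's own encoding functions, and the variable names (cp, utf8Bytes, offset, utf16Units) are just illustrative.

```mathematica
(* Example code point above U+FFFF; 16^^ is Mathematica's base-16 notation *)
cp = 16^^20B9B;

(* Four-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
utf8Bytes = {
  BitOr[16^^F0, BitShiftRight[cp, 18]],
  BitOr[16^^80, BitAnd[BitShiftRight[cp, 12], 16^^3F]],
  BitOr[16^^80, BitAnd[BitShiftRight[cp, 6], 16^^3F]],
  BitOr[16^^80, BitAnd[cp, 16^^3F]]
}
(* {240, 160, 174, 155} *)

(* The same code point as a UTF-16 surrogate pair *)
offset = cp - 16^^10000;
utf16Units = {
  16^^D800 + BitShiftRight[offset, 10],
  16^^DC00 + BitAnd[offset, 16^^3FF]
}
(* {55362, 57243}, i.e. D842 DF9B in hexadecimal *)
```

So a lead byte of 240 in a file is consistent with valid UTF-8 that contains a character outside the 16-bit range, which is exactly the case described in the short version.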
As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B (