Is it possible to use ReadList to read UTF-8 (or any other) encoded text files using ReadList[..., Word], or is it ASCII-only? If it's ASCII-only, is it possible to "fix" the encoding of the already read data with good performance (i.e. preserving the performance advantages of ReadList over Import)?
Import[..., CharacterEncoding -> "UTF8"] works but it's quite a bit slower than ReadList. $CharacterEncoding has no effect on ReadList.
Download a sample UTF-8 encoded file here.
For testing performance on a large input, see the test file in this question.
Here are the timings of the answers on a large-ish text file:
Import
In[2]:= Timing[
data = Import[file, "Text"];
]
Out[2]= {5.234, Null}
Heike
In[4]:= Timing[
data = ReadList[file, String];
FromCharacterCode[ToCharacterCode[data], "UTF8"];
]
Out[4]= {4.328, Null}
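Since the question is about ReadList[..., Word] specifically, the same re-decoding step can also be applied to a list of words instead of a list of lines. This is only a minimal sketch and it was not part of the timings above. It assumes, as observed in the question, that ReadList returns one character per byte regardless of $CharacterEncoding. The temporary file name and the sample text are made up for illustration.

```mathematica
(* Hypothetical sample file, written as raw UTF-8 bytes so the example is self-contained *)
file = FileNameJoin[{$TemporaryDirectory, "utf8-words-sample.txt"}];
stream = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[stream,
  ToCharacterCode["na\[IDoubleDot]ve caf\[EAcute]\n\[UDoubleDot]ber stra\[SZ]e", "UTF8"]];
Close[stream];

(* Assumption: each byte comes back as one character with code 0-255,
   so multi-byte UTF-8 sequences show up as garbled characters *)
raw = ReadList[file, Word];

(* Turn every word back into its byte values and decode those bytes as UTF-8;
   both functions accept a list of words at once *)
words = FromCharacterCode[ToCharacterCode[raw], "UTF8"]
```

If ReadList already decodes the file on a given version, ToCharacterCode would return codes above 255 and the second step would not apply, so it is worth checking raw on a small file first.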
Mr. Wizard
In[5]:= Timing[
string = FromCharacterCode[BinaryReadList[file], "UTF-8"];
]
Out[5]= {2.281, Null}
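Note that the BinaryReadList variant produces one long string rather than a list of words. If a word list comparable to ReadList[..., Word] is needed, one option is to split the decoded string on whitespace afterwards. The sketch below was not benchmarked here, so the extra StringSplit step may eat into the timing advantage. The temporary file name and sample text are hypothetical.

```mathematica
(* Hypothetical sample file, written as raw UTF-8 bytes so the example is self-contained *)
file = FileNameJoin[{$TemporaryDirectory, "utf8-split-sample.txt"}];
stream = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[stream,
  ToCharacterCode["na\[IDoubleDot]ve caf\[EAcute]\n\[UDoubleDot]ber stra\[SZ]e", "UTF8"]];
Close[stream];

(* Read all bytes at once and decode them as UTF-8 into a single string *)
string = FromCharacterCode[BinaryReadList[file], "UTF8"];

(* With no pattern argument, StringSplit splits at runs of whitespace,
   which includes spaces, tabs and newlines *)
words = StringSplit[string]
```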
If I leave out Word, this works:
This, however, is a failure, because the data is not read as strings.
Please try this on a larger file and report its performance:
This seems to work
The timings I get for the linked test file are