I have a program that may need to process large files, possibly containing multi-byte encodings. My current code has the problem that it creates a memory structure to hold the entire file, which can cause an out-of-memory error if the file is large:
import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();   // note: the cast truncates for files over 2 GB
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
// decode() materializes the whole file as a single CharBuffer; this is the memory problem
CharBuffer cb = decoder.decode( bufferFile );
// process character buffer
fc.close();
The problem is that if I chop up the file byte contents using a smaller buffer and feed it piecemeal to the decoder, then the buffer could end in the middle of a multi-byte sequence. How should I cope with this problem?
It is as easy as using a Reader.

A CharsetDecoder is indeed the underlying mechanism that allows bytes to be decoded into chars; a Reader wraps one for you, so the byte-to-char boundary problem is handled internally and a read never hands you half of a multi-byte sequence.

The lesser-known fact is that most (but not all; see below) default decoders in the JDK (such as those used by a FileReader, for instance, or by an InputStreamReader constructed with only a charset) have a policy of CodingErrorAction.REPLACE. The effect is to replace any invalid byte sequence in the input with the Unicode replacement character (yes, that infamous �).

Now, if you are concerned about the ability of "bad characters" to slip in, you can instead select a policy of REPORT. You can do that when reading a file, too; this has the effect of throwing a MalformedInputException on any malformed byte sequence.
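A minimal sketch of what that can look like, assuming Java 7 or later (the 8,192-char buffer size and the reuse of the question's file variable are just illustrative):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Decoder that throws MalformedInputException instead of substituting U+FFFD
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

try (Reader reader = new InputStreamReader(new FileInputStream(file), decoder)) {
    char[] buf = new char[8192];
    int n;
    while ((n = reader.read(buf)) != -1) {
        // buf[0..n) always holds whole characters; the reader buffers any
        // partial byte sequence internally until the rest of it arrives
    }
}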
ONE EXCEPTION to that default replace action appears in Java 8: Files.newBufferedReader(somePath) will always try to read in UTF-8, and with a default action of REPORT.
Open and read the file as a text file, so the file reader does the separation into characters for you. If the file has lines, just read it line by line. If it isn't split into lines, read it in blocks of 1,000 (or however many) characters. Let the file library deal with the low-level work of converting the UTF multi-byte sequences into characters. A line-by-line sketch follows below.
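For instance, a line-oriented version might look like this (a rough sketch: the path is a placeholder, processLine stands in for whatever per-line handling you need, and it assumes the file is UTF-8):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-file.txt"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        processLine(line);  // hypothetical handler; only one line is held in memory at a time
    }
}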
@fge, I didn't know about the report option; cool. @Tyler, the trick, I think, is using BufferedReader's read(char[], int, int) method. See its javadoc here: https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#read%28char[],%20int,%20int%29
In an example run of the approach sketched below, the output happened to end with the last '7' characters; you can adjust the buffer array size to process whatever "chunk" size you want. This is just an example to show that you won't have to worry about getting stuck somewhere "mid-byte" in a multi-byte UTF-8 character.
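Something along these lines is what I mean (a sketch only; the file name is a placeholder and the 1,000-char buffer is arbitrary):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
    char[] chunk = new char[1000];   // adjust to whatever "chunk" size you want
    int read;
    while ((read = reader.read(chunk, 0, chunk.length)) != -1) {
        // 'read' whole characters were decoded; a chunk never ends mid-sequence
        System.out.println("read " + read + " chars: " + new String(chunk, 0, read));
    }
}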