I am reading data from a file that has, unfortunately, two types of character encoding.
There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.
The header is not fixed length and must be run through a parser to determine its content/length.
The file may also be quite large so I need to avoid bring the entire content into memory.
So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.
Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.
Unfortunately it appears, javadoc confirms this, that InputStreamReader may choose to read-ahead for effeciency purposes. So the reading of the header chews some/all of the body.
Does anyone have any suggestions for working round this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time but a good idea (possibly wrapped in a custom Reader implementation?)
Thanks in advance.
EDIT: My final solution was to write a InputStreamReader that has no buffering to ensure I can parse the header without chewing part of the body. Although this is not terribly efficient I wrap the raw InputStream with a BufferedInputStream so it won't be an issue.
// An InputStreamReader that only consumes as many bytes as is necessary
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
private final CharsetDecoder charsetDecoder;
private final InputStream inputStream;
private final ByteBuffer byteBuffer = ByteBuffer.allocate( 1 );
public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
{
this.inputStream = inputStream;
charsetDecoder = charset.newDecoder();
}
@Override
public int read() throws IOException
{
boolean middleOfReading = false;
while ( true )
{
int b = inputStream.read();
if ( b == -1 )
{
if ( middleOfReading )
throw new IOException( "Unexpected end of stream, byte truncated" );
return -1;
}
byteBuffer.clear();
byteBuffer.put( (byte)b );
byteBuffer.flip();
CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );
// although this is theoretically possible this would violate the unbuffered nature
// of this class so we throw an exception
if ( charBuffer.length() > 1 )
throw new IOException( "Decoded multiple characters from one byte!" );
if ( charBuffer.length() == 1 )
return charBuffer.get();
middleOfReading = true;
}
}
public int read( char[] cbuf, int off, int len ) throws IOException
{
for ( int i = 0; i < len; i++ )
{
int ch = read();
if ( ch == -1 )
return i == 0 ? -1 : i;
cbuf[ i ] = (char)ch;
}
return len;
}
public void close() throws IOException
{
inputStream.close();
}
}
If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside of InputStreamReader.
This way we don't have to rewrite the InputStreamReader logic.
To construct:
Here is the pseudo code.
InputStream
, but do not wrap aReader
around it.ByteArrayOutputStream
.ByteArrayInputStream
fromByteArrayOutputStream
and decode header, this time wrapByteArrayInputStream
intoReader
with ASCII charset.ByteArrayOutputStream
.ByteArrayInputStream
from the secondByteArrayOutputStream
and wrap it withReader
with charset from the header.Why don't you use 2
InputStream
s? One for reading the header and another for the body.The second
InputStream
shouldskip
the header bytes.It's even easier:
As you said, your header is always in ASCII. So read the header directly from the InputStream, and when you're done with it, create the Reader with the correct encoding and read from it
My first thought is to close the stream and reopen it, using
InputStream#skip
to skip past the header before giving the stream to the newInputStreamReader
.If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (since you can't assume you can reset the position with
reset
, it may not be supported).I suggest rereading the stream from the start with a new
InputStreamReader
. Perhaps assume thatInputStream.mark
is supported.