Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:
"some really long string".getBytes("UTF-8").length
I need to calculate a length prefix for potentially long serialized messages.
Here's an implementation based on the UTF-8 specification:
public class Utf8LenCounter {
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;        // U+0000..U+007F: 1 byte
            } else if (ch <= 0x7FF) {
                count += 2;     // U+0080..U+07FF: 2 bytes
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;     // surrogate pair (U+10000..U+10FFFF): 4 bytes
                ++i;            // skip the low surrogate of the pair
            } else {
                count += 3;     // rest of the BMP: 3 bytes
            }
        }
        return count;
    }
}
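This implementation is not tolerant of malformed strings: an unpaired high surrogate is counted as four bytes and swallows the char after it, and an unpaired low surrogate is counted as three bytes.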
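If malformed input is a concern, here is a minimal sketch of a lenient variant of the counter above. It assumes you want the same result as String.getBytes with UTF-8, which (as far as I know) substitutes a single '?' byte for each unpaired surrogate. The class name LenientUtf8Length is made up for illustration and is not part of any library.

```java
final class LenientUtf8Length {
    private LenientUtf8Length() {}

    // Counts UTF-8 bytes without encoding. Assumption: each unpaired
    // surrogate counts as 1 byte, mirroring the '?' replacement that
    // String.getBytes(UTF_8) performs.
    static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;                 // 1-byte range
            } else if (ch <= 0x7FF) {
                count += 2;              // 2-byte range
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                count += 4;              // well-formed surrogate pair
                i++;                     // skip the low surrogate
            } else if (Character.isSurrogate(ch)) {
                count++;                 // unpaired surrogate, counted as '?'
            } else {
                count += 3;              // rest of the BMP
            }
        }
        return count;
    }
}
```

If you would rather reject malformed input, throw an exception in the unpaired-surrogate branch instead of counting a byte.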
Here's a JUnit 4 test for verification:
import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
    @Test public void testUtf8Len() {
        Charset utf8 = Charset.forName("UTF-8");
        AllCodepointsIterator iterator = new AllCodepointsIterator();
        while (iterator.hasNext()) {
            String test = new String(Character.toChars(iterator.next()));
            Assert.assertEquals(test.getBytes(utf8).length,
                                Utf8LenCounter.length(test));
        }
    }

    // Walks every defined code point, skipping the surrogate range
    private static class AllCodepointsIterator {
        private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
        private static final int SURROGATE_FIRST = 0xD800;
        private static final int SURROGATE_LAST = 0xDFFF;
        private int codepoint = 0;
        public boolean hasNext() { return codepoint < MAX; }
        public int next() {
            int ret = codepoint;
            codepoint = next(codepoint);
            return ret;
        }
        private int next(int codepoint) {
            while (codepoint++ < MAX) {
                if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
                if (!Character.isDefined(codepoint)) { continue; }
                return codepoint;
            }
            return MAX;
        }
    }
}
Please excuse the compact formatting.
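Using Guava's Utf8 (com.google.common.base.Utf8), which computes the length without encoding anything and throws IllegalArgumentException on unpaired surrogates: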
Utf8.encodedLength("some really long string")
The best method I could come up with is to use CharsetEncoder to write repeatedly into the same temporary buffer:
public int getEncodedLength(CharBuffer src, CharsetEncoder encoder)
        throws CharacterCodingException
{
    // CharsetEncoder.flush fails if encode is not called with >0 chars
    if (!src.hasRemaining())
        return 0;

    // encode into a byte buffer that is repeatedly overwritten
    final ByteBuffer outputBuffer = ByteBuffer.allocate(1024);

    // encoding loop
    int bytes = 0;
    CoderResult status;
    do
    {
        status = encoder.encode(src, outputBuffer, true);
        if (status.isError())
            status.throwException();
        bytes += outputBuffer.position();
        outputBuffer.clear();
    }
    while (status.isOverflow());

    // flush any remaining buffered state
    status = encoder.flush(outputBuffer);
    if (status.isError() || status.isOverflow())
        status.throwException();
    bytes += outputBuffer.position();
    return bytes;
}

public int getUtf8Length(String str) throws CharacterCodingException
{
    return getEncodedLength(CharBuffer.wrap(str),
            Charset.forName("UTF-8").newEncoder());
}
You can loop through the String:
/**
 * Deprecated: only handles chars below U+0800. Three-byte characters
 * and surrogate pairs are not supported.
 */
@Deprecated
public int countUTF8Length(String str)
{
    int count = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        char c = str.charAt(i);
        if (c < 0x80)
        {
            count++;
        } else if (c < 0x800)
        {
            count += 2;
        } else
        {
            throw new UnsupportedOperationException("not implemented yet");
        }
    }
    return count;
}