Java - Fastest way to check the size of String

I have the following code inside a loop.
In the loop, strings are appended to sb (a StringBuilder), and I check whether the size of sb has reached 5 MB.

if (sb.toString().getBytes("UTF-8").length >= 5242880) {
    // Do something
}

This works, but the size check is very slow.
What would be the fastest way to do this?

Tags: java utf-8 java-8

3 Answers

家丑人穷心不美

Reply #2 · 2019-03-20 04:52

If you loop 1000 times, you will generate 1000 Strings and convert each one into a UTF-8 byte array just to get its length.

I would reduce the conversion work by computing the initial length once. Then, on each iteration, get the length of the appended value only, so updating the total is just an addition.

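// Encode the existing contents once, before the loop
int length = sb.toString().getBytes("UTF-8").length;
for (String s : list) {
    sb.append(s);
    // Only the newly appended fragment is encoded here
    length += s.getBytes("UTF-8").length;
    if(...){
    ...
    }
}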

This reduces both the memory used and the conversion cost, because only the new fragment is encoded on each iteration.

放荡不羁爱自由

Reply #3 · 2019-03-20 05:12

Consider using a ByteArrayOutputStream and an OutputStreamWriter instead of the StringBuilder. Use ByteArrayOutputStream.size() to test the size.
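
A minimal sketch of that idea follows. The class name ByteBufferSketch, the method name countThresholdHits, and the choice to reset the buffer once the threshold is reached are assumptions made for illustration, not part of the original suggestion. Note that OutputStreamWriter may buffer encoded bytes internally, so the sketch flushes the writer before reading size().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ByteBufferSketch {
    // Hypothetical helper: writes each fragment as UTF-8 and returns how
    // many times the accumulated byte count reached the threshold.
    static int countThresholdHits(List<String> fragments, int threshold) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
        int hits = 0;
        try {
            for (String s : fragments) {
                writer.write(s);
                // Push any bytes still held by the writer into the stream,
                // otherwise size() may lag behind what was written.
                writer.flush();
                if (bytes.size() >= threshold) {
                    // Do something with bytes.toByteArray(), then start over
                    bytes.reset();
                    hits++;
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

With this approach the data is already encoded when the threshold is reached, which may save a second conversion if the bytes are what you need in the end. Flushing after every fragment has its own cost, so whether it beats a counter is something to measure.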


看我几分像从前

Reply #4 · 2019-03-20 05:17

You can calculate the UTF-8 length quickly using

public static int utf8Length(CharSequence cs) {
    return cs.codePoints()
        .map(cp -> cp<=0x7ff? cp<=0x7f? 1: 2: cp<=0xffff? 3: 4)
        .sum();
}

If ASCII characters dominate the contents, it might be slightly faster to use

public static int utf8Length(CharSequence cs) {
    return cs.length()
         + cs.codePoints().filter(cp -> cp>0x7f).map(cp -> cp<=0x7ff? 1: 2).sum();
}

instead.
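
As a sanity check, both variants can be compared against the encoder for well-formed strings. This is only a sketch: the class name Utf8LengthCheck, the name utf8LengthAsciiBiased for the second variant, and the helper agreesWithEncoder are assumptions introduced here so that both methods can live in one class.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthCheck {
    // First variant: 1 to 4 bytes per code point
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
            .sum();
    }

    // Second variant: start from the char count (one byte per char),
    // then add the extra bytes needed by non-ASCII code points
    static int utf8LengthAsciiBiased(CharSequence cs) {
        return cs.length()
            + cs.codePoints().filter(cp -> cp > 0x7f).map(cp -> cp <= 0x7ff ? 1 : 2).sum();
    }

    // True if both variants match the length of the actual UTF-8 encoding
    static boolean agreesWithEncoder(String s) {
        int expected = s.getBytes(StandardCharsets.UTF_8).length;
        return utf8Length(s) == expected && utf8LengthAsciiBiased(s) == expected;
    }
}
```

For example, agreesWithEncoder("a\u00e9\u20ac\uD83D\uDE00") compares both results with the 10 bytes (1 + 2 + 3 + 4) the encoder produces for that string. Strings containing unpaired surrogates are a different matter, since getBytes substitutes a replacement byte for them.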

But you may also consider not recalculating the entire size on each iteration, and instead calculating only the size of the new fragment you’re appending to the StringBuilder, something like:

    StringBuilder sb = new StringBuilder();
    int length = 0;
    for(…; …; …) {
        String s = … //calculateNextString();
        sb.append(s);
        length += utf8Length(s);
        if(length >= 5242880) {
            // Do something

            // in case you're flushing the data:
            sb.setLength(0);
            length = 0;
        }
    }

This assumes that if you’re appending fragments containing surrogate pairs, they are always complete and not split into their halves. For ordinary applications, this should always be the case.
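
To illustrate why that assumption matters, here is a small sketch that sums per-fragment lengths the way the loop does. The class name SplitSurrogateDemo and the helper sumOfParts are assumptions made for this example; utf8Length is repeated from the answer so the block is self-contained.

```java
final class SplitSurrogateDemo {
    // Same calculation as utf8Length in this answer
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
            .sum();
    }

    // Adds up the per-fragment lengths, as the incremental loop does
    static int sumOfParts(String... parts) {
        int total = 0;
        for (String p : parts) {
            total += utf8Length(p);
        }
        return total;
    }
}
```

Called with the complete pair, sumOfParts("\uD83D\uDE00") yields 4. Called with the two halves as separate fragments, sumOfParts("\uD83D", "\uDE00") yields 6, because each lone surrogate is counted as a three-byte code point. The running total then overestimates the real size by two bytes per split pair.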

An additional possibility, suggested by Didier-L, is to postpone the calculation until the length of your StringBuilder reaches the threshold divided by three. Each char encodes to at most three UTF-8 bytes, so before that point it is impossible for the UTF-8 length to reach the threshold. However, this is only beneficial if some executions never reach threshold / 3.
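
A sketch of how such a guard might look is shown below. The class name DeferredSizeCheck and the method reachedThreshold are assumptions for illustration, and the full count is recomputed over the whole StringBuilder once the guard no longer applies; it could equally be combined with the incremental counter.

```java
final class DeferredSizeCheck {
    // Same calculation as utf8Length in this answer
    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
            .sum();
    }

    // Returns true if the UTF-8 size of sb has reached threshold bytes
    static boolean reachedThreshold(StringBuilder sb, int threshold) {
        // A char contributes at most 3 UTF-8 bytes (a surrogate pair is
        // 2 chars for 4 bytes), so 3 * length is an upper bound on the size.
        if ((long) sb.length() * 3 < threshold) {
            return false; // cannot have reached the threshold yet
        }
        return utf8Length(sb) >= threshold;
    }
}
```

The cheap length comparison skips the code point scan entirely while the builder is short; the scan only runs once the upper bound could reach the threshold.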
