How to cut a String into 1 megabyte subString with

Posted 2020-07-24 05:53

I have come up with the following:

public static void cutString(String s) {
    List<String> strings = new ArrayList<>();
    int index = 0;
    while (index < s.length()) {
        strings.add(s.substring(index, Math.min(index + 1048576, s.length())));
        index += 1048576;
    }
}

My problem is that in UTF-8 some characters take more than 1 byte, so using 1048576 as the cut position counts characters, not bytes, and the resulting pieces can be larger than 1 MB. I was thinking about using an Iterator, but that doesn't seem efficient. What would be the most efficient solution for this? A piece may be smaller than 1 MB to avoid slicing a character in half, just never bigger than that!

Tags: java
2 answers
一夜七次
#2 · 2020-07-24 06:17

Quick, unsafe hack

You can use s.getBytes("UTF-8") to get an array with the actual bytes used by each UTF-8 character. Like this:

System.out.println("¡Adiós!".getBytes("UTF-8").length);
// Prints: 9

Once you have that, it's just a matter of splitting the byte array into chunks of length 1048576 and then turning the chunks back into strings with new String(chunk, "UTF-8").

However, doing it like that can break multi-byte characters at the beginning or end of a chunk. Say the 1048576th byte is the first byte of a 3-byte character: that byte would go into the first chunk and the other two bytes would end up in the second chunk, so neither chunk decodes correctly.
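To make the failure concrete, here is a minimal sketch of what happens when the byte array is cut in the middle of a character. The class and method names (NaiveByteSplitDemo, firstBytes) are made up for this illustration, and the string literals use Unicode escapes only so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NaiveByteSplitDemo {
    // Decodes the first `limit` bytes of the UTF-8 form of s,
    // ignoring character boundaries (this is the unsafe part).
    static String firstBytes(String s, int limit) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        byte[] head = Arrays.copyOfRange(bytes, 0, Math.min(limit, bytes.length));
        return new String(head, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Adi\u00F3s"; // "Adiós": 5 chars, 6 bytes, the o-acute takes 2 bytes
        // Cutting after 4 bytes keeps only the first half of the o-acute,
        // which the decoder replaces with U+FFFD (the replacement character).
        System.out.println(firstBytes(word, 4).endsWith("\uFFFD")); // true
        // Cutting after 5 bytes lands on a character boundary, so nothing breaks.
        System.out.println(firstBytes(word, 5).equals("Adi\u00F3")); // true
    }
}
```

The damaged character is silently replaced rather than reported, which is why this kind of bug tends to go unnoticed until someone reads the output.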

Proper approach

If you can relax the "exactly 1 MB" requirement, you can take a safer approach: split the string into chunks of 1048576 characters (not bytes), then check each chunk's real size with getBytes and remove chars from the end until the encoded size is at most 1 MB.

Here's an implementation that won't break characters, at the expense of some chunks being smaller than the given size:

import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

public static List<String> cutString(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
    List<String> strings = new ArrayList<>();
    final int end = original.length();
    int from = 0, to = 0;
    do {
        to = Math.min(to + chunkSize, end); // next chunk, watch out for small strings
        String chunk = original.substring(from, to); // get chunk
        while (chunk.getBytes(encoding).length > chunkSize) { // adjust chunk to proper byte size if necessary
            chunk = original.substring(from, --to);
        }
        // never end a chunk between the two halves of a surrogate pair (e.g. an emoji)
        if (to < end && to - from > 1 && Character.isHighSurrogate(original.charAt(to - 1))) {
            chunk = original.substring(from, --to);
        }
        strings.add(chunk); // add chunk to collection
        from = to; // next chunk
    } while (to < end);
    return strings;
}

I tested it with chunkSize = 24 so you could see the effect. It should work as well with any other size:

    String test = "En la fase de maquetación de un documento o una página web o para probar un tipo de letra es necesario visualizar el aspecto del diseño. ٩(-̮̮̃-̃)۶ ٩(●̮̮̃•̃)۶ ٩(͡๏̯͡๏)۶ ٩(-̮̮̃•̃).";

    for (String chunk : cutString(test, 24, "UTF-8")) {
        System.out.println(String.format(
                "Chunk [%s] - Chars: %d - Bytes: %d",
                chunk, chunk.length(), chunk.getBytes("UTF-8").length));
    }
    /*
    Prints:
        Chunk [En la fase de maquetaci] - Chars: 23 - Bytes: 23
        Chunk [ón de un documento o un] - Chars: 23 - Bytes: 24
        Chunk [a página web o para pro] - Chars: 23 - Bytes: 24
        Chunk [bar un tipo de letra es ] - Chars: 24 - Bytes: 24
        Chunk [necesario visualizar el ] - Chars: 24 - Bytes: 24
        Chunk [aspecto del diseño. ٩(] - Chars: 22 - Bytes: 24
        Chunk [-̮̮̃-̃)۶ ٩(●̮̮] - Chars: 14 - Bytes: 24
        Chunk [̃•̃)۶ ٩(͡๏̯͡] - Chars: 12 - Bytes: 23
        Chunk [๏)۶ ٩(-̮̮̃•̃).] - Chars: 14 - Bytes: 24
     */

Another test with a 3 MB string like the one you mention in your comments:

    String string = "0123456789ABCDEF";
    StringBuilder bigAssString = new StringBuilder(1024*1024*3);
    for (int i = 0; i < ((1024*1024*3)/16); i++) {
        bigAssString.append(string);
    }
    System.out.println("bigAssString.length = " + bigAssString.toString().length());
    bigAssString.replace((1024*1024*3)/4, ((1024*1024*3)/4)+1, "á");

    for (String chunk : cutString(bigAssString.toString(), 1024*1024, "UTF-8")) {
        System.out.println(String.format(
                "Chunk [...] - Chars: %d - Bytes: %d",
                chunk.length(), chunk.getBytes("UTF-8").length));
    }
    /*
    Prints:
        bigAssString.length = 3145728
        Chunk [...] - Chars: 1048575 - Bytes: 1048576
        Chunk [...] - Chars: 1048576 - Bytes: 1048576
        Chunk [...] - Chars: 1048576 - Bytes: 1048576
        Chunk [...] - Chars: 1 - Bytes: 1
     */
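Since the question asks about efficiency: the loop above re-encodes the whole chunk every time it drops a char, which can get slow when a chunk contains many multi-byte characters. A possible alternative, sketched here and not part of the tested code above, is to walk the string once and add up how many bytes each code point needs in UTF-8. The names Utf8Chunker, utf8Length and split are invented for this sketch, and it assumes UTF-8 only and a limit of at least 4 bytes (the largest single character).

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {
    // Number of bytes UTF-8 needs for one Unicode code point.
    // An unpaired surrogate is counted as 3 bytes here; Java's encoder
    // writes it as a 1-byte '?', so this can only overestimate.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits s into pieces whose UTF-8 size is at most maxBytes,
    // cutting only between code points. Single pass, no re-encoding.
    static List<String> split(String s, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0; // char index where the current chunk begins
        int bytes = 0; // UTF-8 size of the current chunk so far
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) { // this code point no longer fits
                chunks.add(s.substring(start, i));
                start = i;
                bytes = 0;
            }
            bytes += len;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        if (start < s.length()) {
            chunks.add(s.substring(start)); // remaining tail
        }
        return chunks;
    }
}
```

Calling Utf8Chunker.split(text, 1024 * 1024) would be the equivalent of the 1 MB case. Unlike cutString, this sketch returns an empty list for an empty string instead of a single empty chunk.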
三岁会撩人
#3 · 2020-07-24 06:23

You can use a ByteArrayOutputStream with an OutputStreamWriter to get the encoded bytes:

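    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Writer w = new OutputStreamWriter(out, "UTF-8");
    // write everything to the writer
    w.write(myString);
    w.flush(); // push buffered characters into the byte array before reading it
    byte[] bytes = out.toByteArray();
    // bytes.length is the actual encoded size of the string, so you can now parcel it by MB.
    // Be aware that a chunk boundary may fall inside a multi-byte character and split it in two.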
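One way to avoid the split-character problem when parcelling the byte array is to use the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx: if the byte just after a planned cut is a continuation byte, the cut is inside a character and can be moved back to that character's first byte. The following is only a sketch of that idea; Utf8ByteSplitter, isContinuation and split are names invented here, and it assumes the bytes are valid UTF-8 (which String.getBytes produces) and a limit of at least 4 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8ByteSplitter {
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Encodes s once, then cuts the byte array into pieces of at most
    // maxBytes bytes, moving each cut back to a character boundary.
    static List<String> split(String s, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (from < bytes.length) {
            int to = Math.min(from + maxBytes, bytes.length); // exclusive end
            // If the next chunk would start with a continuation byte, the cut
            // is inside a character: back up to that character's first byte.
            while (to < bytes.length && isContinuation(bytes[to])) {
                to--;
            }
            chunks.add(new String(bytes, from, to - from, StandardCharsets.UTF_8));
            from = to;
        }
        return chunks;
    }
}
```

Because a UTF-8 character has at most three continuation bytes, the cut moves back by at most three positions, so every chunk still makes progress. The trade-off is memory: the whole string is held a second time as a byte array.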