I have come up with the following:
import java.util.ArrayList;
import java.util.List;

public static List<String> cutString(String s) {
    List<String> strings = new ArrayList<>();
    int index = 0;
    while (index < s.length()) {
        // Cut every 1048576 (1 MiB) characters -- but these are chars, not bytes
        strings.add(s.substring(index, Math.min(index + 1048576, s.length())));
        index += 1048576;
    }
    return strings;
}
But my problem is that with UTF-8 some characters take more than 1 byte, so using 1048576 to tell where to cut the String doesn't work. I was thinking about using an Iterator, but that doesn't seem efficient. What would be the most efficient solution for this? The chunks can be smaller than 1 MB to avoid character slicing, just not bigger than that!
Quick, unsafe hack
You can use
s.getBytes("UTF-8")
to get an array with the actual bytes of the string's UTF-8 encoding. Once you have that, it's just a matter of splitting the byte array into chunks of 1048576 bytes and turning each chunk back into a UTF-8 string with
new String(chunk, "UTF-8")
However, by doing it like that you can break a multi-byte character at the boundary between two chunks. Say a 3-byte character straddles the 1048576th byte: its first byte would go into the first chunk and the other two bytes into the second chunk, thus breaking the encoding.
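A minimal sketch of that naive byte split, so you can see the failure mode (the helper name cutBytesNaive and the chunkSize parameter are illustrative, not from the original post):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public static List<String> cutBytesNaive(String s, int chunkSize) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    List<String> chunks = new ArrayList<>();
    for (int offset = 0; offset < bytes.length; offset += chunkSize) {
        byte[] chunk = Arrays.copyOfRange(bytes, offset, Math.min(offset + chunkSize, bytes.length));
        // If the cut falls inside a multi-byte character, the broken bytes at the
        // chunk edges decode to the U+FFFD replacement character.
        chunks.add(new String(chunk, StandardCharsets.UTF_8));
    }
    return chunks;
}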
Proper approach
If you can relax the "1 MB" requirement, you can take a safer approach: split the string into chunks of 1048576 characters (not bytes), then test each chunk's real encoded length with
getBytes
and remove characters from the end as needed until the real size is equal to or less than 1 MB. An implementation along these lines won't break characters, at the expense of some chunks being smaller than the given size.
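Here is a sketch of that approach; the surrogate-pair checks and the one-character-at-a-time trimming loop are one straightforward way to realize it, assuming maxBytes is at least 4 (the largest UTF-8 code point size):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public static List<String> cutString(String s, int maxBytes) {
    List<String> chunks = new ArrayList<>();
    int start = 0;
    while (start < s.length()) {
        // A char encodes to at least 1 byte, so maxBytes chars is always enough.
        int end = Math.min(start + maxBytes, s.length());
        // Never cut between the two halves of a surrogate pair.
        if (end < s.length() && Character.isHighSurrogate(s.charAt(end - 1))) {
            end--;
        }
        // Trim characters off the end until the encoded size fits.
        // (Re-encoding on every pass is simple but not optimal; fine for a sketch.)
        while (s.substring(start, end).getBytes(StandardCharsets.UTF_8).length > maxBytes) {
            end--;
            if (Character.isHighSurrogate(s.charAt(end - 1))) {
                end--; // keep surrogate pairs together
            }
        }
        chunks.add(s.substring(start, end));
        start = end;
    }
    return chunks;
}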
I tested it with
chunkSize = 24
so you can see the effect; it should work just as well with any other size, including a 3 MB string like the one you mention in your comments.
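A small test harness along these lines (the sample text is illustrative; it mixes 1-, 2-, 3- and 4-byte UTF-8 characters so the trimming is visible):

import java.nio.charset.StandardCharsets;

public static void main(String[] args) {
    // Repeat a mix of 1-, 2-, 3- and 4-byte characters: "a", "é", "中", "😀"
    String text = "a\u00E9\u4E2D\uD83D\uDE00".repeat(20);
    for (String chunk : cutString(text, 24)) {
        int realSize = chunk.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(realSize + " bytes:\t" + chunk);
    }
}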
You can use a ByteArrayOutputStream with an OutputStreamWriter: write the string through the writer so the characters are encoded incrementally, and watch the stream's byte count to decide where each chunk ends.
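One way that could look, assuming you write one code point at a time and roll an overflowing code point into the next chunk (the method name cutStreaming is illustrative, and maxBytes is assumed to be at least 4):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public static List<String> cutStreaming(String s, int maxBytes) throws IOException {
    List<String> chunks = new ArrayList<>();
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8);
    int i = 0;
    while (i < s.length()) {
        int cpLen = Character.charCount(s.codePointAt(i));
        int sizeBefore = buffer.size();
        writer.write(s, i, cpLen); // always write whole code points
        writer.flush();            // push the encoded bytes into the buffer
        if (buffer.size() > maxBytes) {
            // The last code point overflowed the limit: emit everything before
            // it and start the next chunk with the overflowing bytes.
            byte[] all = buffer.toByteArray();
            chunks.add(new String(all, 0, sizeBefore, StandardCharsets.UTF_8));
            buffer.reset();
            buffer.write(all, sizeBefore, all.length - sizeBefore);
        }
        i += cpLen;
    }
    if (buffer.size() > 0) {
        chunks.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
    return chunks;
}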