So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.
I'm thinking about trying something like:
- using String#charAt(int) to get the char at an index
- testing whether the char is in the high-surrogates range
- if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
- if not, use the given char value as the codepoint, and increment the index by 1
But my concerns are
- I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
- this seems like an awfully expensive way to iterate through characters
- someone must have come up with something better.
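For illustration, the steps listed above could be sketched roughly as follows. The class and method names (ManualCodePointWalk, walk) are hypothetical, and the extra check for a following low surrogate is an addition not in the original steps: without it, an unpaired high surrogate would cause the next char to be skipped. On the first concern, a surrogate that is not part of a pair occupies a single char, and the sketch treats it that way.

```java
import java.util.ArrayList;
import java.util.List;

class ManualCodePointWalk {
    // Hypothetical helper: walks the string char by char, following the steps above.
    static List<Integer> walk(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // Assumption: only treat c as half of a pair if a low surrogate follows.
            boolean paired = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (paired) {
                out.add(s.codePointAt(i)); // combines the two chars into one code point
                i += 2;
            } else {
                out.add((int) c); // BMP char, or an unpaired surrogate stored as one char
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1F600 is written here as the surrogate pair \uD83D \uDE00.
        System.out.println(walk("a\uD83D\uDE00b"));
    }
}
```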
Yes, Java uses UTF-16 for the internal representation of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, i.e. two char values per code point.
If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the code points of a Java String:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
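As a rough usage sketch of the loop above, the following counts code points and compares the result with String#length(), which counts char values. The class and method names (CodePointLoopDemo, countCodePoints) are made up for the example.

```java
class CodePointLoopDemo {
    // Hypothetical helper: counts code points using the offset-based loop.
    static int countCodePoints(String s) {
        int count = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            count++;
            // Advance by 1 for a BMP code point, by 2 for a surrogate pair.
            offset += Character.charCount(codepoint);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + countCodePoints(s));
    }
}
```

The built-in String#codePointCount(int, int) should give the same count if the count is all you need.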
Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points.
You can use the stream directly to iterate over them:
string.codePoints().forEach(c -> ...);
or with a for loop by collecting the stream into an array:
for(int c : string.codePoints().toArray()){
...
}
These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.
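A small sketch of what the stream form makes easy, assuming Java 8 or later. The class and method names here (CodePointStreamDemo, countSupplementary, roundTrip) are hypothetical; the library calls are CharSequence#codePoints, Character.isSupplementaryCodePoint(int) and the String(int[], int, int) constructor.

```java
class CodePointStreamDemo {
    // Hypothetical helper: counts code points outside the BMP.
    static long countSupplementary(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    // Hypothetical helper: collects the code points and rebuilds the String from them.
    static String roundTrip(String s) {
        int[] cps = s.codePoints().toArray();
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(countSupplementary(s));
        System.out.println(roundTrip(s).equals(s));
    }
}
```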
Iterating over code points is filed as a feature request at Sun.
See the Sun bug entry; it also includes an example of how to iterate over a String's code points.
Thought I'd add a workaround method that works with foreach loops (ref), plus you can convert it to Java 8's new String#codePoints method easily when you move to Java 8:
You can use it with foreach like this:
for(int codePoint : codePoints(myString)) {
....
}
Here's the helper method:
public static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                int nextIndex = 0;
                public boolean hasNext() {
                    return nextIndex < string.length();
                }
                public Integer next() {
                    int result = string.codePointAt(nextIndex);
                    nextIndex += Character.charCount(result);
                    return result;
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}
Or alternatively, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):
public static List<Integer> stringToCodePoints(String in) {
    if (in == null)
        throw new NullPointerException("got null");
    List<Integer> out = new ArrayList<Integer>();
    final int length = in.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = in.codePointAt(offset);
        out.add(codepoint);
        offset += Character.charCount(codepoint);
    }
    return out;
}
Thankfully, using "codePoints" safely handles the surrogate pair-ness of UTF-16 (Java's internal string representation).
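For what it's worth, the move to Java 8 mentioned above might look something like this. It's only a sketch: the class name CodePointIterable is made up, and it relies on the iterator returned by IntStream#iterator() (a PrimitiveIterator.OfInt) also being usable as an Iterator<Integer>, so a lambda can stand in for the anonymous Iterable.

```java
class CodePointIterable {
    // Same signature as the pre-Java-8 helper, backed by CharSequence#codePoints.
    static Iterable<Integer> codePoints(final String string) {
        // Each call to iterator() opens a fresh stream over the string.
        return () -> string.codePoints().iterator();
    }

    public static void main(String[] args) {
        for (int codePoint : codePoints("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(codePoint));
        }
    }
}
```

Existing foreach call sites should keep working unchanged, since the method name and return type stay the same.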