I am looking for a way to detect whether a character in a Java string is a combining character or not. For instance,
String khmerCombiningVowel =
    new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // U+17C0
represents a combining Khmer vowel sign. I have tried the regex "\\p{InCombiningDiacriticalMarks}", but it doesn't seem to apply to these particular combining characters. Alternatively, is there a comprehensive list of all Unicode combining character blocks, so that I could build a regex from them?
According to "Algorithm to check for combining characters in Unicode", there are a number of blocks for combining characters.
Java has a number of helpful functions, try:
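A minimal sketch of such a check, assuming the Khmer vowel sign U+17C0 from the question as input. The class and method names are hypothetical, chosen only for illustration; the calls themselves (Character.getType, Character.COMBINING_SPACING_MARK, and the \p{gc=Mc} regex syntax available since Java 7) are standard. The main method prints one boolean per check:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class SpacingMarkCheck {
    // Matches exactly one character of general category Mc (Mark, Spacing Combining).
    private static final Pattern MC = Pattern.compile("\\p{gc=Mc}");

    // Check via the general category reported by Character.getType.
    static boolean isSpacingMarkByType(int codePoint) {
        return Character.getType(codePoint) == Character.COMBINING_SPACING_MARK;
    }

    // Check via regex: true only if the whole string is a single Mc character.
    static boolean isSpacingMarkByRegex(String s) {
        return MC.matcher(s).matches();
    }

    public static void main(String[] args) {
        String khmerVowel = "\u17C0"; // KHMER VOWEL SIGN IE
        System.out.println(isSpacingMarkByType(khmerVowel.codePointAt(0)));
        System.out.println(isSpacingMarkByRegex(khmerVowel));
    }
}
```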
(prints true in both cases)
In this case, COMBINING_SPACING_MARK and the related regex \p{gc=Mc} both refer to the Unicode general category "Mark, Spacing Combining", which is basically any character that combines with a previous character while also adding width. Other regular expressions that may be useful:
\p{M} for any kind of mark. If you want to use the Character.getType() constants, you can get the same behavior by checking whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.
ENCLOSING_MARK is a surrounding character, like a circle; it also adds width to the character it combines with.
NON_SPACING_MARK includes the Latin alphabet diacritical combining marks, etc. (marks that basically go on top or below and don't add any width to the character).
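The three-constant check described above could be wrapped in a small helper, sketched here next to the \p{M} regex. The class name CombiningMarks and its method names are hypothetical; the sample code points (U+17C0, U+0301 COMBINING ACUTE ACCENT, U+20DD COMBINING ENCLOSING CIRCLE) are picked as one example of each mark category.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class CombiningMarks {
    // \p{M} matches any mark: Mn, Mc, or Me.
    private static final Pattern ANY_MARK = Pattern.compile("\\p{M}");

    // True if the code point belongs to any of the three mark categories.
    static boolean isCombiningMark(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.COMBINING_SPACING_MARK: // Mc
            case Character.ENCLOSING_MARK:         // Me
            case Character.NON_SPACING_MARK:       // Mn
                return true;
            default:
                return false;
        }
    }

    // True if the string contains at least one mark character anywhere.
    static boolean containsMark(String s) {
        return ANY_MARK.matcher(s).find();
    }

    public static void main(String[] args) {
        int[] samples = {0x17C0, 0x0301, 0x20DD, 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isCombiningMark(cp));
        }
        System.out.println(containsMark("e\u0301")); // 'e' followed by a combining acute accent
    }
}
```

Working with int code points rather than char keeps the helper usable for marks outside the Basic Multilingual Plane.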