I have a regular expression, \p{L}\p{M}*, which I use to split words into characters. This is particularly needed for Hindi or Thai words, where one character can be made up of several 'characters' (code points), such as मछली. If I split it the regular way in Java I get [म][छ][ल][ी], whereas I want [म][छ][ली].
I have been trying to improve this regular expression to include space characters as well, so that when I split फार्म पशु I would get the following groups: [फा][र्][म][ ][प][शु]
But I haven't had any luck. Would anyone be able to help me out?
Also, if anyone has an alternative way of doing this in Java, that would be a welcome solution too. My current Java code is:
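// word is the input String; characters is assumed here to be a List<String>
List<String> characters = new ArrayList<>();
// one letter followed by any number of combining marks
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(word);
while (matcher.find()) {
    characters.add(matcher.group());
}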
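To make this easier to reproduce, here is a self-contained sketch of the same loop. The class name SplitDemo and the split helper are names made up for this example, and the two test words are written as Unicode escapes only so the file stays ASCII. With the current pattern, the first word should come back as [म, छ, ली], and the second as [फा, र्, म, प, शु], with the space missing because a space is not matched by \p{L}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
    // Hypothetical helper: collects every match of the regex in the word, in order.
    static List<String> split(String regex, String word) {
        List<String> characters = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(word);
        while (matcher.find()) {
            characters.add(matcher.group());
        }
        return characters;
    }

    public static void main(String[] args) {
        // The pattern from the question: a letter plus any combining marks.
        String current = "\\p{L}\\p{M}*";

        // The two example words from the question, as Unicode escapes.
        String fish = "\u092E\u091B\u0932\u0940";
        String farmAnimal = "\u092B\u093E\u0930\u094D\u092E \u092A\u0936\u0941";

        System.out.println(split(current, fish));
        System.out.println(split(current, farmAnimal));
    }
}
```

Passing the regex in as a parameter makes it easy to try other candidate patterns against the same two words.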