I'm trying to gather a Unicode list of all the 'o' like shapes in the Hindi character-set. In fact, a list of any characters (in any language) that makes uses of separate characters to indicate an accent would be better.
I intend to use this unicode-list in a RegExp.
I been trying to edit a list of character-ranges by outputting them in an Input TextField, but editing this text causes weird issues (the keyboard-cursor isn't place on the correct character, selections suddenly dissappear / incorrectly warps... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents. An example of a few are in the image below (but I would need ALL accents):
Thanks!
You can find pdf's containing lists of unicode ranges, grouped by language, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
If you want the complete set (for all languages), you can do it problematically. You start from the Unicode date file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions)
You can use the Canonical_Combining_Class field (see at http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want. Can't be more precise, because "accent" a bit vague :-) You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors. One of the characteristics of combining characters is that they combine :-) So you might get all kind of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)
Here is the character class for Devanagari combining marks:
This is only the basic Devanagari block (not Devanagari Extended).