How do I split Tamil characters in a string?
When I use preg_match_all('/./u', $str, $results), I get the characters "த", "ம", "ி", "ழ" and "்".
How do I get the combined characters "த", "மி" and "ழ்"?
I think you should be able to use the grapheme_extract function to iterate over the combined characters (which are technically called "grapheme clusters").
Alternatively, if you prefer the regex approach, I think you can use this:
where \pL means a Unicode "letter", and \pM means a Unicode "mark".
(Disclaimer: I have not tested either of these approaches.)
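As a rough illustration of the letter-plus-marks idea, here is a minimal Python sketch. Python's standard re module has no \pL or \pM classes, so the sketch checks unicodedata.category instead. The helper name split_clusters is hypothetical, and attaching each mark to whatever character precedes it is a simplification, not a full grapheme-cluster algorithm.

```python
import unicodedata


def split_clusters(text):
    """Group each character with the combining marks that follow it.

    Assumption for this sketch: a "mark" is any codepoint whose Unicode
    general category starts with "M" (Mn, Mc or Me).
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # A mark extends the cluster that is currently being built.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


# The word from the question: three clusters from five codepoints.
print(split_clusters("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
```

For the word in the question this groups the five codepoints into the three combined characters asked for.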
If I understand your question correctly, you've got a Unicode string containing codepoints, and you want to convert this into an array of graphemes?
I'm working on developing an open source Python library to do tasks like this for a Tamil Language website.
I haven't used PHP in a while, so I'll post the logic. You can take a look at the code in the amuthaa/TamilWord.py file's split_letters() function.
As ruakh mentioned, Tamil graphemes are built from one or more codepoints.
The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all the combinations (உயிர்-மெய் எழுத்து) in the 'a' column (அ வரி, i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single codepoint.
Every consonant is made up of two codepoints: the a-combination letter + the pulli. E.g. ப் = ப + ்
Every combination other than the a-combinations is also made up of two codepoints: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை
So your logic is going to be something like this:
This of course assumes that your string is well-formed and you don't have things like two markings in a row.
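The rule described above (one base codepoint, optionally followed by one marking) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from amuthaa/TamilWord.py: the name split_tamil_letters and the TAMIL_MARKS set are hypothetical, and the set lists the Tamil vowel signs and the pulli by codepoint as I understand the Unicode Tamil block.

```python
# Assumed set of Tamil markings: dependent vowel signs plus the pulli (U+0BCD).
TAMIL_MARKS = {chr(cp) for cp in (
    0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,  # aa, i, ii, u, uu
    0x0BC6, 0x0BC7, 0x0BC8,                  # e, ee, ai
    0x0BCA, 0x0BCB, 0x0BCC,                  # o, oo, au
    0x0BCD,                                  # pulli
)}


def split_tamil_letters(word):
    """Split a well-formed Tamil string into letters of one or two codepoints."""
    letters = []
    i = 0
    while i < len(word):
        # Look ahead: if the next codepoint is a marking, it belongs to this letter.
        if i + 1 < len(word) and word[i + 1] in TAMIL_MARKS:
            letters.append(word[i:i + 2])
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(split_tamil_letters("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
```

Because it only ever looks one codepoint ahead, a second marking in a row would come out as a letter of its own, which is why the well-formedness assumption matters.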
Here's the Python code, in case you find it helpful. If you want to help us port this to PHP, please let me know as well: