I have a regular expression to get the initials of a name like below:
/\b\p{L}\./gu
It works fine with English and other languages until graphemes made of combined characters occur.
For example, क in Hindi and ಕ in Kannada are matched.
But के in Hindi and ಕೆ in Kannada are not matched by this regex.
I am trying to get the initials from a name like J.P.Morgan, etc.
Any help would be greatly appreciated.
You need to match diacritic marks after base letters using \p{M}*:
'~\b(?<!\p{M})\p{L}\p{M}*\.~u'
The pattern matches:
\b - a word boundary
(?<!\p{M}) - the character before the current position must not be a combining mark (without it, a match can start in the middle of a word, right after a mark)
\p{L} - any Unicode letter (the base character)
\p{M}* - zero or more combining marks (diacritics, vowel signs)
\. - a literal dot.
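To see why \p{M}* is needed, here is a minimal sketch that splits a string into code points and labels each one as a letter or a combining mark. The helper name classifyCodePoints is hypothetical and is not part of the question or of PHP. The \u{...} escapes (PHP 7+) spell out the code points of क, के and ಕೆ explicitly.

```php
<?php
// Hypothetical helper: label each code point of $s as
// 'L' (letter, \p{L}), 'M' (combining mark, \p{M}) or '?' (anything else).
function classifyCodePoints(string $s): array
{
    $labels = [];
    // An empty pattern with the u flag splits the string between code points.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        if (preg_match('/^\p{L}$/u', $ch)) {
            $labels[] = 'L';
        } elseif (preg_match('/^\p{M}$/u', $ch)) {
            $labels[] = 'M';
        } else {
            $labels[] = '?';
        }
    }
    return $labels;
}

// क is a single letter.
echo implode(' ', classifyCodePoints("\u{0915}")), "\n";          // L
// के is the letter क followed by the vowel sign U+0947.
echo implode(' ', classifyCodePoints("\u{0915}\u{0947}")), "\n";  // L M
// ಕೆ is the letter ಕ followed by the vowel sign U+0CC6.
echo implode(' ', classifyCodePoints("\u{0C95}\u{0CC6}")), "\n";  // L M
```

Since के. is letter, mark, dot, a pattern that expects the dot right after \p{L} cannot match it.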
PHP demo:
$s = "क. ಕ. के. ಕೆ. ";
echo preg_replace('~\b(?<!\p{M})\p{L}\p{M}*+\.~u', '<pre>$0</pre>', $s);
// => <pre>क.</pre> <pre>ಕ.</pre> <pre>के.</pre> <pre>ಕೆ.</pre>
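To collect the initials instead of wrapping them in tags, the same pattern can be passed to preg_match_all. This is only a sketch: the function name extractInitials is hypothetical, and it assumes that an initial is always written with a trailing dot, as in J.P.Morgan.

```php
<?php
// Hypothetical helper: return every "letter + optional combining marks + dot"
// sequence that starts at a word boundary in $name.
function extractInitials(string $name): array
{
    $pattern = '~\b(?<!\p{M})\p{L}\p{M}*+\.~u';
    // preg_match_all returns false on failure (for example, invalid UTF-8 input).
    if (preg_match_all($pattern, $name, $matches) === false) {
        return [];
    }
    return $matches[0];
}

print_r(extractInitials("J.P.Morgan"));
// Array ( [0] => J. [1] => P. )

// के. and ಕೆ. written with explicit code points.
print_r(extractInitials("\u{0915}\u{0947}. \u{0C95}\u{0CC6}. "));
// Array ( [0] => के. [1] => ಕೆ. )
```

A word that merely ends in a dot, such as केक., yields no initials with this pattern, because the letter before the dot is neither at the start of the word nor allowed to follow a combining mark.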