regular expression to match name initials - PCRE

Posted 2020-04-19 05:49

Question:

I have a regular expression to get the initials of a name like below:

/\b\p{L}\./gu

It works fine with English and other languages until graphemes built from combining characters occur. For example,
क in Hindi and
ಕ in Kannada
are matched.
But
के in Hindi and
ಕೆ in Kannada
are not matched by this regex.
I am trying to get the initials from a name like J.P.Morgan, etc.
Any help would be greatly appreciated.

Answer 1:

You need to match diacritic marks after base letters using \p{M}*:

'~\b(?<!\p{M})\p{L}\p{M}*\.~u'

The pattern matches

  • \b - a word boundary
  • (?<!\p{M}) - the char before the current position must not be a diacritic char (without it, a match can occur within a single word)
  • \p{L} - any base Unicode letter
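  • \p{M}* - zero or more combining marks (the demo uses the possessive form \p{M}*+, which matches the same text without backtracking)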
  • \. - a dot.

See the PHP demo:

$s = "क. ಕ. के. ಕೆ. ";
echo preg_replace('~\b(?<!\p{M})\p{L}\p{M}*+\.~u', '<pre>$0</pre>', $s); 
// => <pre>क.</pre> <pre>ಕ.</pre> <pre>के.</pre> <pre>ಕೆ.</pre>
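
If the goal is to collect the initials rather than wrap them in tags, the same pattern can be passed to preg_match_all. The following is only a minimal sketch: the helper name extract_initials is hypothetical (it is not part of PHP or of the question), and it assumes that each initial is a single letter, optionally followed by combining marks, and then a dot.

```php
<?php
// Hypothetical helper: returns every "initial + dot" found in $name.
function extract_initials(string $name): array
{
    // Same pattern as in the demo: one base letter, its combining marks, then a dot.
    $pattern = '~\b(?<!\p{M})\p{L}\p{M}*+\.~u';

    // preg_match_all returns false on failure (for example, invalid UTF-8 input).
    if (preg_match_all($pattern, $name, $matches) === false) {
        return [];
    }

    // $matches[0] holds the full text of each match.
    return $matches[0];
}

print_r(extract_initials('J.P.Morgan'));   // the two matches are "J." and "P."
print_r(extract_initials('के. ಕೆ. Morgan')); // the two matches are "के." and "ಕೆ."
```

To get the initials without the trailing dot, one option is to wrap \p{L}\p{M}*+ in a capturing group and read $matches[1] instead.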