PHP supports regular expressions in three ways:
- POSIX ERE (the ereg functions), removed as of PHP 7
- PCRE, which is a core component but not always multibyte safe
- Multibyte String, which is not enabled by default
Today the web is Unicode, and so is PHP since 5.6, for the sake of i18n. While PHP itself is known for abysmal Unicode support, Intl provides access to the ICU library, which is a relief.
To avoid the long wait for UString, and the repetition (and memory use) that comes with doing it right, I prefer Intl: I leave out iconv, Multibyte String and DateTime, and rewrite most of the SBCS (single-byte) string functions to be multibyte safe. In the process some issues arise:
- Locale-aware formatting of large numbers is problematic on 32-bit platforms (such as NASes) when the database stores 64-bit numbers. This can be solved by handling the numbers as strings via BCMath.
- The Intl wrapper has no support for ICU's regular expression functions, so only the Unicode variant of PCRE remains.
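To illustrate the first point, here is a minimal sketch of keeping a 64-bit value as a decimal string, assuming the BCMath extension is loaded. The helper group_digits() is hypothetical (it is not part of BCMath or Intl) and the comma separator is just a placeholder for whatever the locale requires.

```php
<?php
// Assumes the BCMath extension is loaded; values stay decimal strings throughout.
$max64 = '9223372036854775807';   // largest signed 64-bit integer
$sum   = bcadd($max64, '1');      // exact result, no overflow and no float conversion

// Hypothetical helper: insert a grouping separator into a string of decimal
// digits without ever casting it to int or float.
function group_digits(string $digits, string $separator = ','): string
{
    $sign = '';
    if ($digits !== '' && $digits[0] === '-') {
        $sign   = '-';
        $digits = substr($digits, 1);
    }
    // Match every position that is followed by a multiple of three digits
    // up to the end of the string, and put the separator there.
    return $sign . preg_replace('/\B(?=(\d{3})+$)/', $separator, $digits);
}

$formatted = group_digits($sum);
```

The point of the sketch is only that the value never passes through a native integer, so the platform's integer width does not matter.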
To use PCRE with Unicode syntax, PHP's built-in PCRE has to be compiled and configured with Unicode support. On some systems it is not configured with Unicode; adding (*UTF8) before the expression overrides the configuration.
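A small sketch of how this can be checked at runtime, using the u pattern modifier rather than the (*UTF8) prefix. The function names pcre_has_utf8() and pcre_has_unicode_properties() are made up for this example; the only assumption is PHP 7 or later, for the "\u{...}" string escapes.

```php
<?php
// Probe whether the bundled PCRE accepts the u (UTF-8) modifier.
// preg_match() returns false when the pattern cannot be compiled,
// and @ suppresses the warning in that case.
function pcre_has_utf8(): bool
{
    return @preg_match('/^.$/u', "\u{00E9}") === 1;
}

// Probe for Unicode property support (\p{L} matches any letter).
function pcre_has_unicode_properties(): bool
{
    return @preg_match('/^\p{L}$/u', "\u{00E9}") === 1;
}

$word = "na\u{00EF}ve";   // "naive" with a diaeresis: 5 code points, 6 bytes in UTF-8

// Without u the dot matches single bytes, so the two-byte character counts twice.
$bytes = preg_match_all('/./s', $word);

// With u the dot matches whole code points.
$chars = preg_match_all('/./su', $word);
```

If either probe returns false, patterns that rely on u or \p{...} will not behave as expected on that build.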
- Have I missed a way to work with ICU's regular expression functions from PHP?
- Are there any other pitfalls to take into account for Unicode PCRE?
- Have I missed a reason why Multibyte String should be used?