I'm developing a plugin for a CMS and have an unanticipated problem: because the plugin is multilang-enabled, input can be of any of the Unicode character sets. The plugin saves data in JSON format, and contains objects with properties `value` and `lookup`. For `value` everything is fine, but the `lookup` property is used by PHP to retrieve these entities, and at certain points through regexes (content filters).
The problems are:
- For non-Latin characters (e.g. Экспорт), the `\w` (word-char) in a regex matches nothing. Is there any way to recognize Cyrillic chars as word chars? Any other hidden catches?
- The data format being JSON, non-Latin characters are converted to JS Unicode escapes, e.g. for the above: `\u042D\u043A\u0441\u043F\u043E\u0440\u0442`. Is it safe not to do this? (server restrictions etc.)
And the big 'design' question I have stems from the previous 2 problems:
Should I allow users with non-Latin alphabet languages to use their own chars for the `lookup` properties, or should I restrict them to traditional 'word' chars, that is a, b, c etc. plus underscore (thus an alphabet from another language)? I'd welcome technical advice to guide this decision (not UX advice).
First question
You just have to turn on the `u` flag: Demo.
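As a minimal sketch of the difference the flag makes (the helper name `is_word_lookup` is hypothetical, and the file is assumed to be saved as UTF-8), the same `\w` pattern can be run with and without `u`:

```php
<?php
// Hypothetical helper: true when $lookup consists only of word characters.
// With $unicode = true the pattern carries the `u` modifier, so pattern and
// subject are treated as UTF-8; without it, matching is byte-oriented.
function is_word_lookup(string $lookup, bool $unicode = true): bool
{
    $pattern = $unicode ? '/^\w+$/u' : '/^\w+$/';
    return preg_match($pattern, $lookup) === 1;
}

var_dump(is_word_lookup('Экспорт', false)); // byte-oriented match
var_dump(is_word_lookup('Экспорт'));        // UTF-8 aware match
var_dump(is_word_lookup('export_1'));       // plain ASCII lookup
```

Keeping the check in one place like this would make it easier to apply the same rule wherever the content filters read `lookup`.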
The PHP docs are misleading here:
I say it's misleading because, judging from the ideone test above, the flag enables not only PCRE_UTF8 but also PCRE_UCP (Unicode Character Properties), which is the behavior you want here.
Here's what the PCRE docs say about it:
If you want to make it obvious at first sight that the `PCRE_UCP` flag will be set, you can insert it into the pattern itself, at the start, as `(*UCP)`.
Second question
It's safe not to do this as long as your `Content-Type` header defines the right encoding. So you may want to use something like `Content-Type: application/json; charset=utf-8`.
And make sure you actually send it in UTF-8.
However, encoding these characters in escape sequences makes the whole thing ASCII compatible, so you basically eliminate the problem altogether in this way.
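The two options can be compared side by side. This is only a sketch: the `$entity` array is a made-up sample using the question's property names, and the file is assumed to be saved as UTF-8. `json_encode` escapes non-ASCII characters by default, while the `JSON_UNESCAPED_UNICODE` option keeps them as raw UTF-8:

```php
<?php
// Hypothetical sample entity using the question's property names.
$entity = ['value' => 'Export', 'lookup' => 'Экспорт'];

// Default: non-ASCII characters become \uXXXX escapes (pure ASCII output).
$escaped = json_encode($entity);

// JSON_UNESCAPED_UNICODE keeps the characters as raw UTF-8 bytes.
$raw = json_encode($entity, JSON_UNESCAPED_UNICODE);

// Both forms decode back to the same data.
var_dump(json_decode($escaped, true) === json_decode($raw, true));

// When serving the raw form over HTTP, declare the encoding first:
// header('Content-Type: application/json; charset=utf-8');
```

Since both forms decode identically, the choice mostly affects file size, readability of the stored JSON, and how much you need to trust every layer to preserve UTF-8.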
Design question
Technically, as long as your whole stack supports Unicode (Browser, PHP, Database etc) I see no problem with this approach. Just make sure to test it well and to use Unicode-enabled column types in your DB.
Be careful, PHP is a terrible language for string support, so you have to make sure you use the right functions (avoid non-Unicode-aware ones like `strlen` etc. unless you really want the byte count).
It may be a bit more work to make sure everything works like it's supposed to, but if that's something you want to support there's no problem with that.
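To illustrate the byte-count pitfall, here is a small sketch. The helper `char_count` is hypothetical; it assumes UTF-8 input and uses `mb_strlen` when the mbstring extension is loaded, falling back to a `u`-flag regex otherwise:

```php
<?php
// Hypothetical helper: count characters (code points) rather than bytes.
function char_count(string $s): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback without mbstring: count "any character" matches in UTF-8 mode.
    return (int) preg_match_all('/./su', $s);
}

$lookup = 'Экспорт';
echo strlen($lookup), "\n";     // bytes in the UTF-8 encoding
echo char_count($lookup), "\n"; // characters
```

For this seven-letter Cyrillic word the two numbers differ, which is exactly the kind of mismatch that breaks length limits or substring offsets on `lookup` values.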