Design decision: matching Cyrillic chars in JSON

Posted 2019-08-15 00:27

I'm developing a plugin for a CMS and have run into an unanticipated problem: because the plugin is multilanguage-enabled, input can be in any Unicode script. The plugin saves its data in JSON format, as objects with value and lookup properties. For value everything is fine, but the lookup property is used by PHP to retrieve these entities, at certain points through regexes (content filters). The problems are:

  1. For non-Latin characters (e.g. Экспорт), \w (word character) in a regex matches nothing. Is there any way to recognize Cyrillic chars as word chars? Any other hidden catches?
  2. The data format being JSON, non-Latin characters are converted to Unicode escape sequences, e.g. for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)

And the big 'design' question I have stems from the previous 2 problems:

Should I allow users of non-Latin-alphabet languages to use their own characters for the lookup properties, or should I force them to the traditional 'word' chars, that is a, b, c etc. plus underscore (which, for those users, is an alphabet from another language)? I'd welcome technical advice to guide this decision (not a UX one).

1 Answer
做个烂人
Answered 2019-08-15 00:39

First question

For non-Latin characters (e.g. Экспорт), \w (word character) in a regex matches nothing. Is there any way to recognize Cyrillic chars as word chars? Any other hidden catches?

You just have to turn on the u flag:

preg_match("#^\w+$#u", $str);

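As a quick sanity check (a minimal sketch; the sample strings here are arbitrary), the same pattern fails without the flag and succeeds with it:

```php
<?php
// Without the u modifier, \w matches only ASCII word characters,
// so a purely Cyrillic subject does not match at all.
var_dump(preg_match('#^\w+$#', 'Экспорт'));  // int(0)

// With the u modifier, pattern and subject are treated as UTF-8
// and \w classifies characters by their Unicode properties.
var_dump(preg_match('#^\w+$#u', 'Экспорт')); // int(1)

// ASCII input matches either way.
var_dump(preg_match('#^\w+$#', 'Export'));   // int(1)
```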

The PHP docs are misleading here:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

I say it's misleading because, as a quick test shows, the u modifier enables not only PCRE_UTF8 but also PCRE_UCP (Unicode Character Properties), which is the behavior you want here.

Here's what the PCRE docs say about it:

PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

If you want to make it obvious at first sight that PCRE_UCP is set, you can put it in the pattern itself, at the start, like this:

preg_match("#(*UCP)^\w+$#u", $str);

Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.

Second question

The data format being JSON, non-Latin characters are converted to Unicode escape sequences, e.g. for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)

It's safe not to do this as long as your Content-Type header defines the right encoding.

So you may want to use something like:

header('Content-Type: application/json; charset=utf-8');

And make sure the response you send is actually UTF-8 encoded.

However, encoding these characters as escape sequences keeps the whole payload ASCII-compatible, which sidesteps the encoding problem altogether.
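In PHP, json_encode lets you choose either behavior (a minimal sketch; the array is an arbitrary sample, and the JSON_UNESCAPED_UNICODE flag requires PHP 5.4+):

```php
<?php
$data = ['lookup' => 'Экспорт'];

// Default behavior: non-ASCII characters are escaped, so the
// output is pure ASCII and immune to transport encoding issues.
echo json_encode($data), "\n";
// {"lookup":"\u042d\u043a\u0441\u043f\u043e\u0440\u0442"}

// JSON_UNESCAPED_UNICODE keeps the raw UTF-8 characters instead.
echo json_encode($data, JSON_UNESCAPED_UNICODE), "\n";
// {"lookup":"Экспорт"}
```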

Design question

Should I allow users of non-Latin-alphabet languages to use their own characters for the lookup properties, or should I force them to the traditional 'word' chars, that is a, b, c etc. plus underscore (which, for those users, is an alphabet from another language)? I'd welcome technical advice to guide this decision (not a UX one).

Technically, as long as your whole stack supports Unicode (browser, PHP, database, etc.) I see no problem with this approach. Just make sure to test it well and to use Unicode-capable column types in your database (e.g. utf8mb4 in MySQL).

Be careful: PHP's string support is terrible, so you have to make sure you use the right functions. Avoid non-Unicode-aware ones like strlen (unless you really want the byte count); the mb_* family is the Unicode-aware alternative.
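For example (a minimal sketch, assuming the mbstring extension is available), the byte-oriented strlen and the Unicode-aware mb_strlen disagree on the same Cyrillic string:

```php
<?php
$s = 'Экспорт'; // 7 Cyrillic characters, 2 bytes each in UTF-8

echo strlen($s), "\n";                   // 14 -- counts bytes
echo mb_strlen($s, 'UTF-8'), "\n";       // 7  -- counts characters

// The same trap exists for substr/mb_substr, strtolower/mb_strtolower, etc.
echo mb_substr($s, 0, 3, 'UTF-8'), "\n"; // Экс
```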

It may be a bit more work to make sure everything works like it's supposed to, but if that's something you want to support there's no problem with that.
