PostgreSQL regex to match uppercase, Unicode-aware

Posted 2020-04-10 02:47

The title sums it up pretty well. I'm looking for a regular expression that matches any Unicode uppercase character, for use with the Postgres ~ operator. The obvious way doesn't work:

=> select 'A' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ó' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ą' ~ '[[:upper:]]';
 ?column? 
----------
 f
(1 row)

I'm using PostgreSQL 9.1 and my locale is set to pl_PL.UTF-8. Sorting works fine.

=> show LC_CTYPE;
  lc_ctype   
-------------
 pl_PL.UTF-8
(1 row)

Tags: regex postgresql unicode

2 answers

走好不送

#2 · 2020-04-10 03:06

I've found that Perl regular expressions handle Unicode perfectly.

create extension plperl;

create function is_letter_upper(text) returns boolean
immutable strict language plperl
as $$
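    # Apply Unicode semantics to string operations within this scope.
    use feature 'unicode_strings';
    # Matches when the argument is exactly one uppercase character.
    return $_[0] =~ /^\p{IsUpper}$/ ? "true" : "false";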
$$;

Tested on PostgreSQL 9.2 with Perl 5.16.2.
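To cross-check the function's results outside the database, here is a rough Python sketch of a similar test. It is only an approximation under stated assumptions: it uses the Unicode general category `Lu`, which is somewhat narrower than Perl's `\p{IsUpper}`, and the name `is_letter_upper` is reused here purely for illustration.

```python
import unicodedata


def is_letter_upper(text: str) -> bool:
    """Return True if text is exactly one character in Unicode category Lu."""
    # Approximation of the PL/Perl function above: single character, Lu only.
    return len(text) == 1 and unicodedata.category(text) == "Lu"


# A, Ó (U+00D3), Ą (U+0104), and lowercase ą (U+0105) for contrast
for ch in ("A", "\u00d3", "\u0104", "\u0105"):
    print(ch, is_letter_upper(ch))
```

All three uppercase characters from the question should be classified the same way here, which is the behaviour the question expects from `[[:upper:]]`.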


一夜七次

#3 · 2020-04-10 03:12

The regexp engine of PG 9.1 and older versions does not correctly classify characters whose codepoint doesn't fit in one byte. The codepoint of 'Ó' is 211, so it gets that one right, but the codepoint of 'Ą' is 260, which is beyond 255.
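The codepoints quoted above are easy to verify. The following Python sketch (Python is used here only for illustration, and the helper name `fits_one_byte` is made up) prints each character's codepoint and whether it falls at or below the 255 threshold described in this answer.

```python
ONE_BYTE_MAX = 255  # threshold described above for PG 9.1 and older

# The three characters from the question, written as escapes to avoid
# any ambiguity about composed vs. decomposed forms.
samples = {"A": "\u0041", "Ó": "\u00d3", "Ą": "\u0104"}


def fits_one_byte(ch: str) -> bool:
    """True if the character's codepoint is at most 255."""
    return ord(ch) <= ONE_BYTE_MAX


for label, ch in samples.items():
    print(label, ord(ch), fits_one_byte(ch))
```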

PG 9.2 is better at this, though still not 100% right for all alphabets. See this commit in the PostgreSQL source code, and particularly these parts of the comment:

remove the hard-wired limitation to not consider wctype.h results for character codes above 255

and

Still, we can push it up to U+7FF (which I chose as the limit of 2-byte UTF8 characters), which will at least make Eastern Europeans happy pending a better solution

Unfortunately, this was not backported to 9.1.
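To see which characters fall under the U+7FF limit quoted above, one can compare codepoints against it and look at the UTF-8 encoded length. This is a small Python sketch; the sample characters and the helper name `within_limit` are chosen here for illustration only.

```python
TWO_BYTE_UTF8_MAX = 0x7FF  # limit quoted above


def within_limit(ch: str) -> bool:
    """True if the character's codepoint is at or below U+07FF."""
    return ord(ch) <= TWO_BYTE_UTF8_MAX


# Ą (U+0104, Polish), Я (U+042F, Cyrillic), Ḁ (U+1E00, Latin Extended Additional)
for ch in ("\u0104", "\u042f", "\u1e00"):
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), within_limit(ch))
```

Characters above U+7FF, such as U+1E00, take three bytes in UTF-8 and would presumably still fall outside the range covered by that change, which is consistent with the remark that 9.2 is not yet right for all alphabets.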
