The title sums it up pretty well. I'm looking for a regular expression matching Unicode uppercase character for the Postgres ~ operator. The obvious way doesn't work:
=> select 'A' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ó' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ą' ~ '[[:upper:]]';
?column?
----------
f
(1 row)
I'm using Postgresql 9.1 and my locale is set to pl_PL.UTF-8. The ordering works fine.
=> show LC_CTYPE;
lc_ctype
-------------
pl_PL.UTF-8
(1 row)
I've found that perl regular expressions handles Unicode perfectly.
Tested on postgres 9.2 with perl 5.16.2.
The regexp engine of PG 9.1 and older versions does not correctly classify characters whose codepoint doesn't fit it one byte. The codepoint of
'Ó'
being 211 it gets it right, but the codepoint of'Ą'
is 260, beyond 255.PG 9.2 is better at this, though still not 100% right for all alphabets. See this commit in PostgreSQL source code, and particularly these parts of the comment:
and
Unfortunately this was not backported to 9.1