The title sums it up pretty well. I'm looking for a regular expression matching Unicode uppercase character for the Postgres ~ operator.
The obvious way doesn't work:
=> select 'A' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ó' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ą' ~ '[[:upper:]]';
?column?
----------
f
(1 row)
I'm using Postgresql 9.1 and my locale is set to pl_PL.UTF-8. The ordering works fine.
=> show LC_CTYPE;
lc_ctype
-------------
pl_PL.UTF-8
(1 row)
The regexp engine of PG 9.1 and older versions does not correctly classify characters whose codepoint doesn't fit it one byte.
The codepoint of 'Ó'
being 211 it gets it right, but the codepoint of 'Ą'
is 260, beyond 255.
PG 9.2 is better at this, though still not 100% right for all alphabets. See this commit in PostgreSQL source code, and particularly these parts of the comment:
remove the hard-wired limitation to not consider wctype.h results for
character codes above 255
and
Still, we can push it up to U+7FF (which I chose as the limit of
2-byte UTF8 characters), which will at least make Eastern Europeans
happy pending a better solution
Unfortunately this was not backported to 9.1
I've found that perl regular expressions handles Unicode perfectly.
create extension plperl;
create function is_letter_upper(text) returns boolean
immutable strict language plperl
as $$
use feature 'unicode_strings';
return $_[0] =~ /^\p{IsUpper}$/ ? "true" : "false";
$$;
Tested on postgres 9.2 with perl 5.16.2.