We have seen 4byte UTF-8 characters stored in oracle. We need to detect how many rows have such characters in a particular column.
Tried the recommendation given here - UTF-16 supplementary characters but that didn't work for me. It throws an error - ORA-12728: invalid range in regular expression. Tried the other answers too. None of them work.
Oracle documented the possible hex code ranges for UTF-8 in table B-2 in this page Unicode Character Set. I am trying to detect any character that falls in row 5 or 6 of that table ( "Supplementary characters: Additional Chinese, Japanese, and Korean characters; historic characters; musical symbols; mathematical symbols" OR "Private Use Area #2" )
Is there a regular expression I can use to detect such rows?
select count(*) from mytable where regexp_like(mycolumn, expression);
May be you did something wrong with regexp building:
There is short example.
-- create table:
create table tmp_a as
select unistr('\D841\DF0E') col from dual;
insert into tmp_a(col)
values(UNISTR('\D800\DC00'));
insert into tmp_a(col)
values(UNISTR('\D800\DC01'));
insert into tmp_a(col)
values(UNISTR('\D803\DC03'));
insert into tmp_a(col)
values(UNISTR('\041f'));
insert into tmp_a(col)
values('a');
insert into tmp_a(col)
values('b');
-- then check. There is should be 7 rows and only 4 should pass then "check"
select col, dump(col) , regexp_instr(col, '['||UNISTR('\F090\8080')||'-'||UNISTR('\F48F\BFBF')||']') as check from tmp_a ;
-- finaly we could build next query with regexp_like as in your example
select count(*)
from tmp_a
where regexp_like(col, '['||UNISTR('\F090\8080')||'-'||UNISTR('\F48F\BFBF')||']')
It works on Oracle 11.2.0.4 and 12.2.0.1
Maybe it is easier to look for surrogates. High surrogate is from U+D800
-U+DBFF
, low surrogate is from U+DC00
-U+DFFF
.
This one should also work, is looks for the low surrogate:
WHERE REGEXP_LIKE(ASCIISTR(mycolumn), '\\D[C-F]')