How to detect 4byte UTF8 characters in Oracle

2019-07-23 06:26发布

问题:

We have seen 4byte UTF-8 characters stored in oracle. We need to detect how many rows have such characters in a particular column.

Tried the recommendation given here - UTF-16 supplementary characters but that didn't work for me. It throws an error - ORA-12728: invalid range in regular expression. Tried the other answers too. None of them work.

Oracle documented the possible hex code ranges for UTF-8 in table B-2 in this page Unicode Character Set. I am trying to detect any character that falls in row 5 or 6 of that table ( "Supplementary characters: Additional Chinese, Japanese, and Korean characters; historic characters; musical symbols; mathematical symbols" OR "Private Use Area #2" )

Is there a regular expression I can use to detect such rows?

select count(*) from mytable where regexp_like(mycolumn, expression);

回答1:

May be you did something wrong with regexp building: There is short example.

-- create table: 
create table tmp_a as 
select unistr('\D841\DF0E') col from dual;
insert into tmp_a(col)
values(UNISTR('\D800\DC00'));
insert into tmp_a(col)
values(UNISTR('\D800\DC01'));
insert into tmp_a(col)
values(UNISTR('\D803\DC03'));
insert into tmp_a(col)
values(UNISTR('\041f'));
insert into tmp_a(col)
values('a');
insert into tmp_a(col)
values('b');


-- then check. There is should be 7 rows and only 4 should pass then "check"
select col, dump(col) , regexp_instr(col, '['||UNISTR('\F090\8080')||'-'||UNISTR('\F48F\BFBF')||']') as check from  tmp_a ;


-- finaly we could build next query with regexp_like as in your example 
select count(*) 
  from tmp_a 
 where regexp_like(col, '['||UNISTR('\F090\8080')||'-'||UNISTR('\F48F\BFBF')||']')

It works on Oracle 11.2.0.4 and 12.2.0.1



回答2:

Maybe it is easier to look for surrogates. High surrogate is from U+D800-U+DBFF, low surrogate is from U+DC00-U+DFFF.

This one should also work, is looks for the low surrogate:

WHERE REGEXP_LIKE(ASCIISTR(mycolumn), '\\D[C-F]')


标签: oracle utf-8