Replace unicode characters in PostgreSQL

Posted 2019-01-26 11:31

Question:

Is it possible to replace all the occurrences of a given character (expressed in unicode) with another character (expressed in unicode) in a varchar field in PostgreSQL?

I tried something like this:

UPDATE mytable 
SET myfield = regexp_replace(myfield, '\u0050', '\u0060', 'g')

But it seems that it really writes the string '\u0060' in the field and not the character corresponding to that code.

Answer 1:

According to the PostgreSQL documentation on lexical structure, you should use U& syntax:

UPDATE mytable
SET myfield = regexp_replace(myfield, U&'\0050', U&'\0060', 'g');

You can also use the PostgreSQL-specific escape-string form E'\u0050'. This form works on older versions that lack the Unicode escape form, but the Unicode escape form is preferred on newer versions. This should show what's going on:

regress=> SELECT '\u0050', E'\u0050', U&'\0050';
 ?column? | ?column? | ?column? 
----------+----------+----------
 \u0050   | P        | P
(1 row)
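
As a further check, the code points can be inspected directly. This is only a minimal sketch, assuming a UTF8-encoded database and standard_conforming_strings = on (which the U& form requires). chr() and ascii() are built-in functions; the column aliases are illustrative and not part of the original answer.

```sql
SELECT U&'\0050'    AS four_digit,   -- 4 hex digits after the backslash
       U&'\+000050' AS six_digit,    -- 6 hex digits after \+
       chr(80)      AS from_chr,     -- 80 decimal = 0x50
       ascii('`')   AS backtick_cp;  -- code point of the backtick, 96 = 0x60
```

Under those assumptions the first three columns should each show P and the last one 96.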


Answer 2:

It should work with the "characters corresponding to that code" unless some client or other layer in the food chain mangles your code!

Also, use translate() or replace() for this simple job. Both are much faster than regexp_replace(). translate() is also good for several simple single-character replacements at once.
And avoid empty updates with a WHERE clause. That is faster still, and it avoids table bloat and additional VACUUM cost.

UPDATE mytable
SET    myfield  = translate(myfield, 'P', '`')  -- actual characters
WHERE  myfield <> translate(myfield, 'P', '`');

If you keep running into problems, use the Unicode escape syntax @mvp provided:

UPDATE mytable
SET   myfield =  translate(myfield, U&'\0050', U&'\0060')
WHERE myfield <> translate(myfield, U&'\0050', U&'\0060');
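
The answer mentions replace() and multiple replacements with translate() without showing them. The following is a hedged sketch, reusing the mytable and myfield names from the question; the second character pair (Q to a) is purely illustrative and not from the original post.

```sql
-- replace() substitutes every occurrence of a substring.
-- The WHERE clause skips rows that would not change.
UPDATE mytable
SET    myfield = replace(myfield, U&'\0050', U&'\0060')
WHERE  myfield <> replace(myfield, U&'\0050', U&'\0060');

-- translate() maps characters by position: the n-th character of the
-- second argument is replaced by the n-th character of the third.
SELECT translate('PQ', U&'\0050\0051', U&'\0060\0061');  -- P -> `, Q -> a
```

replace() is the natural choice when the search string is longer than one character, while translate() handles several single-character substitutions in one pass.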