Replace characters with multi-character strings

2019-02-25 13:12发布

问题:

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.

Is there a more elegant way to do that? Actually it looks like that (not completed yet):

SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;

On my way trying different commands I got another problem:

When I searched for Ü I got this:

ERROR: invalid byte sequence for encoding "UTF8": 0xdc27

Tried it with U&'\0220', it didn't replace anything. Only by using ü (for lowercase ü) it was replaced correctly. Has to do something with unicode, but how to solve this issue?

Kind regards from Germany. :)

回答1:

Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:

SHOW client_encoding;   -- in your actual session

And read this related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL

The rest of the tool chain has to be in sync, too. When using puTTY, for instance, one has to make sure, the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.

As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.

As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.

Related:

  • Replace unicode characters in PostgreSQL
  • Regex remove all occurrences of multiple characters in a string


回答2:

Beside other reasons I decided to write the replacement in python. Like Erwin wrote before, it seems there is no better solution as combining replace- commands.

In general pretty simple, even no encoding had to benn used. My "final" solution now looks like this:

ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"

dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"

cur.execute("""Select addr, REPLACE (REPLACE (REPLACE( REPLACE (REPLACE (REPLACE (REPLACE(addr, '%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
  from search WHERE x = '1';"""%(ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))

I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.

Thank you for your replies and kind regards.