Postgresql levenshtein and precomposed character v

2019-07-10 14:29发布

I have Strings containing two similar looking characters. Both appear as small 'a's with an ogonek:

ą

(Note: depending on the renderer they are sometimes rendered similarily, sometimes slightly differently)

However, they are different:

Characteristics of the 1st character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
261

The UTF-8-encoding in Hex is: \xC4\x85

so it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character)

Characteristics of the 2nd character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
97

(same as character 'a')

That strongly indicates that the rendered character is combined out of two characters. And it is indeed:

The UTF-8-encoding in Hex is: \x61\xCC\xA8

So it is a combination of

a \x61\

and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:

̨ \xCC\xA8

I want to use PostgreSQL's levenshtein function to determine the similarity of words, and so I want treat both characters as the same (as it is of course intended by people who write the name of a distinctive entity either with the 1st or the 2nd character).

I assumed that I can use unaccent to always get rid of the ogonek, but that is not working in the 2nd case:

1st character: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       1

1st character: expected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       1

2nd character: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       2

2nd character: unexpected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       2

So, when I compare both characters with levenshtein and unaccent, the result is 1:

select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein 
-------------
       1

instead of 0.

How can I "get rid of the ogonek" in the 2nd case?

(How) can I use the UTF-8 codes of Strings to get the achieved result?

Edit: As @s-man suggested, adding the combining character to unaccent.rules would solve this particular problem. But to generally solve the precomposed character vs. combined character problem with unaccent, I would have to explicitly add/modify every missing/"misconfigured" combined character to/in the config.

3条回答
Explosion°爆炸
2楼-- · 2019-07-10 14:49

You have to change your configuration and add the missing characters manually at the config file as described in https://postgresql.org/docs/current/unaccent.html

查看更多
Evening l夕情丶
3楼-- · 2019-07-10 14:52

Removing accents will give you a Levenshtein distance of 0, but it will also give you a distance of 0 between ą and a, which does not sound ideal.

The better solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence E'a\u0328' into the precomposed character E'\u0105' before comparing them.

Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.

For example:

create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;

And then:

test=# select levenshtein(unicode_normalize(E'a\u0328'), unicode_normalize(E'\u0105'));
 levenshtein
-------------
           0

This also solves the issue in your previous question, where the combining character was contributing to the Levenshtein distance:

test=# select levenshtein(unicode_normalize(E'a\u0328'), 'x');
 levenshtein
-------------
           1
查看更多
来,给爷笑一个
4楼-- · 2019-07-10 14:53

Note: This solution is based on @S-Man's suggestion to explicitly add missing characters to the unaccent.rules file.

Note: A prerequisite of this answer is that the relevant precomposed characters (https://en.wikipedia.org/wiki/Precomposed_character) are already mapped in the unaccent.rules file. If not, they have to be added also.

There are characters which are composed of multiple characters:

The goal is to map a "multiple character" character on the containing "basic" character.

(assuming the corresponding precomposed characters are mapped to the "basic" character, which is the case in the original unaccent.rules file)

unaccent checks every character in a "multiple character" character for replacement, so it is not necessary to consider every combination of basic character and diacritic.

Instead the diacritics have to be mapped on [nothing]. This can be achieved by leaving the second column in the unaccent.rules file (https://postgresql.org/docs/current/unaccent.html) empty.

This is a list of diacritics for the Latin alphabet obtained from https://en.wikipedia.org/wiki/Diacritic: ´ ˝ ` ̏ ˆ ˇ ˘ ̑ ¸ ¨ · ̡ ̢ ̉ ̛ ͅ ˉ ˛ ͂ ˚ ˳ ῾ ᾿

Add to that the ogonek of the question, which is missing: ̨

Now (after a PostgreSQL restart, of course), unaccent maps "multiple character" characters on the "basic" character, as it does with precomposed characters.

Note: The above list may not be comprehensive, but should at least solve a good part of the "precomposed character vs. combined character" problem.

查看更多
登录 后发表回答