I have strings containing two similar-looking characters. Both appear as a small 'a' with an ogonek:
ą
ą
(Note: depending on the renderer, they are sometimes rendered similarly, sometimes slightly differently.)
However, they are different:
Characteristics of the 1st character:
In PostgreSQL:
select ascii('ą');
ascii
-------
261
The UTF-8 encoding in hex is \xC4\x85, so it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character).
Characteristics of the 2nd character:
In PostgreSQL:
select ascii('ą');
ascii
-------
97
(same as character 'a')
Since ascii() returns the code of only the first character of its argument, this strongly indicates that the rendered glyph is built from two characters. And it is indeed:
The UTF-8 encoding in hex is \x61\xCC\xA8.
So it is a combination of
a \x61
and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:
̨ \xCC\xA8
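Both byte sequences can be verified directly in PostgreSQL, e.g. with convert_to (using the E'\u....' escape syntax to make explicit which character is meant; the output shown assumes the default hex bytea output):

select convert_to(E'\u0105', 'UTF8');
convert_to
------------
\xc485

select convert_to(E'a\u0328', 'UTF8');
convert_to
------------
\x61cca8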
I want to use PostgreSQL's levenshtein function to determine the similarity of words, so I want to treat both characters as the same (which is, of course, what people intend when they write the name of one and the same entity with either the 1st or the 2nd character).
I assumed that I can use unaccent to always get rid of the ogonek, but that is not working in the 2nd case:
1st character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
1
1st character with unaccent: expected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
1
2nd character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
2nd character with unaccent: unexpected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
2
So, when I compare both characters with levenshtein and unaccent, the result is 1:
select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein
-------------
1
instead of 0.
How can I "get rid of the ogonek" in the 2nd case?
(How) can I use the UTF-8 codes of the strings to achieve the desired result?
Edit: As @s-man suggested, adding the combining character to unaccent.rules would solve this particular problem. But to solve the precomposed vs. combined character problem with unaccent in general, I would have to explicitly add or fix every missing or "misconfigured" combining character in the config.
You have to change your configuration and add the missing characters manually in the config file, as described in https://postgresql.org/docs/current/unaccent.html
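For the character from the question, that means adding a rule line for the combining ogonek (U+0328) whose second column is left empty, so the character is simply removed (a sketch of the rule format only; see the linked documentation):

̨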
Removing accents will give you a Levenshtein distance of 0, but it will also give you a distance of 0 between ą and a, which does not sound ideal.

The better solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence E'a\u0328' into the precomposed character E'\u0105' before comparing them.

Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
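(A sketch using PL/Perl; the function name normalize_unicode is an arbitrary choice, and the untrusted plperlu language is used so that the Unicode::Normalize module can be loaded.)

create extension plperlu;

create or replace function normalize_unicode(text) returns text
language plperlu immutable strict as $$
  use Unicode::Normalize;
  # NFC composes combining sequences into their precomposed equivalents
  return NFC(shift);
$$;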
And then:
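(Reusing the hypothetical normalize_unicode function from above; after NFC normalisation both spellings are the same string, so the distance should be 0.)

select levenshtein(normalize_unicode(E'a\u0328'), normalize_unicode(E'\u0105'));
levenshtein
-------------
0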
This also solves the issue in your previous question, where the combining character was contributing to the Levenshtein distance:
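(Same assumptions as above; after normalisation the combining sequence collapses to a single character, so the distance to 'x' should be 1 instead of 2.)

select levenshtein(normalize_unicode(E'a\u0328'), 'x');
levenshtein
-------------
1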
There are characters that are composed of multiple characters, such as the ą from the question (a + combining ogonek).

The goal is to map such a "multiple character" character onto the "basic" character it contains (assuming the corresponding precomposed characters are already mapped to the "basic" character, which is the case in the original unaccent.rules file).

unaccent checks every character within a "multiple character" character for replacement, so it is not necessary to add a rule for every combination of basic character and diacritic.
Instead, the diacritics themselves have to be mapped to [nothing]. This can be achieved by leaving the second column in the unaccent.rules file (https://postgresql.org/docs/current/unaccent.html) empty.

This is a list of diacritics for the Latin alphabet, obtained from https://en.wikipedia.org/wiki/Diacritic: ´ ˝ ` ̏ ˆ ˇ ˘ ̑ ¸ ¨ · ̡ ̢ ̉ ̛ ͅ ˉ ˛ ͂ ˚ ˳ ῾ ᾿
Add to that the combining ogonek from the question, which is missing from that list: ̨
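The added lines in unaccent.rules then consist of just the diacritic, with the second column left empty, e.g. (a sketch showing only the combining ogonek and the first two entries of the list; one rule per line):

̨
´
˝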
Now (after a PostgreSQL restart, of course), unaccent maps "multiple character" characters to the "basic" character, just as it does with precomposed characters.
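A quick check (a sketch, assuming the modified rules file is in place; both spellings should now unaccent to a plain a):

select levenshtein(unaccent(E'a\u0328'), unaccent(E'\u0105'));
levenshtein
-------------
0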
Note: The above list may not be comprehensive, but should at least solve a good part of the "precomposed character vs. combined character" problem.