When I use the fuzzystrmatch levenshtein function with diacritic characters, it returns a wrong, multibyte-ignorant result:
select levenshtein('ą', 'x');

 levenshtein
-------------
           2
(Note: the first character is an 'a' with a diacritic below; it may not render properly after being copied here.)
The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:
At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
But since it does not name the levenshtein function, I was wondering whether there is a multibyte-aware version of levenshtein.
I know that I could use the unaccent function as a workaround, but I need to keep the diacritics.
The 'a' with a diacritic is a combining character sequence, i.e. the base letter 'a' followed by a combining character, the diacritic U+0328 (combining ogonek):
E'a\u0328'
There is an equivalent precomposed character ą:
E'\u0105'
A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character (NFC normalisation) before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function (one, normalize(), was only added later, in PostgreSQL 13), but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
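A PL/Python wrapper around Python's standard unicodedata module might look like the following sketch; it assumes the plpython3u extension is available, and uses NFC, the normalisation form that composes combining sequences:

```sql
-- Sketch: requires the PL/Python extension
-- CREATE EXTENSION plpython3u;
CREATE OR REPLACE FUNCTION unicode_normalize(s text) RETURNS text AS $$
import unicodedata
# NFC replaces a combining character sequence with its
# precomposed equivalent where one exists
return unicodedata.normalize('NFC', s)
$$ LANGUAGE plpython3u IMMUTABLE STRICT;
```

The same idea works in PL/Perl via the core Unicode::Normalize module.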
Now, as the character sequence E'a\u0328' is mapped onto the equivalent precomposed character E'\u0105' by using unicode_normalize, the levenshtein distance is correct:
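For instance, assuming unicode_normalize performs NFC normalisation as described, a query of this shape should count the normalised string as a single character:

```sql
select levenshtein(unicode_normalize(E'a\u0328'), 'x');

 levenshtein
-------------
           1
```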