How can we remove diacritic marks from strings in BigQuery using the new normalize
function such as:
café
To result:
cafe
How can we remove diacritic marks from strings in BigQuery using the new normalize
function such as:
café
To result:
cafe
It's actually quite simple after you understand what normalize is doing:
WITH data AS(
SELECT 'Ãâíüçãõ' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result,
REGEXP_REPLACE(NORMALIZE(text, NFKD), r'\pM', '') nfkd_result
FROM data
Results:
Row nfd_result nfkd_result
1 Aaiucao Aaiucao
You can use either the options "NFD" or "NFKD" and, for the most part, it should work (still you should understand the differences between both options to better deal with your data).
Basically what normalize does is it converts all unicodes in a string to its canonical equivalent (or compatible form) so that we have equivalent reference for comparisons (now understanding this already requires knowing some concepts).
The point is, unicode not only establishes the mapping between numbers (their code points represented by U+) and their glyphs but also some rules of how these points might interact among themselves.
For instance, let's take the glyph á
.
We don't have just one unicode for this character. We actually can represent it either like U+00E1
or like U+0061U+0301
which is the unicodes for a
and ´
.
Yeap! Unicode is defined in a way such that you can combine characters and diacritics and represent their union by just ordering one after the other.
In fact, you can play around with combining diacritics in Unicode using an online conversor:
Unicode defines these types of characters that can combine themselves to diacritics as precomposed characters by using a clever and simple idea: characters that are not precomposed have what is called a 0 (zero) combining class; points that can combine receive a positive combining class (for instance, ´
has class 230) which is used to assert how the final glyph should be represented.
This is quite cool but ends up creating a problem which explains the function normalize we've been discussing since the beginning: if we read two strings, one with unicode U+0061U+0301
and other with U+00E1
(both á
), they should be considered equivalent! In fact, it's the same glyph represented in different ways.
This is precisely what normalize
is doing. Unicode defines a canonical form for each character so that, when normalized, the end result should be such that if we have two strings with distinct code points for same glyph, we still can see both as equal.
Well, there are basically 2 main possibilities for how we can normalize code points: either composing different unicodes into just one (in our example this would be transforming U+0061U+0301
into U+00E1
) or we can decompose (which would be the other way around, transforming U+00E1
into U+0061U+0301
).
Here you can see it more clearly:
NF means the canonical equivalent. NFC means to retrieve the canonical composite character (united); NFD is the opposite, decomposes the character.
You can use this information to play around in BigQuery:
WITH data AS(
SELECT 'Amélie' AS text
)
SELECT
text,
TO_CODE_POINTS(NORMALIZE(text, NFC)) nfc_result,
TO_CODE_POINTS(NORMALIZE(text, NFD)) nfd_result
FROM data
Which results:
Notice the nfd
column has one more code point. By now you already know what that is: ´
separated from the e
.
If you read BigQuery's documentation for normalize, you'll see it also has support for the types NFKC and NFKD. This type (with letter K) does not normalize by canonical equivalence but rather by "compatibility", that is, it breaks some characters into its constituents letters as well, not only diacritics:
The letter ffi
(which is not the same as ffi. This type of character is known as ligature) is decomposed also by the letters that constitutes it (and therefore equivalence is lost as ffi may not be the same as ffi for some applications, hence the name compatibility form).
Now that we know how to decompose characters into the main glyph followed by its diacritic, we can use a regex
to match only them to remove from the string (which is accomplished by the expression \pM
which matches diacritics marks only):
WITH data AS(
SELECT 'café' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result
FROM data
And that's pretty much all there is (hopefully) to the normalize function and how it's used to remove diacritics. All this information I found thanks to user sigpwned and his answer to this question. As I tried it and it didn't quite work I decided to study some of the theory behind the methods and wanted to write it down :). Hopefully it'll be useful for more people as it definitely was for me.