How to Remove Diacritic Marks (such as Accents) us

2020-07-16 12:56发布

问题:

How can we remove diacritic marks from strings in BigQuery using the new normalize function such as:

café

To result:

cafe

回答1:

The Short Answer

It's actually quite simple after you understand what normalize is doing:

WITH data AS(
  SELECT 'Ãâíüçãõ' AS text
)

SELECT
  REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result,
  REGEXP_REPLACE(NORMALIZE(text, NFKD), r'\pM', '') nfkd_result
FROM data

Results:

Row   nfd_result    nfkd_result  
1     Aaiucao       Aaiucao  

You can use either the options "NFD" or "NFKD" and, for the most part, it should work (still you should understand the differences between both options to better deal with your data).


The More Complete Answer

Basically what normalize does is it converts all unicodes in a string to its canonical equivalent (or compatible form) so that we have equivalent reference for comparisons (now understanding this already requires knowing some concepts).

The point is, unicode not only establishes the mapping between numbers (their code points represented by U+) and their glyphs but also some rules of how these points might interact among themselves.

For instance, let's take the glyph á.

We don't have just one unicode for this character. We actually can represent it either like U+00E1 or like U+0061U+0301 which is the unicodes for a and ´.

Yeap! Unicode is defined in a way such that you can combine characters and diacritics and represent their union by just ordering one after the other.

In fact, you can play around with combining diacritics in Unicode using an online conversor:

Unicode defines these types of characters that can combine themselves to diacritics as precomposed characters by using a clever and simple idea: characters that are not precomposed have what is called a 0 (zero) combining class; points that can combine receive a positive combining class (for instance, ´ has class 230) which is used to assert how the final glyph should be represented.

This is quite cool but ends up creating a problem which explains the function normalize we've been discussing since the beginning: if we read two strings, one with unicode U+0061U+0301 and other with U+00E1 (both á), they should be considered equivalent! In fact, it's the same glyph represented in different ways.

This is precisely what normalize is doing. Unicode defines a canonical form for each character so that, when normalized, the end result should be such that if we have two strings with distinct code points for same glyph, we still can see both as equal.

Well, there are basically 2 main possibilities for how we can normalize code points: either composing different unicodes into just one (in our example this would be transforming U+0061U+0301 into U+00E1) or we can decompose (which would be the other way around, transforming U+00E1 into U+0061U+0301).

Here you can see it more clearly:

NF means the canonical equivalent. NFC means to retrieve the canonical composite character (united); NFD is the opposite, decomposes the character.

You can use this information to play around in BigQuery:

WITH data AS(
  SELECT 'Amélie' AS text
)

SELECT
  text,
  TO_CODE_POINTS(NORMALIZE(text, NFC)) nfc_result,
  TO_CODE_POINTS(NORMALIZE(text, NFD)) nfd_result
FROM data

Which results:

Notice the nfd column has one more code point. By now you already know what that is: ´ separated from the e.

If you read BigQuery's documentation for normalize, you'll see it also has support for the types NFKC and NFKD. This type (with letter K) does not normalize by canonical equivalence but rather by "compatibility", that is, it breaks some characters into its constituents letters as well, not only diacritics:

The letter (which is not the same as ffi. This type of character is known as ligature) is decomposed also by the letters that constitutes it (and therefore equivalence is lost as ffi may not be the same as ffi for some applications, hence the name compatibility form).

Now that we know how to decompose characters into the main glyph followed by its diacritic, we can use a regex to match only them to remove from the string (which is accomplished by the expression \pM which matches diacritics marks only):

WITH data AS(
  SELECT 'café' AS text
)

SELECT
  REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result
FROM data

And that's pretty much all there is (hopefully) to the normalize function and how it's used to remove diacritics. All this information I found thanks to user sigpwned and his answer to this question. As I tried it and it didn't quite work I decided to study some of the theory behind the methods and wanted to write it down :). Hopefully it'll be useful for more people as it definitely was for me.