Is there a common regular expression that replaces all known special characters in non-English languages:
é, ô, ç, etc.
with English characters:
e, o, c, etc.
No, there is no such regex. Note that with a regex you "describe" a specific piece of text.
A given regex implementation might let you do replacements using a regex, but such a replacement usually substitutes one replacement string for every match: it does not replace `a` with `a'` and `b` with `b'`, etc.
Perhaps the language you're working with has a method in its API to perform this kind of replacement, but it won't be using regex.
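As one illustration of such a non-regex API method, Python's built-in `str.translate` applies a whole table of single-character substitutions in one pass. This is only a minimal sketch: the names `REPLACEMENTS` and `replace_chars` are hypothetical, and the three mappings are assumptions taken from the question's examples, not a complete or linguistically sound table.

```python
# Hypothetical, deliberately tiny mapping; a real table would have to be
# written out by hand for every character you care about.
REPLACEMENTS = str.maketrans({
    "\u00e9": "e",  # é
    "\u00f4": "o",  # ô
    "\u00e7": "c",  # ç
})

def replace_chars(text: str) -> str:
    """Apply every single-character substitution in REPLACEMENTS in one pass."""
    return text.translate(REPLACEMENTS)

print(replace_chars("fa\u00e7ade, c\u00f4t\u00e9"))  # prints "facade, cote"
```

Characters that are not in the table pass through unchanged, so the result is only as good as the table itself.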
This task is what the `iconv` library is for. Find out how to use it in whichever language you're developing in. Chances are your library already has a binding for it.
¡⅁uoɹʍ puɐ ⅂IɅƎ
This cannot be done, and you should not want to do it! It’s offensive to the whole world, and it’s naïve to the point of ignorance to believe that façade rhymes with arcade, or that Cañon City, Colorado falls under canon law.
You could run the string through Unicode Normalization Form D and discard mark characters, but I am certainly not going to tell you how because it is evil and wrong. It is evil for reasons already outlined, and it is wrong because there are a zillion cases it doesn’t address at all.
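One way to see how many cases that approach leaves untouched is to ask which letters have a canonical decomposition at all. The sketch below uses Python's standard `unicodedata` module; the helper name `decomposes_under_nfd` and the sample letters are illustrative assumptions. It does not strip anything. It only reports whether Normalization Form D would split a character into a base letter plus marks.

```python
import unicodedata

def decomposes_under_nfd(ch: str) -> bool:
    """Return True if NFD rewrites ch as a different code point sequence."""
    return unicodedata.normalize("NFD", ch) != ch

# Letters with a canonical decomposition (base letter + combining mark).
for ch in "\u00e9\u00f4\u00e7":  # é ô ç
    print(unicodedata.name(ch), decomposes_under_nfd(ch))  # True for each

# Letters that are atomic in Unicode: NFD leaves them exactly as they are.
for ch in "\u00f8\u00df\u00e6\u0142\u0111":  # ø ß æ ł đ
    print(unicodedata.name(ch), decomposes_under_nfd(ch))  # False for each
```

Letters such as ø, ß, æ, ł and đ carry no separable mark, so nothing is there to discard.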
Study Material
Here is what you need to read up on:
You MUST learn how to compare strings in a way that makes sense, and mutilating them simply never makes any sense whatso [pəʇələp] ever.
You must never just compare unnormalized strings code point by code point, and if possible you need to take the language into account, since rules differ between them.
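To make the point about unnormalized comparison concrete, here is a small Python sketch using the standard `unicodedata` module. The function name `canonically_equal` is a hypothetical helper. It only checks canonical equivalence after normalization (UAX#15); it does not do the language-aware collation described in UTS#10.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after putting both into Normalization Form D."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "caf\u00e9"   # 'é' as one code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True
```

Both strings render identically, yet a raw code point comparison treats them as different; normalizing both sides first removes that accident of encoding.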
Practical Examples
No matter the programming language you’re using, it may also help you to read the documentation for Perl’s Unicode::Normalize, Unicode::Collate, and Unicode::Collate::Locale modules.
For example, to search for "MÜSS" in a text that has "muß" in it, you would do this:
That will put "muß" into `$match`.
The Unicode::Collate::Locale module has support for tailoring to these locales:
You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong.
Doing it right means taking UAX#15 and UTS#10 into account.
Nothing less is acceptable in this day and age. It’s not the 1960s any more, you know!