I am trying to write a function that translates a string containing Unicode characters into some default ASCII transcription. Ideally, I'd like e.g. Ångström to become Angstroem or, if that is not possible, Angstrom. Likewise, α=χ should become a=x (c?) or similar.
Does Emacs have such built-in capabilities? I know I can get the names and other properties of characters (get-char-code-property), but I know of no built-in transcription table.
The purpose is to translate titles of entries into meaningfully readable filenames, avoiding problems with software that doesn't understand Unicode.
My current strategy is to build a translation-table by hand, but this approach is fairly limited and requires a lot of maintenance.
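For reference, the hand-built approach looks roughly like the sketch below. The names my-ascii-transcriptions and my-transcribe-to-ascii are hypothetical, and the table entries are only examples; every character that should be transcribed needs its own entry, which is exactly the maintenance problem.

```elisp
;; Hypothetical hand-maintained table; the entries are examples only.
(defvar my-ascii-transcriptions
  '((?Å . "A") (?å . "a") (?ä . "ae") (?ö . "oe") (?ü . "ue")
    (?ß . "ss") (?α . "a") (?χ . "x"))
  "Alist mapping non-ASCII characters to ASCII replacement strings.")

(defun my-transcribe-to-ascii (string)
  "Return STRING with characters replaced per `my-ascii-transcriptions'.
Characters without an entry in the table are kept unchanged."
  (mapconcat (lambda (ch)
               ;; Look the character up; fall back to the character itself.
               (or (cdr (assq ch my-ascii-transcriptions))
                   (char-to-string ch)))
             string ""))

;; Assumes the input uses precomposed characters (e.g. a single Å,
;; not A followed by a combining ring).
(my-transcribe-to-ascii "Ångström") ; => "Angstroem"
(my-transcribe-to-ascii "α=χ")      ; => "a=x"
```

Anything missing from the table passes through untouched, so non-ASCII characters can still leak into filenames.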
There is no built-in capability that I know of. I wrote a package, unidecode, specifically for this task. It uses the same approach as the Python library of the same name. To install it, add the MELPA repository to your package archive list:
(require 'package)
(add-to-list 'package-archives
             '("melpa" . "https://melpa.org/packages/") t)
Then run M-x package-install RET unidecode. unidecode has two functions: unidecode-unidecode, which turns Unicode into ASCII, and unidecode-sanitize, which discards non-alphanumeric characters and transforms spaces into hyphens.
ELISP> (unidecode-unidecode "¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა")
"!Hola!, Gruss Gott, Hyvaa paivaa, Tere ohtust, Bongu Czesc!, Dobry den, Zdravstvuite!, Geia sas, lmsllmlllmckhmslmgll"
ELISP> (unidecode-sanitize "¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა")
"hola-gruss-gott-hyvaa-paivaa-tere-ohtust-bongu-czesc-dobry-den-zdravstvuite-geia-sas-lmsllmlllmckhmslmgll"
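For the filename use case from the question, a thin wrapper around unidecode-sanitize may be enough. This is only a sketch: my-title-to-filename and its default ".org" extension are hypothetical and not part of the package; the only package function it relies on is unidecode-sanitize as described above.

```elisp
(require 'unidecode)

;; Hypothetical helper, not part of unidecode itself.
(defun my-title-to-filename (title &optional extension)
  "Return an ASCII-only filename derived from TITLE.
EXTENSION defaults to \".org\" (an arbitrary choice for this sketch)."
  (concat (unidecode-sanitize title) (or extension ".org")))

;; Example call; the exact result depends on unidecode's tables.
(my-title-to-filename "Ångström und α=χ")
```

Because unidecode-sanitize discards non-alphanumeric characters, two different titles can map to the same filename, so a caller may want to check for an existing file before writing.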