I have a file containing many vowels with diacritics. I need to make these replacements:
- Replace ā, á, ǎ, and à with a.
- Replace ē, é, ě, and è with e.
- Replace ī, í, ǐ, and ì with i.
- Replace ō, ó, ǒ, and ò with o.
- Replace ū, ú, ǔ, and ù with u.
- Replace ǖ, ǘ, ǚ, and ǜ with ü.
- Replace Ā, Á, Ǎ, and À with A.
- Replace Ē, É, Ě, and È with E.
- Replace Ī, Í, Ǐ, and Ì with I.
- Replace Ō, Ó, Ǒ, and Ò with O.
- Replace Ū, Ú, Ǔ, and Ù with U.
- Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.
I know I can replace them one at a time with this:
sed -i 's/ā/a/g' ./file.txt
Is there a more efficient way to replace all of these?
The iconv man page describes the //TRANSLIT suffix:
//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the
target character set, it can be approximated through one or several similarly looking characters.
So we could do the following (ASCII has no ü, so the ǖ and Ǖ rows come out as plain u and U rather than ü and Ü):
kent$ cat test1
Replace ā, á, ǎ, and à with a.
Replace ē, é, ě, and è with e.
Replace ī, í, ǐ, and ì with i.
Replace ō, ó, ǒ, and ò with o.
Replace ū, ú, ǔ, and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě, and È with E.
Replace Ī, Í, Ǐ, and Ì with I.
Replace Ō, Ó, Ǒ, and Ò with O.
Replace Ū, Ú, Ǔ, and Ù with U.
Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.
kent$ iconv -f utf8 -t ascii//TRANSLIT test1
Replace a, a, a, and a with a.
Replace e, e, e, and e with e.
Replace i, i, i, and i with i.
Replace o, o, o, and o with o.
Replace u, u, u, and u with u.
Replace u, u, u, and u with u.
Replace A, A, A, and A with A.
Replace E, E, E, and E with E.
Replace I, I, I, and I with I.
Replace O, O, O, and O with O.
Replace U, U, U, and U with U.
Replace U, U, U, and U with U.
This might work for you:
sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file
I like iconv, as it handles all accent variations:
cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt
This is what the tr(1) command is for. For example:
tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile
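You may have to check or change your LANG environment variable to match the character set being used. Note that GNU tr works on bytes rather than characters, so this approach is only reliable with a single-byte character set such as ISO-8859-1, not with UTF-8 input.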
You can use something like this:
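sed -e 's/[àâ]/a/g;s/[ọõ]/o/g;s/[íì]/i/g;s/[êệ]/e/g'
Just add more characters inside the [...] as needed. Do not separate them with commas: a comma inside the brackets is matched literally, so it would be replaced too.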
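For the exact mapping in the question (which keeps ü and Ü), the same idea can be written with alternation instead of bracket expressions. Alternation matches the literal byte sequences, so it should not depend on sed running in a UTF-8 locale. This is a minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do) and input that uses precomposed characters; the function name strip_tone_marks is made up for illustration.

```shell
# strip_tone_marks: hypothetical helper; reads stdin, writes stdout.
# Each -e handles one row of the mapping from the question.
strip_tone_marks() {
  sed -E \
    -e 's/(ā|á|ǎ|à)/a/g' \
    -e 's/(ē|é|ě|è)/e/g' \
    -e 's/(ī|í|ǐ|ì)/i/g' \
    -e 's/(ō|ó|ǒ|ò)/o/g' \
    -e 's/(ū|ú|ǔ|ù)/u/g' \
    -e 's/(ǖ|ǘ|ǚ|ǜ)/ü/g' \
    -e 's/(Ā|Á|Ǎ|À)/A/g' \
    -e 's/(Ē|É|Ě|È)/E/g' \
    -e 's/(Ī|Í|Ǐ|Ì)/I/g' \
    -e 's/(Ō|Ó|Ǒ|Ò)/O/g' \
    -e 's/(Ū|Ú|Ǔ|Ù)/U/g' \
    -e 's/(Ǖ|Ǘ|Ǚ|Ǜ)/Ü/g'
}

printf '%s\n' 'Nǐ hǎo, lǜsè' | strip_tone_marks   # prints: Ni hao, lüse
```

To edit a file in place instead, the same -e expressions can be passed to sed -i as in the question.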
You can use man iso_8859_1 (or the page for your character set) or od -bc to identify the octal representation of the diacritic. Then use gawk to do the replacing:
{ gsub(/\344/, "a"); print $0 }
This replaces ä with a.
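The octal value 344 is the ISO-8859-1 encoding of ä; in a UTF-8 file the same character occupies two bytes, so a pattern built from \344 will not find it. A quick sketch for checking which bytes a character actually uses, assuming this snippet itself is saved as UTF-8; octal_bytes is a hypothetical helper name.

```shell
# octal_bytes: hypothetical helper; prints the bytes read from stdin
# as space-separated octal values on one line.
octal_bytes() {
  od -An -t o1 | xargs
}

printf 'ä' | octal_bytes      # UTF-8 encoding of ä: 303 244
printf '\344' | octal_bytes   # ISO-8859-1 encoding of ä: 344
```

If the output shows two bytes per accented character, the file is most likely UTF-8 and the single-byte pattern needs adjusting.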
This may not work unless your locale is set correctly. Set LC_ALL to match the file's encoding, for example:
export LC_ALL=en_US.iso88591
Note that the full list of locales is available through:
locale -a
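The locale also matters for the sed y command from the earlier answer: with UTF-8 input, GNU sed at least needs a UTF-8 locale to treat each accented vowel as a single character. Here is a sketch that picks one from the locale -a list, assuming at least one UTF-8 locale is installed; both function names are hypothetical.

```shell
# utf8_locale: hypothetical helper; prints the first UTF-8 locale
# reported by `locale -a`, or nothing if none is installed.
utf8_locale() {
  locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1
}

# strip_tones_y: hypothetical wrapper; runs the y mapping under that
# locale, reading stdin and writing stdout.
strip_tones_y() {
  loc=$(utf8_locale)
  if [ -z "$loc" ]; then
    echo 'no UTF-8 locale found' >&2
    return 1
  fi
  LC_ALL="$loc" sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
}

printf '%s\n' 'Nǐ hǎo, lǜ' | strip_tones_y
```

Setting LC_ALL only for the one sed invocation leaves the rest of the shell session unchanged.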
If you, like me, need to replace accents only in specific places in the file, such as the value of one JSON key, you can anchor the substitution with a regex like this. The captured prefix is [^"]*, so each match stays inside the quoted value and does not depend on how the locale treats ranges such as a-z:
echo '{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"bábögêjírù","anotherNotReplaceKey":"bábögêjírù"}' \
| sed -e ':a;s/replaceValueKey":"\([^"]*\)[áâàãä]/replaceValueKey":"\1a/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([^"]*\)[éêèë]/replaceValueKey":"\1e/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([^"]*\)[íîìï]/replaceValueKey":"\1i/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([^"]*\)[óôòõö]/replaceValueKey":"\1o/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([^"]*\)[úûùü]/replaceValueKey":"\1u/g;ta'
Output
{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"babogejiru","anotherNotReplaceKey":"bábögêjírù"}