How to replace Unicode characters with ASCII

Posted 2020-06-17 15:00

I have the following command to replace Unicode characters with ASCII ones.

sed -i 's/Ã/A/g'

The problem is that sed in my Unix environment doesn't recognize Ã, so I assume I need to replace it with its hexadecimal value. What would the syntax look like if I used C3 instead?

I'm using this command as a template for other characters I'd like to replace with blank spaces, such as:

sed -i 's/©/ /g'

Tags: bash shell unix unicode sed

4 answers

冷血范

#2 · 2020-06-17 15:11

It is possible to use hex values in sed (the \x escape is a GNU sed extension). First, check which bytes the character is made of:

echo "Ã" | hexdump -C
00000000  c3 83 0a                                          |...|
00000003

So in UTF-8 that character is the two-byte sequence "c3 83", not C3 alone (the trailing 0a is the newline added by echo). Let's replace it with the single byte "A":

echo "Ã" |sed 's/\xc3\x83/A/g'
A

Explanation: \x tells sed that a two-digit hex code for one byte follows.
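The same approach covers the © case from the question, whose UTF-8 encoding is the two bytes c2 a9. The following is only a sketch: the function name to_ascii is made up for illustration, and it assumes GNU sed (for the \x escapes). The input is built with printf octal escapes so the example does not depend on the terminal's encoding.

```shell
# Hypothetical helper: maps Ã to "A" and © to a space, reading stdin.
# Assumes GNU sed; LC_ALL=C makes sed treat the input as raw bytes.
to_ascii() {
    LC_ALL=C sed -e 's/\xc3\x83/A/g' -e 's/\xc2\xa9/ /g'
}

# \303\203 are the bytes c3 83 (Ã); \302\251 are the bytes c2 a9 (©).
printf 'S\303\203O PAULO \302\251 2020\n' | to_ascii
```

To edit a file in place, as in the question, the same two expressions can be passed to sed -i followed by the file name.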

Deceive 欺骗

#3 · 2020-06-17 15:13

There is also uconv, from ICU.

Examples:

uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;": removes accents
uconv -x "::Latin; ::Latin-ASCII;": transliterates to Latin, then to ASCII
uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;": same transliteration, then removes any remaining code points above 0x7F
...

echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" gives: A l'ecole


Ridiculous、

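#4 · 2020-06-17 15:27

Try setting LANG=C so that sed works on bytes, then strip every byte in the non-ASCII range 0x80-0xFF (if LC_ALL is set in your environment it overrides LANG, so set LC_ALL=C instead):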
echo "hi ☠ there ☠" | LANG=C sed "s/[\x80-\xFF]//g"
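If GNU sed's \x escapes are not available, tr with octal escapes can delete the same byte range. This is a swapped-in alternative rather than the command from the answer above, and the function name strip_non_ascii is made up for illustration.

```shell
# Hypothetical helper: deletes every byte from 0x80 to 0xFF (octal 200-377).
# LC_ALL=C makes tr operate on single bytes instead of multibyte characters.
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377'
}

# \342\230\240 are the three UTF-8 bytes (e2 98 a0) of the skull sign.
printf 'hi \342\230\240 there \342\230\240\n' | strip_non_ascii
```

Like the sed version, this removes characters rather than transliterating them, so the spaces around each deleted character are left behind.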


混吃等死

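#5 · 2020-06-17 15:32

You can use iconv, which reads standard input when no file is given. The //translit suffix asks it to approximate characters that have no ASCII equivalent; the exact result depends on the iconv implementation and locale: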

iconv -f utf-8 -t ascii//translit

