Replace fullwidth punctuation characters with norm

This question already has an answer here:

Python: How can I replace full-width characters with half-width characters? 6 answers

file1 contains some ：s (fullwidth colons) that I'd like to turn into regular :s (ordinary ASCII colons). How do I do this in bash? Perhaps with a Python script?

Tags: python bash unicode

5 answers

放我归山

#2 · 2019-05-09 14:16

You might want to look into Python's unicodedata.normalize().

It allows you to take a unicode string, and normalize it to a specific form, for example:

unicodedata.normalize('NFKC', thestring)
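As a minimal Python 3 sketch of the same call (the helper name normalize_width and the sample string are made up for illustration), keep in mind that NFKC folds every compatibility character, not just fullwidth punctuation, so ligatures and similar characters change too:

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold compatibility characters (fullwidth forms included) using NFKC."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample containing two FULLWIDTH COLON characters (U+FF1A).
sample = "time\uFF1A 12\uFF1A30"
print(normalize_width(sample))  # time: 12:30
```

If that broader folding is not wanted, a targeted translation table is the safer choice.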

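Unicode Standard Annex #15 defines four normalization forms: NFD (canonical decomposition), NFC (canonical decomposition followed by canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility decomposition followed by canonical composition). Fullwidth characters are compatibility variants of their ASCII counterparts, so only the K forms (NFKC, NFKD) fold them.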

If you only want to replace specific characters, you could use unicode.translate().

>>> orig = u'\uFF1A:'
>>> table = {0xFF1A: u':'}
>>> print repr(orig)
u'\uff1a:'
>>> print repr(orig.translate(table))
u'::'
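In Python 3 the same idea applies to str.translate(). The following is a small sketch, assuming you only care about the fullwidth colon and comma; the sample string is invented for illustration:

```python
# Map only the characters you care about; everything else is left untouched.
# U+FF1A is FULLWIDTH COLON, U+FF0C is FULLWIDTH COMMA.
table = str.maketrans({"\uFF1A": ":", "\uFF0C": ","})

orig = "a\uFF1Ab\uFF0Cc\u3002"  # ends with IDEOGRAPHIC FULL STOP, not in the table
converted = orig.translate(table)
print(ascii(converted))  # 'a:b,c\u3002'
```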


我欲成王，谁敢阻挡

#3 · 2019-05-09 14:27

In Python 2.x you can use the unicode.translate method to map each Unicode code point to zero, one, or more code points:

replacement_string = original_string.translate(table)

The following code sets up a translation table that will map the full-width equivalents of all ASCII graphic characters to their ASCII equivalents:

# ! is 0x21 (ASCII) 0xFF01 (full); ~ is 0x7E (ASCII) 0xFF5E (full)
table = dict((x + 0xFF00 - 0x20, unichr(x)) for x in xrange(0x21, 0x7F))

(reference: see Wikipedia)

If you want to treat spaces similarly, do table[0x3000] = u' '
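For Python 3, where unichr and xrange no longer exist, the table could be built roughly as follows. This is a sketch only: the function names build_fullwidth_table, to_halfwidth and convert_file are invented here, and the file helper assumes the input is UTF-8 unless told otherwise.

```python
def build_fullwidth_table(include_space: bool = True) -> dict:
    """Map U+FF01..U+FF5E to ASCII 0x21..0x7E (offset 0xFF00 - 0x20 = 0xFEE0)."""
    table = {code + 0xFEE0: chr(code) for code in range(0x21, 0x7F)}
    if include_space:
        table[0x3000] = " "  # IDEOGRAPHIC SPACE -> ordinary space
    return table


FULLWIDTH_TABLE = build_fullwidth_table()


def to_halfwidth(text: str) -> str:
    """Replace fullwidth ASCII variants in text with their ASCII equivalents."""
    return text.translate(FULLWIDTH_TABLE)


def convert_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Copy src to dst line by line, converting fullwidth characters."""
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(to_halfwidth(line))


print(to_halfwidth("\uFF21\uFF22\uFF01\u3000\uFF1A"))  # AB! :
```

Calling convert_file("file1", "file1.new") would then write a converted copy next to the original rather than editing it in place.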


来，给爷笑一个

#4 · 2019-05-09 14:29

With all due respect, Python isn’t the right tool for this job; Perl is:

perl -CSAD -i.orig -pe 'tr[：][:]' file1

perl -CSAD -i.orig -pe 'tr[\x{FF1A}][:]' file1

perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH COLON}][:]' file1

perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH EXCLAMATION MARK}\N{FULLWIDTH QUOTATION MARK}\N{FULLWIDTH NUMBER SIGN}\N{FULLWIDTH DOLLAR SIGN}\N{FULLWIDTH PERCENT SIGN}\N{FULLWIDTH AMPERSAND}\N{FULLWIDTH APOSTROPHE}\N{FULLWIDTH LEFT PARENTHESIS}\N{FULLWIDTH RIGHT PARENTHESIS}\N{FULLWIDTH ASTERISK}\N{FULLWIDTH PLUS SIGN}\N{FULLWIDTH COMMA}\N{FULLWIDTH HYPHEN-MINUS}\N{FULLWIDTH FULL STOP}\N{FULLWIDTH SOLIDUS}][\N{EXCLAMATION MARK}\N{QUOTATION MARK}\N{NUMBER SIGN}\N{DOLLAR SIGN}\N{PERCENT SIGN}\N{AMPERSAND}\N{APOSTROPHE}\N{LEFT PARENTHESIS}\N{RIGHT PARENTHESIS}\N{ASTERISK}\N{PLUS SIGN}\N{COMMA}\N{HYPHEN-MINUS}\N{FULL STOP}\N{SOLIDUS}]' file1


淡お忘

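#5 · 2019-05-09 14:30

You could try tr, provided your tr handles multibyte characters (GNU tr works byte by byte, so a three-byte UTF-8 character such as ： is not translated correctly there):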

cat file.ext | tr "：" ":" > file_new.ext


Root（大扎）

#6 · 2019-05-09 14:31

I'd agree that Python is not the most effective tool for this purpose. While the options presented so far are good, sed is another good tool to have around:

sed -i 's/\xEF\xBC\x9A/:/g' file.txt

The -i option causes sed to edit the file in place, as in tchrist's perl example. Note that \xEF\xBC\x9A is the UTF-8 byte sequence for the code point U+FF1A (a single 0xFF1A code unit in UTF-16). This page is a useful reference in case you need to deal with different encodings of the same Unicode value.
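To check the byte values without a lookup table, a short Python 3 sketch like the one below may help. The helper replace_fullwidth_colon is a made-up name, and it assumes the data is UTF-8 encoded, just as the sed command does:

```python
FULLWIDTH_COLON = "\uFF1A"

# Encode the same code point in two encodings to compare the raw bytes.
utf8_bytes = FULLWIDTH_COLON.encode("utf-8")
utf16_bytes = FULLWIDTH_COLON.encode("utf-16-be")
print(utf8_bytes)   # b'\xef\xbc\x9a'
print(utf16_bytes)  # b'\xff\x1a'


def replace_fullwidth_colon(data: bytes) -> bytes:
    """Byte-level equivalent of the sed substitution, for UTF-8 input only."""
    return data.replace(utf8_bytes, b":")


print(replace_fullwidth_colon("a\uFF1Ab".encode("utf-8")))  # b'a:b'
```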
