Replace fullwidth punctuation characters with norm

Posted 2019-05-09 13:58

file1 contains some :s (that's a fullwidth colon) that I'd like to turn into regular :s (that's our regular colon). How do I do this in bash? Perhaps with a Python script?

5 answers
放我归山
#2 · 2019-05-09 14:16

You might want to look into Python's unicodedata.normalize().

It allows you to take a unicode string, and normalize it to a specific form, for example:

unicodedata.normalize('NFKC', thestring)

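The normalization forms defined in Unicode Standard Annex #15 are:

NFD   Canonical Decomposition
NFC   Canonical Decomposition, followed by Canonical Composition
NFKD  Compatibility Decomposition
NFKC  Compatibility Decomposition, followed by Canonical Composition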
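
As a quick check of the normalize() call above, here is a minimal sketch, assuming Python 3 (where every str is Unicode). The sample strings are made up for illustration. It also shows that NFKC is broader than punctuation, so it may change characters you did not intend to touch:

```python
import unicodedata

# Sample text with a fullwidth colon (U+FF1A) and a fullwidth
# exclamation mark (U+FF01); the text itself is only an illustration.
text = "key\uFF1Avalue\uFF01"

normalized = unicodedata.normalize("NFKC", text)
print(normalized)  # key:value!

# NFKC folds other compatibility characters too, for example
# CIRCLED DIGIT ONE (U+2460) becomes a plain "1".
circled = unicodedata.normalize("NFKC", "\u2460")
print(circled)  # 1
```

If only punctuation should change, a targeted translation table is probably the safer choice.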


If you only want to replace specific characters, you could use unicode.translate().

>>> orig = u'\uFF1A:'
>>> table = {0xFF1A: u':'}
>>> print repr(orig)
u'\uff1a:'
>>> print repr(orig.translate(table))
u'::'
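
The session above is Python 2. A rough Python 3 equivalent, as a sketch only, could use str.maketrans to build the table. The choice of the fullwidth comma as a second character is an assumption made for illustration:

```python
# Python 3 sketch: str.maketrans builds a code point -> replacement table.
# Only FULLWIDTH COLON (U+FF1A) and FULLWIDTH COMMA (U+FF0C) are mapped;
# every other character passes through unchanged.
table = str.maketrans("\uFF1A\uFF0C", ":,")

orig = "a\uFF1Ab\uFF0Cc:"
converted = orig.translate(table)
print(converted)  # a:b,c:
```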
我欲成王,谁敢阻挡
#3 · 2019-05-09 14:27

In Python 2.x you can use the unicode.translate method to translate a single Unicode codepoint to 0, 1 or more codepoints, using

replacement_string = original_string.translate(table)

The following code sets up a translation table that will map the full-width equivalents of all ASCII graphic characters to their ASCII equivalents:

# ! is 0x21 (ASCII) 0xFF01 (full); ~ is 0x7E (ASCII) 0xFF5E (full)
table = dict((x + 0xFF00 - 0x20, unichr(x)) for x in xrange(0x21, 0x7F))

(reference: see Wikipedia)

If you want to treat spaces similarly, do table[0x3000] = u' '
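
For readers on Python 3, where unichr and xrange no longer exist, the same table might be built as in the sketch below. The names FULLWIDTH_TO_ASCII and to_halfwidth are hypothetical, chosen only for this example:

```python
# Python 3 sketch of the same idea. Fullwidth forms sit at a fixed
# offset from ASCII: 0xFF01 - 0x21 == 0xFEE0.
OFFSET = 0xFF00 - 0x20

# Map fullwidth '!' (U+FF01) through fullwidth '~' (U+FF5E) to ASCII.
FULLWIDTH_TO_ASCII = {cp + OFFSET: chr(cp) for cp in range(0x21, 0x7F)}

# Optionally treat IDEOGRAPHIC SPACE (U+3000) as a regular space.
FULLWIDTH_TO_ASCII[0x3000] = " "


def to_halfwidth(text: str) -> str:
    """Return text with fullwidth ASCII variants replaced by ASCII."""
    return text.translate(FULLWIDTH_TO_ASCII)


# Fullwidth 'A', colon, ideographic space, fullwidth '1', fullwidth '!'
print(to_halfwidth("\uFF21\uFF1A\u3000\uFF11\uFF01"))  # A: 1!
```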

来,给爷笑一个
#4 · 2019-05-09 14:29

With all due respect, python isn’t the right tool for this job; perl is:

perl -CSAD -i.orig -pe 'tr[:][:]' file1

or

perl -CSAD -i.orig -pe 'tr[\x{FF1A}][:]' file1

or

perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH COLON}][:]' file1

or

perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH EXCLAMATION MARK}\N{FULLWIDTH QUOTATION MARK}\N{FULLWIDTH NUMBER SIGN}\N{FULLWIDTH DOLLAR SIGN}\N{FULLWIDTH PERCENT SIGN}\N{FULLWIDTH AMPERSAND}\N{FULLWIDTH APOSTROPHE}\N{FULLWIDTH LEFT PARENTHESIS}\N{FULLWIDTH RIGHT PARENTHESIS}\N{FULLWIDTH ASTERISK}\N{FULLWIDTH PLUS SIGN}\N{FULLWIDTH COMMA}\N{FULLWIDTH HYPHEN-MINUS}\N{FULLWIDTH FULL STOP}\N{FULLWIDTH SOLIDUS}][\N{EXCLAMATION MARK}\N{QUOTATION MARK}\N{NUMBER SIGN}\N{DOLLAR SIGN}\N{PERCENT SIGN}\N{AMPERSAND}\N{APOSTROPHE}\N{LEFT PARENTHESIS}\N{RIGHT PARENTHESIS}\N{ASTERISK}\N{PLUS SIGN}\N{COMMA}\N{HYPHEN-MINUS}\N{FULL STOP}\N{SOLIDUS}]' file1
淡お忘
#5 · 2019-05-09 14:30

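You could try tr, provided your tr handles multibyte characters (GNU tr works on single bytes, so it will mangle a UTF-8 fullwidth colon):

tr ":" ":" < file.ext > file_new.ext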
Root(大扎)
#6 · 2019-05-09 14:31

I'd agree that Python is not the most effective tool for this purpose. While the options presented so far are good, sed is another good tool to have around:

sed -i 's/\xEF\xBC\x9A/:/g' file.txt

The -i option causes sed to edit the file in place, as in tchrist's perl example. Note that \xEF\xBC\x9A is the UTF-8 byte sequence for the code point U+FF1A (which is the single 16-bit unit 0xFF1A in UTF-16), so this command assumes the file is UTF-8 encoded. This page is a useful reference in case you need to deal with different encodings of the same Unicode value.
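
To double-check the byte values used in the sed command, a short Python 3 sketch like the following can print how the fullwidth colon is encoded. The variable names are just for illustration:

```python
# FULLWIDTH COLON, code point U+FF1A
colon = "\uFF1A"

utf8_bytes = colon.encode("utf-8")
utf16_bytes = colon.encode("utf-16-be")  # big-endian, no byte order mark

print(utf8_bytes.hex())   # efbc9a
print(utf16_bytes.hex())  # ff1a
```

The same approach should work for any other character you want to match by its raw bytes.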
