This question already has an answer here:
file1 contains some ：s (that's fullwidth) that I'd like to turn into regular :s (that's our regular colon). How do I do this in bash? Perhaps a Python script?
You might want to look into Python's unicodedata.normalize(). It allows you to take a Unicode string and normalize it to a specific form, for example:
unicodedata.normalize('NFKC', thestring)
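As a minimal sketch of that call in Python 3 (where every str is already Unicode), the snippet below normalizes a made-up sample string containing a fullwidth colon. The sample text and variable names are assumptions for illustration only.

```python
import unicodedata

# Made-up sample text containing a fullwidth colon (U+FF1A).
text = "key\uff1a value"

# NFKC applies compatibility decomposition, which maps U+FF1A to ':'.
normalized = unicodedata.normalize("NFKC", text)
print(normalized)
```

Keep in mind that NFKC folds every compatibility character in the string, not just the colon (ligatures such as U+FB01 become "fi", for instance), so it may change more than you intend.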
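Here's a table of the different normalization forms from Unicode Standard Annex #15:
NFD (Normalization Form D): canonical decomposition
NFC (Normalization Form C): canonical decomposition, followed by canonical composition
NFKD (Normalization Form KD): compatibility decomposition
NFKC (Normalization Form KC): compatibility decomposition, followed by canonical composition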
If you only want to replace specific characters, you could use unicode.translate().
In Python 2.x you can use the unicode.translate method to translate a single Unicode code point to 0, 1 or more code points. The following code sets up a translation table that will map the full-width equivalents of all ASCII graphic characters to their ASCII equivalents:
(reference: see Wikipedia)
If you want to treat spaces similarly, do
table[0x3000] = u' '
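The translation table described above can be sketched in Python 3 roughly as follows, using str.translate (the Python 3 counterpart of unicode.translate). The helper name to_ascii_width and the sample string are hypothetical; the offset relies on the fullwidth forms U+FF01..U+FF5E lining up with ASCII U+0021..U+007E.

```python
# Map the fullwidth forms U+FF01..U+FF5E onto ASCII U+0021..U+007E.
# The two ranges are a constant 0xFEE0 apart.
table = {cp + 0xFEE0: chr(cp) for cp in range(0x21, 0x7F)}

# Optionally treat the ideographic space (U+3000) as an ASCII space.
table[0x3000] = " "


def to_ascii_width(text: str) -> str:
    """Replace fullwidth ASCII variants with their ASCII equivalents."""
    # Hypothetical helper name; str.translate does the actual work.
    return text.translate(table)


print(to_ascii_width("a\uff1ab\u3000c"))
```

Unlike NFKC normalization, this only touches the code points listed in the table, so everything else in the text is left alone.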
With all due respect, Python isn't the right tool for this job; Perl is:
or
or
or
You could try tr:
I'd agree that Python is not the most effective tool for this purpose. While the options presented so far are good, sed is another good tool to have around:
The -i option causes sed to edit the file in place, as in tchrist's perl example. Note that \xEF\xBC\x9A is the UTF-8 encoding of the code point U+FF1A, which UTF-16 stores as the single code unit 0xFF1A. This page is a useful reference in case you need to deal with different encodings of the same Unicode value.
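To make the encoding relationship concrete, here is a small Python 3 sketch that checks the byte values and then does the same kind of byte-level, in-place replacement. It is not the sed command from the answer; the function name replace_fullwidth_colons and the constant name are assumptions for illustration.

```python
from pathlib import Path

# UTF-8 encoding of the fullwidth colon, U+FF1A.
FULLWIDTH_COLON_UTF8 = b"\xef\xbc\x9a"

# The same code point, seen through two different encodings.
assert "\uff1a".encode("utf-8") == FULLWIDTH_COLON_UTF8
assert "\uff1a".encode("utf-16-be") == b"\xff\x1a"


def replace_fullwidth_colons(path: str) -> None:
    """Rewrite a UTF-8 file in place, turning fullwidth colons into ':'."""
    # Hypothetical helper: reads the whole file, so best for small files.
    p = Path(path)
    p.write_bytes(p.read_bytes().replace(FULLWIDTH_COLON_UTF8, b":"))
```

Working on raw bytes is safe here because UTF-8 is self-synchronizing: the sequence EF BC 9A cannot appear as part of any other character. The sketch does assume the file really is UTF-8 encoded.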