bash - Remove all Unicode spaces and replace with normal spaces

Posted 2020-05-24 06:34

Question:

I have a file with a lot of text, and mixed into it are special space characters (Unicode spaces).

I need to replace all of them with the normal "space" character.

Answer 1:

Easy using perl:

perl -CSDA -plE 's/\s/ /g' file

However, as @mklement0 correctly pointed out in a comment, this also matches \t (TAB). If that is a problem, you can use:

perl -CSDA -plE 's/[^\S\t]/ /g'

Demo:

X             X

the above containing:

U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000   IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X

using:

perl -CSDA -plE 's/\s/_/g'  <<<"X             X"

Note that the demo replaces with an underscore so that the result is visible. It prints:

X_____________X
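
For readers who prefer Python, the same regex idea can be sketched roughly as follows. This is an illustration only, not part of the original answer: the name normalize_spaces is made up, and it assumes Python 3, where \s in a str pattern matches Unicode whitespace. Because Python does not chomp the line the way perl -l does, the character class also excludes the newline.

```python
import re

# Double negation: "not non-whitespace, not TAB, not LF",
# i.e. every whitespace character except TAB and LF.
_UNICODE_SPACE = re.compile(r"[^\S\t\n]")


def normalize_spaces(text: str, replacement: str = " ") -> str:
    """Replace each whitespace character other than TAB and LF.

    Hypothetical helper for illustration; the default replacement
    is an ordinary ASCII space.
    """
    return _UNICODE_SPACE.sub(replacement, text)


# OGHAM SPACE MARK, EN SPACE, EM SPACE and IDEOGRAPHIC SPACE between the X's;
# the TAB and the trailing newline are expected to be left alone.
sample = "X\u1680\u2002\u2003\u3000X\tY\n"
print(normalize_spaces(sample, "_"), end="")
```

As in the perl demo, the underscore is used only to make the replacements visible; calling normalize_spaces(sample) without the second argument inserts ordinary spaces.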

This is also doable in pure bash:

LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

# IFS= keeps leading/trailing whitespace; printf avoids echo's option parsing
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done

Setting LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be if you are working with UTF-8 text).

Demo:

str="X             X"
echo "${str//[$spaces]/_}"

prints again:

X_____________X

The same works with sed: prepare the $spaces variable as above and use:

sed "s/[$spaces]/ /g" file
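
The explicit-list approach can also be sketched in Python with str.translate. This is a hedged illustration rather than anything from the original answer: the names SPACE_CODE_POINTS, TO_ASCII_SPACE and replace_listed_spaces are invented here, and the code points are simply copied from the $spaces list above.

```python
# Code points copied from the $spaces variable in the bash example.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
    0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x202F,
    0x205F, 0x3000, 0xFEFF,
]

# str.translate accepts a dict that maps code points to replacement strings.
TO_ASCII_SPACE = {cp: " " for cp in SPACE_CODE_POINTS}


def replace_listed_spaces(text: str) -> str:
    """Replace only the listed characters; TAB, LF etc. are untouched."""
    return text.translate(TO_ASCII_SPACE)


print(replace_listed_spaces("a\u00a0b\u200bc\tdone"))
```

Unlike the \s-based variant, this touches exactly the characters in the list, so it also covers ZERO WIDTH SPACE (U+200B) and the BOM/ZERO WIDTH NO-BREAK SPACE (U+FEFF), which are in the list above.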

Edit: because of some strange copy/paste (or locale) problems, here is how to verify the contents of $spaces:

xxd -ps <<<"$spaces"

shows

c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a

the md5 digest (two different programs)

md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"

prints the same md5

35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
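
The same sanity check can be sketched in Python, assuming the list of code points from the $spaces variable above. The variable names here are made up for illustration; the trailing newline is appended by hand because the bash here-string (<<<) adds one.

```python
import hashlib

# Code points copied from the $spaces variable above.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
    0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x202F,
    0x205F, 0x3000, 0xFEFF,
]

spaces = "".join(chr(cp) for cp in SPACE_CODE_POINTS)

# A here-string appends "\n", so add it to reproduce the same byte sequence.
encoded = (spaces + "\n").encode("utf-8")

print(encoded.hex())                      # compare with the xxd -ps output
print(hashlib.md5(encoded).hexdigest())   # compare with the md5 digest above
```

If the hex dump differs from the xxd output shown above, the list was most likely damaged by copy/paste or built under a non-UTF-8 locale.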


Answer 2:

It is possible to identify the characters by their Unicode code points; unfortunately, sed 's/[[:space:]]\+/\ /g' won't do the trick.

Reworking another SO answer, we list all the code points, save them in a variable, and then use sed for the replacement (note that with -i.bak, sed also saves a copy of the original file):

 CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

 sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt 
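
For comparison, an in-place edit that keeps a .bak copy might look like this in Python. It is only a sketch: replace_in_place, SPACES and TABLE are hypothetical names, the character list is copied from $CHARS above, and the file is assumed to be UTF-8 and small enough to read into memory.

```python
import shutil
from pathlib import Path

# Same characters as in the $CHARS variable above.
SPACES = (
    "\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006"
    "\u2007\u2008\u2009\u200a\u200b\u202f\u205f\u3000\ufeff"
)
TABLE = str.maketrans({ch: " " for ch in SPACES})


def replace_in_place(path: str, backup_suffix: str = ".bak") -> None:
    """Rewrite `path` with the listed characters replaced by spaces.

    A copy of the original is saved next to it first, similar in
    spirit to sed -i.bak.
    """
    src = Path(path)
    shutil.copy2(src, src.with_name(src.name + backup_suffix))
    # newline="" leaves the existing line endings exactly as they are.
    with open(src, encoding="utf-8", newline="") as fh:
        text = fh.read()
    with open(src, "w", encoding="utf-8", newline="") as fh:
        fh.write(text.translate(TABLE))


# Example call (path taken from the sed command above):
# replace_in_place("/tmp/file_to_edit.txt")
```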


Answer 3:

If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:

nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII

nws --ascii -i file  # update file in place

The --ascii mode of nws:

  • transliterates (non-ASCII) Unicode whitespace (such as a no-break space ( )) and punctuation (such as curly quotes (“”), en dash (–), ... ) to their closest ASCII equivalent

  • while leaving any other Unicode characters alone.

This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.


Installation of nws from the npm registry (Linux and macOS)

Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash

With Node.js installed, install as follows:

[sudo] npm install nws-cli -g

Note:

  • Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
  • The -g ensures global installation and is needed to put nws-cli in your system's $PATH.

Manual installation (any Unix platform with bash)

  • Download this bash script as nws.
  • Make it executable with chmod +x nws.
  • Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).

Optional reading: POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace

In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.

This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.

There are two pitfalls:

  • Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current.

    • For instance, on Ubuntu 16.04 the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]:
      no-break space, figure space, narrow no-break space, next line
  • The utilities should use the active locale's charmap - but there are regrettable exceptions - the following utilities are NOT Unicode-aware (there may be more):

    • Among GNU utilities (as of coreutils v8.27):

      • cut, tr
    • Mawk, the awk implementation that is the default on Ubuntu, for instance.

    • Among BSD/macOS utilities (as of macOS 10.12):

      • awk

Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too:

sed 's/[[:space:]]/ /g' file
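
To get a feel for which characters count as whitespace at all, the classification can be inspected from Python. Note the assumption: Python uses its own bundled Unicode database rather than the platform's locale charmap, so this shows what Unicode itself says, not what sed will match on a particular system. The CANDIDATES list is an arbitrary selection made for this sketch.

```python
import unicodedata

# A few characters mentioned in this thread, plus TAB and SPACE for reference.
CANDIDATES = [
    0x0009, 0x0020, 0x0085, 0x00A0, 0x2007, 0x202F,  # whitespace per Unicode
    0x180E, 0x200B, 0xFEFF,                          # format characters (Cf)
]

for cp in CANDIDATES:
    ch = chr(cp)
    # Control characters such as TAB have no name, hence the default value.
    name = unicodedata.name(ch, "<unnamed control>")
    category = unicodedata.category(ch)
    print(f"U+{cp:04X}  {category}  isspace={ch.isspace()!s:<5}  {name}")
```

With a reasonably recent Python 3, the last three entries (MONGOLIAN VOWEL SEPARATOR, ZERO WIDTH SPACE, and the BOM) report category Cf and are not treated as whitespace, so a class-based pattern such as [:space:] or \s is not expected to catch them; they need an explicit list like the ones in the other answers.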


Answer 4:

If you use Python 3, this worked for me. It's makeshift code, but it does work.

FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'

# Copy FILENAME to OUTPUTNAME, replacing each EM SPACE (U+2003) with a
# normal space. Only this one character is handled.
with open(FILENAME, encoding='utf8') as f, \
        open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        o.write(line.replace('\u2003', ' '))