I have a file with a lot of text, and mixed into it are special space characters (Unicode spaces).
I need to replace all of them with the normal "space" character.
Easy using perl:
perl -CSDA -plE 's/\s/ /g' file
but, as @mklement0 correctly pointed out in a comment, it will match \t (TAB) too. If this is a problem, you could use
perl -CSDA -plE 's/[^\S\t]/ /g' file
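For comparison, here is a rough Python sketch of the same two substitutions. It relies on the fact that Python's `re` module treats `\s` as Unicode whitespace for `str` patterns. The function name `replace_spaces` and its `keep_tabs` parameter are made up for this illustration. Newlines are excluded from the match because, unlike `perl -l`, the sketch works on text that still contains its line endings.

```python
import re

# On a str pattern, \s matches Unicode whitespace.
# [^\S\n]   = any whitespace except newline
# [^\S\t\n] = any whitespace except TAB and newline
ANY_SPACE = re.compile(r"[^\S\n]")
ANY_SPACE_BUT_TAB = re.compile(r"[^\S\t\n]")

def replace_spaces(text, keep_tabs=False):
    """Replace each whitespace character (optionally sparing TAB) with U+0020."""
    pattern = ANY_SPACE_BUT_TAB if keep_tabs else ANY_SPACE
    return pattern.sub(" ", text)

# EM SPACE and NO-BREAK SPACE become plain spaces; the TAB survives.
print(repr(replace_spaces("a\u2003b\tc\u00a0d\n", keep_tabs=True)))
```

Note that the set of characters Python counts as whitespace is not guaranteed to be identical to Perl's, so treat this as an approximation of the one-liners above.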
Demo:
X X
The string above contains these characters:
U+00058 X LATIN CAPITAL LETTER X
U+01680 OGHAM SPACE MARK
U+02002 EN SPACE
U+02003 EM SPACE
U+02004 THREE-PER-EM SPACE
U+02005 FOUR-PER-EM SPACE
U+02006 SIX-PER-EM SPACE
U+02007 FIGURE SPACE
U+02008 PUNCTUATION SPACE
U+02009 THIN SPACE
U+0200A HAIR SPACE
U+0202F NARROW NO-BREAK SPACE
U+0205F MEDIUM MATHEMATICAL SPACE
U+03000 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X
Running:
perl -CSDA -plE 's/\s/_/g' <<<"X X"
(which replaces with an underscore, so that the result is visible in the demo) prints
X_____________X
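If you want to produce a similar listing of what a string really contains, a small Python sketch using the standard `unicodedata` module could look like the following. The helper name `describe` and the exact output format are assumptions for illustration, not what was used to generate the listing above.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXXX NAME' line per character in text."""
    return [
        f"U+{ord(ch):05X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# Escapes are used so the invisible characters survive copy/paste.
sample = "X\u1680\u2002\u3000X"
print("\n".join(describe(sample)))
```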
This is also doable in pure bash:
LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
while read -r line; do
echo "${line//[$spaces]/ }"
done
The LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be if you are working with UTF-8 texts). :)
demo:
str="X X"
echo "${str//[$spaces]/_}"
prints again:
X_____________X
The same using sed: prepare the variable $spaces as above and use:
sed "s/[$spaces]/ /g" file
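The explicit-list approach used in the bash and sed variants can also be sketched in Python with `str.translate`. The code points below are copied from the `$spaces` variable above. The names `SPACE_CODE_POINTS` and `normalize_spaces` are hypothetical.

```python
# Code points copied from the $spaces variable above.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200C),          # U+2000 .. U+200B
    0x202F, 0x205F, 0x3000, 0xFEFF,
]

def normalize_spaces(text, replacement=" "):
    """Replace every listed code point with `replacement`."""
    table = {cp: replacement for cp in SPACE_CODE_POINTS}
    return text.translate(table)

# Underscore as replacement, as in the demos above, to make the result visible.
print(normalize_spaces("X\u00a0\u2003\u3000X", "_"))
```

Like the shell versions, this only touches the characters in the list, so tabs and newlines are left alone.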
Edit, because of some strange copy/paste (or locale) problems:
xxd -ps <<<"$spaces"
shows
c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a
The md5 digest (computed with two different programs):
md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"
prints the same md5 both times:
35cf5e1d7a5f512031d18f3d2ec6612f -
35cf5e1d7a5f512031d18f3d2ec6612f
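As a cross-check that does not depend on the shell's locale or on copy/paste, the same bytes can be built from the code points in Python. This is a sketch: the variable names are made up. The trailing newline is added on purpose because the `<<<` here-string appends one. The printed hex should match the xxd output above, and the printed digest can be compared with the md5 shown above.

```python
import hashlib

# Same characters, in the same order, as the $spaces variable.
SPACES = (
    "\u00a0\u1680\u180e"
    + "".join(chr(cp) for cp in range(0x2000, 0x200C))  # U+2000 .. U+200B
    + "\u202f\u205f\u3000\ufeff"
)

# The shell here-string (<<<) appends a newline, so add one here too.
payload = (SPACES + "\n").encode("utf-8")

print(payload.hex())                     # compare with the xxd -ps output
print(hashlib.md5(payload).hexdigest())  # compare with the md5 digest
```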
It is possible to identify the characters by their Unicode code points; unfortunately, sed 's/[[:space:]]\+/\ /g' won't do the trick.
By reworking another SO answer, we list all the code points, save them in a variable, and then use sed for the replacement (note that with -i.bak a copy of the original file is saved as well):
CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt
If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:
nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII
nws --ascii -i file # update file in place
The --ascii mode of nws:
- transliterates (non-ASCII) Unicode whitespace (such as a no-break space) and punctuation (such as curly quotes (“”), en dash (–), ...) to their closest ASCII equivalent
- while leaving any other Unicode characters alone.
This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.
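To illustrate the idea of such a transliteration (this is NOT the nws implementation, and the mapping below is a small hand-picked assumption, not nws's actual table), a Python sketch could map a few typographic characters to ASCII and leave everything else untouched:

```python
# Illustrative mapping only; NOT the table used by nws.
ASCII_EQUIVALENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2003": " ",   # EM SPACE
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
}
TABLE = str.maketrans(ASCII_EQUIVALENTS)

def to_ascii_punctuation(text):
    """Transliterate the mapped characters; leave all others alone."""
    return text.translate(TABLE)

# The accented letter is not in the table, so it is preserved.
print(to_ascii_punctuation("\u201chi\u201d\u00a0\u2013 caf\u00e9"))
```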
Installing nws from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
[sudo] npm install nws-cli -g
Note:
- Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
- -g ensures global installation and is needed to put nws-cli in your system's $PATH.
Manual installation (bash)
- Download the bash script as nws.
- Make it executable with chmod +x nws.
- Move or symlink it to a directory in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).
POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace
In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.
This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.
There are two pitfalls:
- Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current. On Ubuntu 16.04, for example, some characters are not properly classified and are therefore not matched by [:space:] / [:blank:].
- The utilities should use the active locale's charmap, but there are regrettable exceptions. The following utilities are NOT Unicode-aware (there may be more):
  - Among GNU utilities (as of coreutils v8.27): cut, tr
  - Mawk, the awk implementation that is the default on Ubuntu, for instance
  - Among BSD/macOS utilities (as of macOS 10.12): awk
Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too:
sed 's/[[:space:]]/ /g' file
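A similar "is this character classified as whitespace?" question can be asked in Python. Note that this is only an analogy: Python does not consult the locale charmap but its own bundled Unicode database, so results depend on the interpreter's Unicode version rather than on the platform's locale data. The helper name `classify` and the candidate list are made up for the sketch.

```python
import unicodedata

# A few candidates: TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK,
# EM SPACE, ZERO WIDTH SPACE, IDEOGRAPHIC SPACE, BOM / ZERO WIDTH NO-BREAK SPACE.
CANDIDATES = "\t \u00a0\u1680\u2003\u200b\u3000\ufeff"

def classify(ch):
    """Return (code point label, Unicode general category, str.isspace())."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())

for ch in CANDIDATES:
    print(*classify(ch))
```

Characters such as U+200B are format characters (category Cf) rather than space separators, which is one reason a whitespace class alone may not catch everything in the explicit lists used in the other answers.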
If you use python3, this worked for me. It's makeshift code, but it does work.
FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'
# Copy the input to the output, replacing each EM SPACE (U+2003) with a normal space.
# Only U+2003 is handled here; extend the comparison for other Unicode spaces.
with open(FILENAME, 'r', encoding='utf8') as f, open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        for ch in line:
            if ch == '\u2003':
                ch = ' '
            o.write(ch)
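If more than one kind of Unicode space has to be handled, one possible generalization (a sketch, not part of the answer above) is to test each character's Unicode general category instead of comparing against a single code point. Category `Zs` ("space separator") covers characters such as NO-BREAK SPACE and EM SPACE, but not TAB or zero-width characters like U+200B. The function names `fix_line` and `fix_file` are hypothetical.

```python
import unicodedata

def fix_line(line):
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in line
    )

def fix_file(src, dst):
    """Stream src to dst line by line, normalizing Zs characters."""
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="utf8", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout:
        for line in fin:
            fout.write(fix_line(line))
```

Usage would be along the lines of `fix_file('File.txt', 'Fixed.txt')`, with the same file names as in the answer above.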