I have a file with a lot of text, and mixed in are special space characters (Unicode spaces).
I need to replace all of them with the normal "space" character.
Easy using perl:
but as @mklement0 correctly said in a comment, it will match \t (TAB) too. If this is a problem, you could use
Demo:
the above containing:
using:
(note: for the demo, the spaces are replaced with underscores) prints
Also doable using pure bash:
The LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be, if you are working with UTF-8 texts) :)
Demo:
prints again:
The same using sed - prepare the variable $spaces as above and use:
Edit - because of some strange copy/paste (or locale) problems:
shows
the md5 digest (two different programs)
prints the same md5
It is possible to identify the characters by their Unicode code points; sed 's/[[:space:]]\+/\ /g' won't do the trick, unfortunately.
By reworking another SO answer, we list all the code points, save them in a variable, then use sed for the replacement (note that with -i.bak we also save a copy of the original file):
If you use python3, this worked for me; it's makeshift code, but it does work.
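A minimal sketch of one way to do this in Python 3, offered as an illustration and not necessarily the code this answer refers to. It assumes that "Unicode spaces" means the characters in Unicode general category Zs (space separators); the function name replace_unicode_spaces is made up for this example. Tabs and newlines belong to category Cc, so they are left alone.

```python
import unicodedata


def replace_unicode_spaces(text: str, replacement: str = " ") -> str:
    """Replace every space separator (general category Zs) with `replacement`.

    Category Zs includes the ordinary ASCII space as well as no-break space,
    em space, ideographic space, and so on. Tabs and newlines are control
    characters (category Cc) and are therefore kept as they are.
    """
    return "".join(
        replacement if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


if __name__ == "__main__":
    # NBSP, EM SPACE, NARROW NO-BREAK SPACE, then a TAB that must survive.
    sample = "a\u00a0b\u2003c\u202fd\te"
    # Underscore is used here only to make the replacements visible.
    print(replace_unicode_spaces(sample, "_"))
```

Because the classification comes from Python's own Unicode database, the result does not depend on the shell's locale settings, though it does depend on the Unicode version bundled with the Python interpreter in use.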
If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:
The --ascii mode of nws:
transliterates (non-ASCII) Unicode whitespace (such as a no-break space (U+00A0)) and punctuation (such as curly quotes (“”), en dash (–), ...) to their closest ASCII equivalent
while leaving any other Unicode characters alone.
This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.
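To make the idea of transliteration concrete, here is a small Python sketch of the general technique. It is not how nws is implemented, and the mapping table below is a hypothetical, deliberately tiny selection chosen for this example; a real tool would cover many more characters.

```python
# Hypothetical, deliberately small mapping from typographic characters to
# ASCII look-alikes. This is an illustration only, not the table nws uses.
ASCII_EQUIVALENTS = {
    "\u00a0": " ",   # no-break space
    "\u2007": " ",   # figure space
    "\u202f": " ",   # narrow no-break space
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
}

# str.maketrans turns the dict into a table usable by str.translate.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_ascii_lookalikes(text: str) -> str:
    """Replace the mapped characters; leave every other character untouched."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # The accented letter is not in the table, so it is preserved.
    print(to_ascii_lookalikes("\u201chi\u201d\u00a0\u2013 caf\u00e9"))
```

Using an explicit table keeps the behavior predictable: only the listed characters change, and all other non-ASCII text passes through as-is.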
Installation of nws from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
Note:
Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
-g ensures global installation and is needed to put nws-cli in your system's $PATH.
Manual installation (any Unix platform with bash)
Download the bash script as nws.
Make it executable with chmod +x nws.
Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).
Optional reading: POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace
In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.
This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.
There are two pitfalls:
Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current.
On Ubuntu 16.04, for instance, the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]:
no-break space, figure space, narrow no-break space, next line
The utilities should use the active locale's charmap - but there are regrettable exceptions - the following utilities are NOT Unicode-aware (there may be more):
Among GNU utilities (as of coreutils v8.27): cut, tr
Mawk, the awk implementation that is the default on Ubuntu, for instance.
Among BSD/macOS utilities (as of macOS 10.12): awk
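As a cross-check that does not depend on the platform's charmap, the following Python 3 sketch reports how Python's own Unicode database classifies the four characters named in the first pitfall, and whether Python's \s matches them. It says nothing about what sed, tr, or awk will do on a given system; the dictionary CANDIDATES and the helper classify are names invented for this example.

```python
import re
import unicodedata

# The four characters named in the first pitfall, by code point.
CANDIDATES = {
    "no-break space": "\u00a0",
    "figure space": "\u2007",
    "narrow no-break space": "\u202f",
    "next line": "\u0085",
}


def classify(ch: str) -> dict:
    """Describe one character using Python's bundled Unicode data."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Zs = space separator, Cc = control character.
        "category": unicodedata.category(ch),
        # For str patterns, \s covers Unicode whitespace, not just ASCII.
        "matches_backslash_s": re.fullmatch(r"\s", ch) is not None,
    }


if __name__ == "__main__":
    for label, ch in CANDIDATES.items():
        print(label, classify(ch))
```

Comparing this report with what a locale-aware utility actually matches is one way to tell whether a mismatch comes from an outdated charmap or from a utility that ignores the charmap altogether.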
Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too: