I have files which contain non-printing characters such as \u2066-\u2069 (directional formatting) and \u2000-\u2009 (spaces of various widths). Is it possible to remove (or replace) them using a (Java) regex? (\\s+ does not work with the above.) I don't want to build the character list myself, as I don't know which characters I might get.
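
There is also a POSIX-like [^[:graph:]] available in some regex flavors. In Java, for one or more non-visible characters, try \P{Graph}+
The uppercase P indicates a negation of \p{Graph}, and the pattern would match one or more of [^\p{Alnum}\p{Punct}] or [\p{Z}\p{C}]. The downside is that, according to the manual, it is US-ASCII only. If you are working with Unicode text (e.g. read from UTF-8 files), consider using the inline flag (?U) or Pattern.UNICODE_CHARACTER_CLASS. Just to mention, there is also \P{Print} available for non-printable characters.

The space characters you listed (\u2000-\u2009) all belong to the Separator, space (Zs) Unicode category (the directional formatting characters \u2066-\u2069 are in the Format category, Cf, instead), so for the spaces you may use \p{Zs}+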
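As a rough sketch of how the patterns mentioned so far could be applied in Java (the class and method names InvisibleChars, collapseSpaces, stripInvisible and stripNonGraph are made up for illustration, and the input is assumed to be already decoded into a String):

```java
final class InvisibleChars {
    // Collapse each run of Unicode space separators (category Zs) into one ASCII space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\p{Zs}+", " ");
    }

    // Remove every separator (Z) and "other" (C) character. Careful: ordinary
    // spaces, tabs and newlines are in these categories too, so they go as well.
    static String stripInvisible(String s) {
        return s.replaceAll("[\\p{Z}\\p{C}]+", "");
    }

    // Remove non-graphical characters. The (?U) flag makes \p{Graph} follow
    // Unicode rules; without it, non-ASCII letters would be removed as well.
    static String stripNonGraph(String s) {
        return s.replaceAll("(?U)\\P{Graph}+", "");
    }

    public static void main(String[] args) {
        String spaced = "a\u2003\u2009b";           // EM SPACE + THIN SPACE between a and b
        String directional = "x\u2066y\u2069 z";    // directional isolates plus a plain space

        System.out.println(collapseSpaces(spaced));       // prints "a b"
        System.out.println(stripInvisible(directional));  // isolates and the space are removed
        System.out.println(stripNonGraph("\u00E9\u2003t"));
    }
}
```

Which variant fits depends on whether regular spaces and line breaks should survive: collapseSpaces keeps them readable, while stripInvisible removes them along with everything else in Z and C.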
The Zs Unicode category stands for space separators of any kind (see more category names in the documentation).

To replace all horizontal whitespace with a single regular ASCII space, you may use \h+
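A minimal sketch of the horizontal-whitespace replacement, assuming Java 8 or later (where \h is available); the class name HorizontalWhitespace and the method normalize are hypothetical:

```java
final class HorizontalWhitespace {
    // \h covers tabs, ASCII spaces, no-break spaces and the \u2000-\u200a range,
    // but not line terminators, so the line structure of the input is preserved.
    static String normalize(String s) {
        return s.replaceAll("\\h+", " ");
    }

    public static void main(String[] args) {
        String sample = "col1\t\u00A0\u2003col2\nrow2";
        // The tab, no-break space and em space collapse into one space;
        // the newline before "row2" is kept.
        System.out.println(normalize(sample));
    }
}
```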
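As per the Java regex documentation, \h matches a horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]

If you want to shrink all Unicode whitespace to a single space, you may use (?U)\s+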
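One way this could look, as a sketch that uses the embedded (?U) flag together with \s+ (the class name ShrinkWhitespace and the method shrink are invented for this example):

```java
final class ShrinkWhitespace {
    // With (?U), \s matches any Unicode whitespace, including line breaks,
    // so every mixed run of whitespace becomes a single ASCII space.
    static String shrink(String s) {
        return s.replaceAll("(?U)\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "a\u2003\n\u00A0b";
        System.out.println(shrink(sample));                  // prints "a b"
        // For comparison: without the flag only the newline is treated as whitespace.
        System.out.println(sample.replaceAll("\\s+", " "));
    }
}
```

Unlike \h+, this also swallows line breaks, so it suits cases where the whole text should end up on one line.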
The (?U) is an embedded flag option equal to the Pattern.UNICODE_CHARACTER_CLASS option passed to the Pattern.compile method. Without it, \s matches what \p{Space} matches, i.e. [ \t\n\x0B\f\r]. Once you pass (?U), it will start matching all whitespace chars in the Unicode table.

To tokenize a string, you may split directly with (?U)\s+