removing all non-printing characters by regex

I have files which contain non-printing characters such as \u2066-\u2069 (directional formatting) and \u2000-\u2009 (spaces of various widths, e.g.   ). Is it possible to remove (or replace) them by using a (Java) regex? (\\s+ does not work with the above). I don't want to build this myself as I don't know what characters I might get.

Tags: java regex unicode

2 answers

chillily

Reply #2 · 2020-03-03 06:23

There is also a POSIX-like equivalent of [^[:graph:]] available. For one or more non-visible characters, try

\P{Graph}+

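The uppercase P negates \p{Graph}, so \P{Graph}+ matches one or more characters that are not visible, i.e. [^\p{Alnum}\p{Punct}]. The downside is that, according to the manual, it is US-ASCII only by default, so every non-ASCII character also counts as non-visible. If you are working with Unicode text, consider the inline flag (?U) or Pattern.UNICODE_CHARACTER_CLASS. With that flag the documentation defines \p{Graph} as [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}], so format characters (\p{Cf}) such as \u2066-\u2069 still count as visible; to catch those as well, use an explicit class such as [\p{Z}\p{C}].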

For completeness, there is also \P{Print} for non-printable characters.
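To make the ASCII-only caveat concrete, here is a minimal sketch that strips \P{Graph} runs with and without the (?U) flag. The class and method names are made up for illustration and are not part of any library.

```java
final class GraphDemo {
    // ASCII-only \P{Graph}: anything outside [\p{Alnum}\p{Punct}] is removed,
    // which includes every non-ASCII character.
    static String stripAsciiNonGraph(String s) {
        return s.replaceAll("\\P{Graph}+", "");
    }

    // With (?U), \p{Graph} follows the Unicode definition, so letters such as
    // e-acute survive while Unicode whitespace is still removed.
    static String stripUnicodeNonGraph(String s) {
        return s.replaceAll("(?U)\\P{Graph}+", "");
    }

    public static void main(String[] args) {
        String input = "caf\u00e9\u2003au lait"; // U+2003 is EM SPACE
        System.out.println(stripAsciiNonGraph(input));   // the accented letter is lost
        System.out.println(stripUnicodeNonGraph(input)); // the accented letter is kept
    }
}
```

Note that both variants delete the matched characters outright, including ordinary spaces, so words run together; replace with " " instead of "" if the separation should be kept.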


混吃等死

Reply #3 · 2020-03-03 06:38

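The space characters you listed (\u2000-\u2009) belong to the "Separator, space" Unicode category, so you may use

s = s.replaceAll("\\p{Zs}+", " ");

The Zs Unicode category stands for space separators of any kind (see more category names in the documentation). The directional formatting characters \u2066-\u2069 are in the Format category (Cf) instead, so \p{Zs} does not match them; use \p{Cf} for those.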
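A minimal sketch of the category-based approach, assuming a reasonably recent JDK (one whose Unicode tables already know U+2066-U+2069, which were added in Unicode 6.3). It deletes \p{Cf} format characters, which is the category the directional isolates fall in, and then collapses \p{Zs} runs to one ASCII space. The class and method names are illustrative only.

```java
final class InvisibleCleaner {
    // Collapse every run of Unicode space separators (Zs) into one ASCII space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\p{Zs}+", " ");
    }

    // Delete format characters (Cf), e.g. directional isolates and marks.
    static String dropFormatChars(String s) {
        return s.replaceAll("\\p{Cf}+", "");
    }

    // Remove format characters first, then normalize the remaining spaces.
    static String clean(String s) {
        return collapseSpaces(dropFormatChars(s));
    }

    public static void main(String[] args) {
        // U+2067 and U+2069 are directional isolates; U+2003 and U+2009 are spaces.
        String input = "\u2067abc\u2069\u2003\u2009def";
        System.out.println(clean(input)); // prints "abc def"
    }
}
```

Be aware that Cf also contains the zero width joiner (U+200D), which emoji sequences and some scripts rely on, so dropping the whole category may be too aggressive for text meant for display.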

To replace each run of horizontal whitespace with a single regular ASCII space, you may use

s = s.replaceAll("\\h+", " ");

As per the Java regex documentation,

\h A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
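One practical difference between \h and \p{Zs}, sketched below with illustrative names: \h includes the tab character, while \p{Zs} does not, because tab is a control character (Cc) and not a space separator. Neither pattern touches line breaks. \h requires Java 8 or later.

```java
final class HorizontalDemo {
    // Collapse runs of horizontal whitespace (tab, space, NBSP, U+2000-U+200A, ...).
    static String collapseHorizontal(String s) {
        return s.replaceAll("\\h+", " ");
    }

    // Collapse runs of space separators only; tabs are left alone.
    static String collapseZs(String s) {
        return s.replaceAll("\\p{Zs}+", " ");
    }

    public static void main(String[] args) {
        String input = "a\t\u2003b"; // tab followed by EM SPACE
        System.out.println(collapseHorizontal(input).equals("a b")); // prints true
        System.out.println(collapseZs(input).equals("a\t b"));       // prints true
    }
}
```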

If you want to shrink each run of Unicode whitespace (line breaks included) to a single space, use

s = s.replaceAll("(?U)\\s+", " ");

The (?U) is an embedded flag option equal to the Pattern.UNICODE_CHARACTER_CLASS option passed to the Pattern.compile method. Without it, \s matches what \p{Space} matches, i.e. [ \t\n\x0B\f\r]. Once you pass (?U), it will start matching all whitespace chars in the Unicode table.
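The effect of the flag can be sketched as follows, using both the inline (?U) form and the Pattern.UNICODE_CHARACTER_CLASS constant. The class and method names are hypothetical; only the java.util.regex calls are real.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespaceDemo {
    // Same pattern as "(?U)\\s+", with the flag passed to Pattern.compile instead.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String shrinkWithFlag(String s) {
        return UNICODE_WS.matcher(s).replaceAll(" ");
    }

    static String shrinkInline(String s) {
        return s.replaceAll("(?U)\\s+", " ");
    }

    // Without the flag, \s only covers [ \t\n\x0B\f\r].
    static String shrinkAsciiOnly(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "a\u2003\n\u2009b"; // EM SPACE, newline, THIN SPACE
        System.out.println(shrinkWithFlag(input));                    // prints "a b"
        System.out.println(shrinkInline(input));                      // prints "a b"
        System.out.println(shrinkAsciiOnly(input).equals(input));     // prints false: only the newline changed
    }
}
```

Precompiling the pattern, as in shrinkWithFlag, avoids recompiling the regex on every call when many strings are processed.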

To tokenize a string, you may split directly with any one of

String[] tokens = s.split("\\p{Zs}+");   // space separators only
// or
String[] tokens = s.split("\\h+");       // horizontal whitespace
// or
String[] tokens = s.split("(?U)\\s+");   // all Unicode whitespace
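A small runnable sketch of the tokenizing step, using the (?U)\s+ variant. The class and method names are assumptions for illustration. It also shows one thing to watch for: a separator at the very start of the input yields an empty first token.

```java
import java.util.Arrays;

final class TokenizeDemo {
    // Split on runs of Unicode whitespace.
    static String[] tokenize(String s) {
        return s.split("(?U)\\s+");
    }

    public static void main(String[] args) {
        // EM SPACE between the first two words, two THIN SPACEs before the third.
        String[] tokens = tokenize("alpha\u2003beta\u2009\u2009gamma");
        System.out.println(Arrays.toString(tokens)); // prints [alpha, beta, gamma]

        // A leading separator produces an empty first token.
        System.out.println(tokenize("\u2003alpha").length); // prints 2
    }
}
```

If empty leading tokens are unwanted, one option is to remove the leading whitespace before splitting, for example with s.replaceFirst("(?U)^\\s+", "").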
