How to remove control chars from UTF8 string

Posted 2019-03-21 15:07

I have a VB.NET program that processes the content of documents. The program handles high volumes of documents in batches (more than 2 million documents, about 1 TB in total). Some of these documents may contain control characters or characters like U+F0E8 (http://www.fileformat.info/info/unicode/char/f0e8/browsertest.htm).

Is there an easy and, above all, fast way to remove those characters (except space, newline, tab, and so on)? If the answer is regex: does anyone have a complete regex for me?

Thanks!

2 answers
可以哭但决不认输i
#2 · 2019-03-21 15:27

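The POSIX character class for control characters is [:cntrl:] (see the "Regular expression" article on Wikipedia). Note that .NET regular expressions do not support POSIX classes; the closest .NET equivalent is the Unicode category \p{Cc}.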

对你真心纯属浪费
#3 · 2019-03-21 15:49

Try

resultString = Regex.Replace(subjectString, "\p{C}+", "")

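This will remove all "other" Unicode characters (control, format, private use, surrogate, and unassigned) from your string. U+F0E8 is a private-use character, so it is covered. Be aware that tab, carriage return, and line feed are control characters too, so \p{C} strips them as well; to keep them, exclude them with .NET character class subtraction, for example [\p{C}-[\t\r\n]]+. Space is not in \p{C} and is left alone.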
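
For batch use, here is a minimal VB.NET sketch of the \p{C} approach that keeps tab, CR, and LF. The module name ControlCharCleaner and the function name StripControlChars are hypothetical, chosen only for illustration. The regex is built once with RegexOptions.Compiled and reused, on the assumption that the same pattern runs over millions of documents; whether that actually pays off should be measured on real data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ControlCharCleaner
    ' Built once and reused for every document in the batch.
    ' [\p{C}-[\t\r\n]] matches every character in the Unicode "Other"
    ' categories except tab, CR and LF (.NET character class subtraction).
    Private ReadOnly Unwanted As New Regex("[\p{C}-[\t\r\n]]+", RegexOptions.Compiled)

    ' Hypothetical helper: returns the input without the unwanted characters.
    Public Function StripControlChars(input As String) As String
        If String.IsNullOrEmpty(input) Then Return input
        Return Unwanted.Replace(input, "")
    End Function

    Sub Main()
        ' Sample text containing NUL, a tab, the private-use character U+F0E8 and CR/LF.
        Dim sample As String = "a" & ChrW(0) & "b" & vbTab & "c" & ChrW(&HF0E8) & vbCrLf
        Dim cleaned As String = StripControlChars(sample)
        ' NUL and U+F0E8 are removed; tab, CR and LF are kept.
        Console.WriteLine(cleaned.Contains(vbTab))
        Console.WriteLine(cleaned.Contains(ChrW(&HF0E8)))
    End Sub
End Module
```

One caveat: .NET strings are UTF-16 and \p{C} includes surrogates (Cs), so characters outside the Basic Multilingual Plane, which are stored as surrogate pairs, are removed too. If the documents may contain such characters, listing the categories explicitly, for example [\p{Cc}\p{Cf}\p{Co}-[\t\r\n]]+, is probably the safer pattern.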
