Question:

The following will replace ASCII control characters (\p{Cntrl} is shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");

The following will replace everything except printable ASCII characters (\p{Print} is shorthand for [\p{Graph}\x20]), which means accented characters are replaced as well:

my_string.replaceAll("[^\\p{Print}]", "?");

However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?
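
The limitation described above is easy to reproduce. The following is a minimal sketch, assuming Java's default regex mode (where POSIX classes such as \p{Cntrl} and \p{Print} match US-ASCII only); the class and method names are hypothetical and not part of the question:

```java
public class AsciiClassDemo {
    // Replaces ASCII control characters only; by default \p{Cntrl} is ASCII-only.
    static String replaceAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "?");
    }

    // Replaces everything outside printable ASCII, including accented letters.
    static String replaceNonAsciiPrintable(String s) {
        return s.replaceAll("[^\\p{Print}]", "?");
    }

    public static void main(String[] args) {
        // The input ends with U+200B (zero width space) and U+0007 (bell).
        String input = "caf\u00E9\u200B\u0007";
        // Only the bell is replaced; the zero width space survives.
        System.out.println(replaceAsciiControls(input));
        // The accented letter is replaced along with the invisible characters.
        System.out.println(replaceNonAsciiPrintable(input));
    }
}
```

The first method leaves the invisible U+200B in place, and the second one destroys the accented letter, which is the problem the question describes.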

Answer 1:

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regular expressions. Both java.util.regex.Pattern and String.replaceAll support Unicode categories such as \p{C}.
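
As an illustration, here is a small sketch of that one-liner applied to a string that stays within the Basic Multilingual Plane. The class and method names are hypothetical:

```java
public class UnicodeOtherDemo {
    // Replaces every code point in the Unicode "Other" general category (\p{C}).
    static String replaceOther(String s) {
        return s.replaceAll("\\p{C}", "?");
    }

    public static void main(String[] args) {
        // U+0007 is a control character (Cc); U+200B is a format character (Cf).
        String input = "a\u0007b\u200Bc\u00E9";
        // The accented letter is kept; both invisible characters become '?'.
        System.out.println(replaceOther(input));
    }
}
```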

Answer 2:

Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP code points, it's more complicated. \p{C} contains the surrogate code points of \p{Cs}. The replacement method above will corrupt non-BMP code points by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters that are not part of a pair (each surrogate character has an assigned code point) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
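
To try the loop above on the cases discussed in this answer, it can be wrapped in a helper method. This is only a sketch: the class name `CodePointCleaner` and the method name `replaceOtherCodePoints` are hypothetical, and the test input uses U+1D11E (a supplementary character encoded as a surrogate pair) and a lone surrogate U+D800:

```java
public class CodePointCleaner {
    // Walks the string by code point and replaces anything in \p{C} with '?'.
    static String replaceOtherCodePoints(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int offset = 0; offset < s.length();) {
            int codePoint = s.codePointAt(offset);
            offset += Character.charCount(codePoint);
            switch (Character.getType(codePoint)) {
                case Character.CONTROL:
                case Character.FORMAT:
                case Character.PRIVATE_USE:
                case Character.SURROGATE:
                case Character.UNASSIGNED:
                    out.append('?');
                    break;
                default:
                    out.appendCodePoint(codePoint);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A valid surrogate pair is read as one code point and kept intact.
        System.out.println(replaceOtherCodePoints("a\uD834\uDD1Eb"));
        // A lone surrogate is read as its own code point and replaced.
        System.out.println(replaceOtherCodePoints("x\uD800y"));
    }
}
```

Because `codePointAt` returns a full supplementary code point when it sees a valid pair and the bare surrogate otherwise, the switch never sees half of a valid pair.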

Answer 3:

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
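
A short sketch of the difference between the two categories, using hypothetical class and method names. The sample input contains U+00AD (soft hyphen, a format character) and U+0007 (bell, a control character):

```java
public class ControlFormatDemo {
    // Removes only "Other, Control" characters.
    static String stripControl(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Removes "Other, Control" and "Other, Format" characters.
    static String stripControlAndFormat(String s) {
        return s.replaceAll("[\\p{Cc}\\p{Cf}]", "");
    }

    public static void main(String[] args) {
        String input = "co\u00ADop\u0007";
        // The soft hyphen is still present after removing only \p{Cc}.
        System.out.println(stripControl(input).length());
        // Both the soft hyphen and the bell are gone here.
        System.out.println(stripControlAndFormat(input));
    }
}
```

Whether \p{Cf} should be included depends on the text, since some format characters affect how the surrounding text is rendered.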

Answer 4:

The methods below may serve your goal:

public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

public static String removeNonPrintable(String str) // everything in \p{C}
{
    return str.replaceAll("[\\p{C}]", "");
}

public static String removeSomeControlChar(String str) // \p{C} without surrogates (\p{Cs})
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

public static String removeFullControlChar(String str)
{
    // \r, \n and \t belong to \p{Cc}, so removeNonPrintable has already removed them
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}

Answer 5:

I have used this simple function for this:

import java.util.regex.Pattern;

// Matches any character outside the printable ASCII range (space through tilde)
private static final Pattern pattern = Pattern.compile("[^ -~]");
private static String cleanTheText(String text) {
    // Remove every match, not only occurrences of the first character found
    return pattern.matcher(text).replaceAll("");
}

Hope this is useful.

Answer 6:

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trim leading and trailing whitespace, 2. dos2unix, 3. mac2unix, 4. remove all "invisible Unicode characters" except whitespace:

myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")

Tested with Scala REPL.
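
For readers working in Java rather than Scala, the same four steps might look like the sketch below. The class name `StringCleaner` and method name `clean` are hypothetical, and the character class relies on Java's `&&` intersection syntax to exclude whitespace from the removal:

```java
public class StringCleaner {
    static String clean(String s) {
        return s.trim()                      // 1. trim leading and trailing whitespace
                .replaceAll("\r\n", "\n")    // 2. dos2unix
                .replaceAll("\r", "\n")      // 3. mac2unix
                // 4. remove invisible characters that are not whitespace
                .replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "");
    }

    public static void main(String[] args) {
        // Mixed line endings, a zero width space (U+200B) and a tab.
        String input = "  a\r\nb\rc\u200Bd\te  ";
        // Line endings are normalized, the tab is kept, U+200B is removed.
        System.out.println(clean(input));
    }
}
```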

Answer 7:

I have adapted the code from "Extract digits from a string in Java" to handle phone numbers such as +9 (987) 124124:

import java.nio.CharBuffer;

public static String stripNonDigitsV2(CharSequence input) {
    if (input == null)
        return null;
    if (input.length() == 0)
        return "";

    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap(input);
    int i = 0;
    while (i < buffer.length()) {
        char chr = buffer.get(i);
        // Skip a literal 'u' and the four characters after it (a spelled-out Unicode escape)
        if (chr == 'u') {
            i = i + 5;
            if (i >= buffer.length())
                break; // the skipped sequence runs to the end of the input
            chr = buffer.get(i);
        }

        // Keep characters 40 to 57: the digits 0-9 plus ( ) * + , - . /
        if (chr > 39 && chr < 58)
            result[cursor++] = chr;
        i = i + 1;
    }

    return new String(result, 0, cursor);
}
