remove non-UTF-8 characters from xml with declared

2019-01-21 23:41发布

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately it may contain not utf-8 characters and there is a requirement to remove these characters from the xml on my side (legacy).

Let's consider an example where this invalid XML contains £ (pound).

1) I get xml as java String with £ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(£, "") to get rid of this character? Any potential issues?

2) I get xml as an array of bytes - how to handle this operation safely in that case?

6条回答
Anthone
2楼-- · 2019-01-22 00:14
"test text".replaceAll("[^\\u0000-\\uFFFF]", "");

This code removes all 4-byte utf8 chars from string.This can be needed for some purposes while doing Mysql innodb varchar entry

查看更多
爷的心禁止访问
3楼-- · 2019-01-22 00:15

Once you convert the byte array to String on the java machine, you'll get (by default on most machines) UTF-16 encoded string. The proper solution to get rid of non UTF-8 characters is with the following code:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
                    "[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            , ""));
}

or if you want to validate if some string contains non utf8 characters you would use Pattern.matches like:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
                    ".*(" +
                    "[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ").*"
            , values[i]));
}

If you have the byte array available than you could filter them even more properly with:

BufferedReader bufferedReader = null;
try {
    bufferedReader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
    for (String currentLine; (currentLine = bufferedReader.readLine()) != null;) {
        currentLine = currentLine.replaceAll(
                        "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                        "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                , ""));
    }

For making a whole web app be UTF8 compatible read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.

查看更多
唯我独甜
4楼-- · 2019-01-22 00:16

1) I get xml as java String with £ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(£, "") to get rid of this character?

I am assuming that you rather mean that you want to get rid of non-ASCII characters, because you're talking about a "legacy" side. You can get rid of anything outside the printable ASCII range using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", "");

2) I get xml as an array of bytes - how to handle this operation safely in that case?

You need to wrap the byte[] in an ByteArrayInputStream, so that you can read them in an UTF-8 encoded character stream using InputStreamReader wherein you specify the encoding and then use a BufferedReader to read it line by line.

E.g.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
    for (String line; (line = reader.readLine()) != null;) {
        line = line.replaceAll("[^\\x20-\\x7e]", "");
        // ...
    }
    // ...
查看更多
淡お忘
5楼-- · 2019-01-22 00:21

UTF-8 is an encoding; Unicode is a character set. But the GBP symbol is most definitely in the Unicode character set and therefore most certainly representable in UTF-8.

If you do in fact mean UTF-8, and you are actually trying to remove byte sequences that are not the valid encoding of a character in UTF-8, then...

CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ...;
CharBuffer parsed = utf8Decoder.decode(bytes);
...
查看更多
爷的心禁止访问
6楼-- · 2019-01-22 00:24

I faced the same problem while reading files from a local directory and tried this:

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "UTF-8"));
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDom = db.parse(new InputSource(in));

You might have to use your network input stream instead of FileInputStream.

-- Kapil

查看更多
我命由我不由天
7楼-- · 2019-01-22 00:26

Note that the first step should be that you ask the creator of the XML (which is most likely a home grown "just print data" XML generator) to ensure that their XML is correct before sending to you. The simplest possible test if they use Windows is to ask them to view it in Internet Explorer and see the parsing error at the first offending character.

While they fix that, you can simply write a small program that change the header part to declare that the encoding is ISO-8859-1 instead:

<?xml version="1.0" encoding="iso-8859-1" ?>

and leave the rest untouched.

查看更多
登录 后发表回答