An invalid XML character (Unicode: 0xc) was found

2019-01-10 13:21发布

问题:

Parsing an XML file using the Java DOM parser results in:

[Fatal Error] os__flag_8c.xml:103:135: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

回答1:

There are a few characters that are dissallowed in XML documents, even when you encapsulate data in CDATA-blocks.

If you generated the document you will need to entity encode it or strip it out. If you have an errorneous document, you should strip away these characters before trying to parse it.

See dolmens answer in this thread: Invalid Characters in XML

Where he links to this article: http://www.w3.org/TR/xml/#charsets

Basically, all characters below 0x20 is disallowed, except 0x9 (TAB), 0xA (CR?), 0xD (LF?)



回答2:

public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}    


回答3:

The character 0x0C is be invalid in XML 1.0 but would be a valid character in XML 1.1. So unless the xml file specifies the version as 1.1 in the prolog it is simply invalid and you should complain to the producer of this file.



回答4:

This link has a java code which works perfectly fine.

http://blog.mark-mclaren.info/2007/02/invalid-xml-characters-when-valid-utf8_5873.html



回答5:

Whenever invalid xml character comes xml, it gives such error. When u open it in notepad++ it look like VT, SOH,FF like these are invalid xml chars. I m using xml version 1.0 and i validate text data before entering it in database by pattern

Pattern p = Pattern.compile("[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u10000-\u10FFF]+"); 
retunContent = p.matcher(retunContent).replaceAll("");

It will ensure that no invalid special char will enter in xml



回答6:

You can filter all 'invalid' chars with a custom FilterReader class:

public class InvalidXmlCharacterFilter extends FilterReader {

    protected InvalidXmlCharacterFilter(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) return read;

        for (int i = off; i < off + read; i++) {
            if (!XMLChar.isValid(cbuf[i])) cbuf[i] = '?';
        }
        return read;
    }
}

And run it like this:

InputStream fileStream = new FileInputStream(xmlFile);
Reader reader = new BufferedReader(new InputStreamReader(fileStream, charset));
InvalidXmlCharacterFilter filter = new InvalidXmlCharacterFilter(reader);
InputSource is = new InputSource(filter);
xmlReader.parse(is);


回答7:

I faced a similar issue where XML was containing control characters. After looking into the code, I found that a deprecated class,StringBufferInputStream, was used for reading string content.

http://docs.oracle.com/javase/7/docs/api/java/io/StringBufferInputStream.html

This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.

I changed it to ByteArrayInputStream and it worked fine.



回答8:

For people who are reading byte array into String and trying to convert to object with JAXB, you can add "iso-8859-1" encoding by creating String from byte array like this:

String JAXBallowedString= new String(byte[] input, "iso-8859-1");

This would replace the conflicting byte to single-byte encoding which JAXB can handle. Obviously this solution is only to parse the xml.