Disable XML validation when using XDocument

Posted 2020-07-18 07:35

Question:

I'm parsing an XLIFF document using the XDocument class. Does XDocument perform some validation of the content which I read into it, and if so - is there any way to disable that validation?

I'm getting some weird errors if the XLIFF isn't valid XML (I don't care that it isn't, I just want to parse it).

E.g.

'.', hexadecimal value 0x00, is an invalid character. 

I'm currently reading the file like this:

string FileLocation = @"C:\XLIFF\text.xlf";
XDocument doc = XDocument.Load(FileLocation);

Thanks.

Answer 1:

I had a similar problem, which was fixed by letting a StreamReader read the content.

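// This line throws an exception like yours:
// XDocument xd = XDocument.Load(@"C:\test.xml");

// This works (the StreamReader decodes the file before the XML parser sees it):
XDocument xd;
using (var reader = new System.IO.StreamReader(@"C:\test.xml"))
{
    xd = XDocument.Load(reader);
}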

If that does not help, try specifying the proper encoding explicitly.
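
As a minimal sketch of passing the encoding explicitly: the class and method names below (XliffLoader, LoadWithEncoding) are hypothetical, and which encoding is correct depends on how the file was actually written. UTF-16 (Encoding.Unicode) is only a guess that would be consistent with stray 0x00 values.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XliffLoader
{
    // Hypothetical helper: decodes the file with the given encoding
    // (a byte order mark, if present, still takes precedence) and
    // hands the decoded text to XDocument.
    public static XDocument LoadWithEncoding(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return XDocument.Load(reader);
        }
    }
}
```

It could then be called as XliffLoader.LoadWithEncoding(@"C:\test.xml", Encoding.Unicode), or with Encoding.UTF8 if that turns out to be the file's real encoding.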



Answer 2:

If you want to strip characters from strings that are invalid for use in XML, you can use this method:

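// Requires: using System.Text.RegularExpressions;
private static string RemoveXmlInvalidCharacters(string s)
{
    // .NET's \uXXXX escape takes exactly four hex digits, so code points
    // above U+FFFF are handled as surrogate pairs: valid pairs are kept,
    // unpaired surrogates are removed.
    return Regex.Replace(
        s,
        @"[^\u0009\u000A\u000D\u0020-\uFFFD]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
        string.Empty);
}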

It removes any characters that fall outside of the set of valid character values, according to the XML standard.
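
The same filtering can also be sketched without a regular expression, using the framework's own character checks (XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair, available from .NET Framework 4.0). The class and method names here (XmlCharFilter, StripInvalidXmlChars) are hypothetical, and this is an illustration rather than a drop-in replacement.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: copies only characters that are legal in
    // XML 1.0, keeping valid surrogate pairs and dropping lone surrogates.
    public static string StripInvalidXmlChars(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length
                && XmlConvert.IsXmlSurrogatePair(s[i + 1], c))
            {
                // Valid pair: keep both halves and skip the low surrogate.
                sb.Append(c).Append(s[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Note that IsXmlSurrogatePair takes the low surrogate as its first argument and the high surrogate as its second.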



Answer 3:

You can't parse invalid XML, because parsing requires a valid XML structure.
It may be that the file is being read as ASCII when it should be read as UTF-8 or UTF-16, which would lead to the problem you encountered.

Possible solution:
Read the file as UTF-8.



Answer 4:

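An XLIFF document is an XML document. Character 0x00 is not a valid XML character, and a file that contains it is not XML, so you cannot read it with an XML parser.

Strictly speaking, this is a well-formedness problem rather than a validity one: validity means conforming to a DTD or schema, whereas a document containing an illegal character is not well-formed. A streaming (SAX-style) parser can typically deliver the content up to the first such error, but it cannot reliably continue past it.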

Valid characters according to XML Specification:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

UPDATE

Suggested solution: pre-process the files to remove the invalid characters. A \0 character can be replaced with a space unless it carries meaning (that is, it is part of binary data), in which case that data needs to be supplied in Base64 format.
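
The pre-processing step suggested above can be sketched roughly as follows. The names (XliffPreprocessor, ParseReplacingNul, LoadReplacingNul) are hypothetical, and the sketch assumes the \0 characters are stray data rather than a symptom of decoding the file with the wrong encoding.

```csharp
using System.IO;
using System.Xml.Linq;

public static class XliffPreprocessor
{
    // Hypothetical helper: replaces every NUL character with a space,
    // then parses the cleaned text.
    public static XDocument ParseReplacingNul(string rawXml)
    {
        string cleaned = rawXml.Replace('\0', ' ');
        return XDocument.Parse(cleaned);
    }

    // Reads the whole file as text first. File.ReadAllText honours a
    // byte order mark and otherwise assumes UTF-8.
    public static XDocument LoadReplacingNul(string path)
    {
        return ParseReplacingNul(File.ReadAllText(path));
    }
}
```

If the file turns out to be UTF-16 without a byte order mark, this would insert a space after nearly every character, so the encoding is worth checking before reaching for this approach.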