I'm parsing an XLIFF document using the XDocument class. Does XDocument perform some validation of the content which I read into it, and if so - is there any way to disable that validation?
I'm getting some weird errors if the XLIFF isn't valid XML (I don't care that it isn't, I just want to parse it).
E.g.
'.', hexadecimal value 0x00, is an invalid character.
I'm currently reading the file like this:
string FileLocation = @"C:\XLIFF\text.xlf";
XDocument doc = XDocument.Load(FileLocation);
Thanks.
I had a similar problem, which was fixed by letting a StreamReader read the content.
// this line throws exception like yours
XDocument xd = XDocument.Load(@"C:\test.xml");
// works
XDocument xd = XDocument.Load(new System.IO.StreamReader(@"C:\test.xml"));
If that does not help, try specifying the correct encoding explicitly.
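As a minimal sketch of passing the encoding explicitly: the helper name `LoadWithEncoding` and the class `XliffLoader` are made up for illustration, and which encoding is right depends on how the file was actually written.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XliffLoader
{
    // Hypothetical helper: loads an XML file through a StreamReader
    // that uses the given encoding.
    public static XDocument LoadWithEncoding(string path, Encoding encoding)
    {
        // This StreamReader overload still honours a byte order mark if the
        // file has one; 'encoding' is used when no BOM is present.
        using (var reader = new StreamReader(path, encoding))
        {
            return XDocument.Load(reader);
        }
    }
}
```

For example, `XliffLoader.LoadWithEncoding(@"C:\XLIFF\text.xlf", Encoding.Unicode)` would treat a file without a BOM as UTF-16 little-endian. A UTF-16 file read as a single-byte encoding is one plausible source of stray 0x00 characters.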
If you want to strip characters from strings that are invalid for use in XML, you can use this method:
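// Requires: using System.Text.RegularExpressions;
private static string RemoveXmlInvalidCharacters(string s)
{
    // .NET's \u escape takes exactly four hex digits, so \u10000-\u10FFFF cannot be written directly.
    // Characters above U+FFFF are stored as surrogate pairs: keep well-formed pairs, drop lone surrogates.
    return Regex.Replace(
        s,
        @"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\uD800-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
        string.Empty);
}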
It removes any characters that fall outside of the set of valid character values, according to the XML standard.
You can't parse invalid XML, because parsing requires a valid XML structure.
It might be the case that you read the file as ASCII when you should have read it as UTF-8 or UTF-16 and that leads to the problem you encountered.
Possible solution:
Read the file as UTF-8.
An XLIFF document is an XML document. Character 0x00 is not a valid XML character, and a file that contains it is not XML, so you cannot read it with an XML parser.
Well-formedness is a different matter: a SAX parser can read the part of a document that precedes a well-formedness error, but it cannot read a document containing invalid characters.
Valid characters according to XML Specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
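The character set above can also be checked without a regex. The sketch below assumes .NET 4.0 or later, where `XmlConvert.IsXmlChar` and `XmlConvert.IsXmlSurrogatePair` are available. The class and method names are invented for illustration.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: returns a copy of 's' containing only characters
    // allowed by the XML 1.0 Char production.
    public static string KeepOnlyXmlChars(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Allowed character in the Basic Multilingual Plane.
                sb.Append(c);
            }
            else if (i + 1 < s.Length && XmlConvert.IsXmlSurrogatePair(s[i + 1], c))
            {
                // Well-formed surrogate pair (#x10000-#x10FFFF): keep both halves.
                // Note the parameter order: low surrogate first, then high.
                sb.Append(c).Append(s[i + 1]);
                i++;
            }
            // Anything else (0x00, other control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```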
UPDATE
Suggested solution: pre-process the files to remove invalid characters. The character \0 can be replaced with a space unless it carries meaning (that is, the content is binary), in which case the data should be Base64-encoded instead.
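A rough sketch of that pre-processing step, assuming the whole file fits in memory and that the 0x00 characters are stray rather than the result of a wrong encoding. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml.Linq;

public static class XliffPreprocessor
{
    // Hypothetical helper: replaces NUL characters with spaces, then parses.
    public static XDocument ParseReplacingNul(string rawXml)
    {
        string cleaned = rawXml.Replace('\0', ' ');
        return XDocument.Parse(cleaned);
    }

    // Hypothetical helper: reads the file as text (File.ReadAllText detects a
    // byte order mark and otherwise assumes UTF-8) and parses the cleaned text.
    public static XDocument LoadReplacingNul(string path)
    {
        return ParseReplacingNul(File.ReadAllText(path));
    }
}
```

Usage would be `XDocument doc = XliffPreprocessor.LoadReplacingNul(@"C:\XLIFF\text.xlf");`. Other invalid characters would still make the parse fail, so this only covers the \0 case.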