Making XmlReaderSettings CheckCharacters work for

Posted 2019-08-15 17:11

Question:

I have an XML string coming from Adobe PDF AcroForms, which apparently allows form field names that start with numeric characters. I'm trying to parse this string into an XDocument:

XDocument xDocument = XDocument.Parse(xmlString);

But whenever the string contains a form field whose name starts with a numeric character, parsing throws an XmlException:

Name cannot begin with the 'number' character

Other solutions I found suggested using XmlReaderSettings.CheckCharacters:

using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
    XDocument xDocument = XDocument.Load(xmlReader);
}

But this didn't work either. Some articles attribute this to one of the points in the MSDN documentation:

If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.

So I tried reading from a byte stream instead of a text reader:

using (MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
    XDocument xDocument = XDocument.Load(xmlReader);
}

This didn't work either. Can anyone please help me figure out how to parse an XML string that contains elements whose names start with numeric characters? How is the XmlReaderSettings.CheckCharacters flag supposed to be used?

Answer 1:

You can't make a standard XML parser accept your format, even if it "looks like" XML, so stop trying. An element name that starts with a digit makes the document not well-formed, and standards-compliant XML parsers are not allowed to keep parsing input that is not well-formed. This was a deliberate design decision, based on all the problems quirks mode caused with HTML parsing.
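To illustrate what the flag does and does not control, here is a minimal sketch. The names `CheckCharactersDemo` and `TryLoad` are hypothetical helpers made up for this example; only `XmlReaderSettings`, `XmlReader.Create`, and `XDocument.Load` are real .NET APIs.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class CheckCharactersDemo
{
    // Hypothetical helper: returns true if the text loads into an XDocument
    // with the given CheckCharacters setting, false if an XmlException is thrown.
    public static bool TryLoad(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                XDocument.Load(reader);
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, `TryLoad("<field>&#x1;</field>", true)` should fail and `TryLoad("<field>&#x1;</field>", false)` should succeed, because the flag only relaxes the check on character entity references. `TryLoad("<1field/>", false)` should still fail, since element names are always checked.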

Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.

  1. An LL parser can be written by hand. Both the lexer and the parser are simple.

  2. A parser can be generated using ANTLR and a simple grammar (note that ANTLR produces LL-style parsers, not LR). Most likely, you'll even find example XML grammars.

  3. You can also take the source code of either of the .NET XML parsers and remove the validation you don't need. Both XmlDocument and XDocument are available in .NET Core's repository on GitHub.
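As a rough illustration of the first option, here is a minimal sketch of a hand-written recursive-descent parser. Everything in it (`LenientNode`, `LenientXmlParser`, and the decision to accept any run of non-delimiter characters as a name) is an assumption made for this example, not an existing API. It handles only elements, attributes, text, comments, the XML declaration, and the five predefined entities. CDATA sections, DOCTYPE declarations, namespaces, and numeric character references are not supported.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch; it is not part of System.Xml.
public sealed class LenientNode
{
    public string Name = "";
    public string Text = "";
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<LenientNode> Children = new List<LenientNode>();
}

// Hand-written recursive-descent (LL) parser for a small XML-like subset.
// Any run of non-delimiter characters counts as a name, so "1stField" is accepted.
public sealed class LenientXmlParser
{
    private readonly string _s;
    private int _i;
    private LenientXmlParser(string s) { _s = s; }

    public static LenientNode Parse(string text)
    {
        var p = new LenientXmlParser(text);
        p.SkipMisc();
        LenientNode root = p.ParseElement();
        p.SkipMisc();
        if (p._i < p._s.Length) throw p.Error("unexpected content after the root element");
        return root;
    }

    private LenientNode ParseElement()
    {
        Expect('<');
        var node = new LenientNode { Name = ReadName() };
        while (true)                                   // attributes
        {
            SkipWhitespace();
            if (TryConsume("/>")) return node;         // self-closing element
            if (TryConsume(">")) break;                // end of start tag
            string attr = ReadName();
            SkipWhitespace(); Expect('='); SkipWhitespace();
            char quote = Peek();
            if (quote != '"' && quote != '\'') throw Error("expected a quoted attribute value");
            int end = _s.IndexOf(quote, _i + 1);
            if (end < 0) throw Error("unterminated attribute value");
            node.Attributes[attr] = Decode(_s.Substring(_i + 1, end - _i - 1));
            _i = end + 1;
        }
        while (true)                                   // element content
        {
            if (_i >= _s.Length) throw Error("missing end tag for " + node.Name);
            if (TryConsume("</"))
            {
                if (ReadName() != node.Name) throw Error("mismatched end tag for " + node.Name);
                SkipWhitespace(); Expect('>');
                return node;
            }
            if (TrySkip("<!--", "-->")) continue;      // comments are dropped
            if (Peek() == '<') { node.Children.Add(ParseElement()); continue; }
            int next = _s.IndexOf('<', _i);
            if (next < 0) next = _s.Length;
            node.Text += Decode(_s.Substring(_i, next - _i));
            _i = next;
        }
    }

    private string ReadName()
    {
        int start = _i;
        while (_i < _s.Length && !char.IsWhiteSpace(_s[_i]) && "<>/=\"'".IndexOf(_s[_i]) < 0) _i++;
        if (_i == start) throw Error("expected a name");
        return _s.Substring(start, _i - start);
    }

    // Skips whitespace, comments, and "<? ... ?>" constructs outside the root element.
    private void SkipMisc()
    {
        do { SkipWhitespace(); } while (TrySkip("<!--", "-->") || TrySkip("<?", "?>"));
    }

    private bool TrySkip(string open, string close)
    {
        if (!At(open)) return false;
        int end = _s.IndexOf(close, _i + open.Length, StringComparison.Ordinal);
        if (end < 0) throw Error("unterminated " + open);
        _i = end + close.Length;
        return true;
    }

    private bool At(string t) =>
        _i + t.Length <= _s.Length && string.CompareOrdinal(_s, _i, t, 0, t.Length) == 0;
    private bool TryConsume(string t) { if (!At(t)) return false; _i += t.Length; return true; }
    private char Peek() => _i < _s.Length ? _s[_i] : '\0';
    private void Expect(char c) { if (Peek() != c) throw Error("expected '" + c + "'"); _i++; }
    private void SkipWhitespace() { while (_i < _s.Length && char.IsWhiteSpace(_s[_i])) _i++; }
    private FormatException Error(string m) => new FormatException(m + " at offset " + _i);

    // Decodes only the five predefined entities; "&amp;" goes last to avoid double decoding.
    private static string Decode(string t) =>
        t.Replace("&lt;", "<").Replace("&gt;", ">").Replace("&quot;", "\"")
         .Replace("&apos;", "'").Replace("&amp;", "&");
}
```

Calling `LenientXmlParser.Parse("<fields><1stField type=\"text\">A &amp; B</1stField></fields>")` returns a root node named `fields` with one child named `1stField`. The result is a plain tree rather than an XDocument because XName itself rejects names that start with a digit, so the parsed data cannot be stored in the LINQ to XML types without renaming the elements first.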