The Invulnerable XMLException

Posted 2019-02-17 18:46

Question:

Background

I serialize a very large List<string> using this code:

public static string SerializeObjectToXML<T>(T item)
{
    XmlSerializer xs = new XmlSerializer(typeof(T));
    using (StringWriter writer = new StringWriter())
    {
        xs.Serialize(writer, item);
        return writer.ToString();
    }
}

And deserialize it using this code:

public static T DeserializeXMLToObject<T>(string xmlText)
{
    if (string.IsNullOrEmpty(xmlText)) return default(T);
    XmlSerializer xs = new XmlSerializer(typeof(T));
    using (MemoryStream memoryStream = new MemoryStream(new UnicodeEncoding().GetBytes(xmlText.Replace((char)0x1A, ' '))))
    using (XmlTextReader xsText = new XmlTextReader(memoryStream))
    {
        xsText.Normalization = true;
        return (T)xs.Deserialize(xsText);
    }
}

But I get this exception when I deserialize it:

XMLException: There is an error in XML document (217388, 15). '[]', hexadecimal value 0x1A, is an invalid character. Line 217388, position 15.

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader)

Question

Why is the xmlText.Replace((char)0x1A, ' ') line not working? What witchery is this?

Some Constraints

  • My code is in C#, framework 4, built in VS2010 Pro.
  • I can't view the value of xmlText in debug mode because the List<string> is too big; the Watch window just displays the error message "Unable to evaluate the expression. Not enough storage is available to complete this operation."

Answer 1:

I think I've found the problem. By default, XmlSerializer will allow you to generate invalid XML.

Given the code:

var input = "\u001a";

var writer = new StringWriter();
var serializer = new XmlSerializer(typeof(string));
serializer.Serialize(writer, input);

Console.WriteLine(writer.ToString());

The output is:

<?xml version="1.0" encoding="utf-16"?>
<string>&#x1A;</string>

This is invalid XML. According to the XML specification, all character references must be to characters which are valid. Valid characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

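As you can see, U+001A (like every other C0 control character except tab, line feed, and carriage return) is not allowed as a reference, since it is not a valid character. C1 control characters, by contrast, fall inside [#x20-#xD7FF] and are permitted.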
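A quick way to check a character against this production is XmlConvert.IsXmlChar, which is available from .NET 4. The following is only an illustrative sketch; the class name XmlCharProbe and the list of probed characters are made up for the example.

```csharp
using System;
using System.Xml;

public static class XmlCharProbe
{
    // Returns true when the character matches the XML 1.0 Char production.
    public static bool IsAllowed(char c)
    {
        return XmlConvert.IsXmlChar(c);
    }

    public static void Main()
    {
        // TAB, LF, CR, SYN (0x16), SUB (0x1A) and a plain letter.
        char[] probes = { '\u0009', '\u000A', '\u000D', '\u0016', '\u001A', 'A' };
        foreach (char c in probes)
        {
            // TAB, LF, CR and 'A' report True; 0x16 and 0x1A report False.
            Console.WriteLine("U+{0:X4}: {1}", (int)c, IsAllowed(c));
        }
    }
}
```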

The error message given by the decoder is a bit misleading, and would be clearer if it said that there was an invalid character reference.

There are several options for what you can do.

1) Don't let the XmlSerializer create invalid documents in the first place

You can use an XmlWriter, which by default will not allow invalid characters:

var input = "\u001a";

var writer = new StringWriter();
var serializer = new XmlSerializer(typeof(string));

// added following line:
var xmlWriter = XmlWriter.Create(writer);

// then, write via the xmlWriter rather than writer:
serializer.Serialize(xmlWriter, input);

Console.WriteLine(writer.ToString());

This will throw an exception when the serialization occurs. This will have to be handled and an appropriate error shown.

This probably isn't useful for you because you have data already stored with these invalid characters.
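For data written from now on, option 1 could be wrapped roughly as follows. This is a minimal sketch, not a drop-in replacement for the question's SerializeObjectToXML: the names StrictXml and TrySerialize are hypothetical, and it assumes the failure surfaces as an InvalidOperationException from XmlSerializer.Serialize.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class StrictXml
{
    // Serializes through an XmlWriter created with default settings,
    // which has CheckCharacters enabled and so rejects characters such as 0x1A.
    // Returns false instead of producing a document that cannot be read back.
    public static bool TrySerialize<T>(T item, out string xml)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stringWriter = new StringWriter())
        using (var xmlWriter = XmlWriter.Create(stringWriter))
        {
            try
            {
                serializer.Serialize(xmlWriter, item);
                xmlWriter.Flush();
                xml = stringWriter.ToString();
                return true;
            }
            catch (InvalidOperationException)
            {
                // The invalid character was detected during serialization.
                xml = null;
                return false;
            }
        }
    }

    public static void Main()
    {
        string xml;
        Console.WriteLine(TrySerialize("plain text", out xml));   // valid input
        Console.WriteLine(TrySerialize("bad\u001Atext", out xml)); // contains 0x1A
    }
}
```

The caller can then decide whether to reject the input, strip the offending characters, or report an error.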

or 2) Strip out references to this invalid character

That is, instead of .Replace((char)0x1a, ' '), which isn't actually replacing anything in your document at the moment (the serialized text contains the six-character reference &#x1A;, not the raw character), use .Replace("&#x1A;", " "). (This isn't case-insensitive, but it is what .NET generates. A more robust solution would be to use a case-insensitive regex.)
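The case-insensitive regex variant might look like the sketch below. The class and method names are hypothetical. The pattern also accepts leading zeros and the decimal form &#26;, which goes beyond what the answer describes, and the sample XML string is hand-written for illustration instead of being taken from the asker's data.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Serialization;

public static class SubReferenceStripper
{
    // Matches hexadecimal (&#x1A;, &#x1a;, &#x001A;) and decimal (&#26;)
    // references to U+001A and replaces each one with a space.
    public static string ReplaceSubReferences(string xml)
    {
        return Regex.Replace(xml, "&#(x0*1A|0*26);", " ", RegexOptions.IgnoreCase);
    }

    public static T Deserialize<T>(string xml)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var reader = new StringReader(xml))
        {
            return (T)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        // Hand-written sample in the shape XmlSerializer emits for a string.
        string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?>\n<string>a&#x1A;b</string>";

        string cleaned = ReplaceSubReferences(xml);
        string value = Deserialize<string>(cleaned);
        Console.WriteLine(value == "a b");
    }
}
```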


As an aside, XML 1.1 actually allows references to control characters, as long as they are references and not plain characters in the document. This would solve your problem apart from the fact that the .NET XmlSerializer doesn't support version 1.1.



Answer 2:

If you have existing data where you have serialised a class which contains characters which cannot subsequently be deserialised you can sanitise the data with the following method:

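// Requires: using System.Globalization; using System.Text.RegularExpressions; using System.Xml;
// As an extension method, this must be declared inside a static class.
public static string SanitiseSerialisedXml(this string serialized)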
{
    if (serialized == null)
    {
        return null;
    }

    const string pattern = @"&#x([0-9A-F]{1,2});";

    var sanitised = Regex.Replace(serialized, pattern, match =>
    {
        var value = match.Groups[1].Value;

        int characterCode;
        if (int.TryParse(value, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out characterCode))
        {
            if (characterCode >= char.MinValue && characterCode <= char.MaxValue)
            {
                return XmlConvert.IsXmlChar((char)characterCode) ? match.Value : string.Empty;
            }
        }

        return match.Value;
    });

    return sanitised;
}

The preferable solution is to reject invalid characters at the point of serialization, as per point 1 of Porges' answer. This code covers point 2 of Porges' answer (strip out references to the invalid character) and removes every hexadecimal character reference of one or two digits that points to a character invalid in XML 1.0. It was written to fix legacy data: we had stored serialized data in a database field, so solving the problem at the point of serialization was not an option.



Answer 3:

This issue also plagued us when we ran into ASCII control characters (SYN, NAK, etc.). If you write through an XmlWriter, there is a simple way to disable the check: set XmlWriterSettings.CheckCharacters to false, which turns off enforcement of the XML 1.0 character rules.

class Program
{
    static void Main(string[] args)
    {
        MyCustomType c = new MyCustomType();
        c.Description = string.Format("Something like this {0}", (char)22);
        var output = c.ToXMLString();
        Console.WriteLine(output);
    }
}

public class MyCustomType
{
    public string Description { get; set; }
    static readonly XmlSerializer xmlSerializer = new XmlSerializer(typeof(MyCustomType));
    public string ToXMLString()
    {
        var settings = new XmlWriterSettings() { Indent = true, OmitXmlDeclaration = true, CheckCharacters = false };
        StringBuilder sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            xmlSerializer.Serialize(writer, this);
            return sb.ToString();
        }
    }
}

The output will include the encoded character as &#x16; instead of throwing the error:

Unhandled Exception: System.InvalidOperationException: There was an error generating the XML document. ---> System.ArgumentException: '▬', hexadecimal value 0x16, is an invalid character.
at System.Xml.XmlEncodedRawTextWriter.InvalidXmlChar(Int32 ch, Char* pDst, Boolean entitize)
at System.Xml.XmlEncodedRawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlEncodedRawTextWriter.WriteString(String text)
at System.Xml.XmlEncodedRawTextWriterIndent.WriteString(String text)
at System.Xml.XmlWellFormedWriter.WriteString(String text)
at System.Xml.XmlWriter.WriteElementString(String localName, String ns, String value)
at System.Xml.Serialization.XmlSerializationWriter.WriteElementString(String localName, String ns, String value, XmlQualifiedName xsiType
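A document written with CheckCharacters = false contains references such as &#x16; that a default reader will reject, so the reading side needs the matching setting. The sketch below assumes the counterpart XmlReaderSettings.CheckCharacters = false, which turns off checking of character entity references. The names LenientXml and Deserialize are hypothetical, and the sample XML is hand-written for the example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class LenientXml
{
    // Deserializes through an XmlReader that does not validate
    // character references against the XML 1.0 character rules.
    public static T Deserialize<T>(string xmlText)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        var serializer = new XmlSerializer(typeof(T));
        using (var stringReader = new StringReader(xmlText))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return (T)serializer.Deserialize(xmlReader);
        }
    }

    public static void Main()
    {
        // Hand-written sample containing a reference to SYN (0x16).
        string value = Deserialize<string>("<string>a&#x16;b</string>");
        Console.WriteLine((int)value[1]); // code of the control character read back
    }
}
```

Both ends of the pipeline then have to agree on this setting, and the stored text is not well-formed XML 1.0 as far as other parsers are concerned.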