Background
I serialize a very large List<string> using this code:
    public static string SerializeObjectToXML<T>(T item)
    {
        XmlSerializer xs = new XmlSerializer(typeof(T));
        using (StringWriter writer = new StringWriter())
        {
            xs.Serialize(writer, item);
            return writer.ToString();
        }
    }
And deserialize it using this code:
    public static T DeserializeXMLToObject<T>(string xmlText)
    {
        if (string.IsNullOrEmpty(xmlText)) return default(T);
        XmlSerializer xs = new XmlSerializer(typeof(T));
        using (MemoryStream memoryStream = new MemoryStream(new UnicodeEncoding().GetBytes(xmlText.Replace((char)0x1A, ' '))))
        using (XmlTextReader xsText = new XmlTextReader(memoryStream))
        {
            xsText.Normalization = true;
            return (T)xs.Deserialize(xsText);
        }
    }
But I get this exception when I deserialize it:
XMLException: There is an error in XML document (217388, 15). '[]', hexadecimal value 0x1A, is an invalid character. Line 217388, position 15.
at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)
at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader)
Question
Why is the xmlText.Replace((char)0x1A, ' ') line not working? What witchery is this?
Some Constraints
- My code is in C#, framework 4, built in VS2010 Pro.
- I can't view the value of xmlText in debug mode because the List<string> is too big and the watch window just displays the "Unable to evaluate the expression. Not enough storage is available to complete this operation." error message.
This issue also plagued us when running into ASCII control characters (SYN, NAK, etc.). If you are using XmlWriterSettings, there is a simple way to disable the check: set XmlWriterSettings.CheckCharacters to false, which turns off enforcement of the XML 1.0 character rules. The output will then include the character encoded as a numeric character reference (for example, &#x1A; for U+001A) instead of throwing the error.
I think I've found the problem. By default, XmlSerializer will allow you to generate invalid XML. Serializing a string that contains U+001A does not throw; the character is written to the output as the character reference &#x1A;.
This is invalid XML. According to the XML specification, every character reference must refer to a valid character. Valid characters are:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
As you can see, U+001A (like every other C0 control character apart from tab, line feed, and carriage return) is not allowed as a reference, since it is not a valid character.
The error message given by the decoder is a bit misleading, and would be clearer if it said that there was an invalid character reference.
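The behaviour described above can be reproduced with a short sketch. The class and method names here (InvalidReferenceDemo, Serialize, CanDeserialize) are made up for illustration, and the exact exception type is what I would expect on the .NET Framework rather than something confirmed in the original post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public static class InvalidReferenceDemo
{
    // Serializes through a StringWriter, the same way the question does.
    public static string Serialize(List<string> items)
    {
        var xs = new XmlSerializer(typeof(List<string>));
        using (var writer = new StringWriter())
        {
            xs.Serialize(writer, items);
            return writer.ToString();
        }
    }

    // Returns false when the serializer rejects the document.
    public static bool CanDeserialize(string xml)
    {
        var xs = new XmlSerializer(typeof(List<string>));
        try
        {
            using (var reader = new StringReader(xml))
            {
                xs.Deserialize(reader);
                return true;
            }
        }
        catch (InvalidOperationException)
        {
            // Assumption: the reader's XmlException arrives wrapped in an
            // InvalidOperationException ("There is an error in XML document ...").
            return false;
        }
    }
}
```

Serializing a list that contains "a\u001Ab" should give a document that holds the text &#x1A; but no raw U+001A character, which is why Replace((char)0x1A, ' ') has nothing to replace, and CanDeserialize should return false for that document.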
There are several options for what you can do.
1) Don't let the XmlSerializer create invalid documents in the first place
You can use an XmlWriter, which by default will not allow invalid characters. Serialization will then throw an exception, which has to be handled and an appropriate error shown.
This probably isn't useful for you because you have data already stored with these invalid characters.
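A minimal sketch of option 1, which also shows the CheckCharacters switch mentioned in the earlier answer. The names CheckedSerialization, Serialize and the checkCharacters parameter are hypothetical. XmlWriter.Create and XmlWriterSettings.CheckCharacters are the real APIs, but the wrapping exception type is an assumption based on how XmlSerializer usually reports writer errors.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class CheckedSerialization
{
    // checkCharacters = true  : the writer rejects characters that are invalid in XML 1.0.
    // checkCharacters = false : the writer emits them as numeric character references.
    public static string Serialize<T>(T item, bool checkCharacters)
    {
        var xs = new XmlSerializer(typeof(T));
        var settings = new XmlWriterSettings { CheckCharacters = checkCharacters };
        using (var sw = new StringWriter())
        {
            using (XmlWriter xw = XmlWriter.Create(sw, settings))
            {
                // Assumption: a writer error surfaces here as an
                // InvalidOperationException wrapping the original exception.
                xs.Serialize(xw, item);
            }
            return sw.ToString();
        }
    }
}
```

Calling Serialize("a\u001Ab", true) should throw at serialization time, so bad data never gets stored. Calling it with false should produce a document containing &#x1A;, which XmlSerializer will later refuse to read back.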
or 2) Strip out references to this invalid character
That is, instead of .Replace((char)0x1a, ' '), which isn't actually replacing anything in your document at the moment, use .Replace("&#x1A;", " "). (This isn't case-insensitive, but it is what .NET generates. A more robust solution would be to use a case-insensitive regex.)
As an aside, XML 1.1 actually allows references to control characters, as long as they are references and not plain characters in the document. This would solve your problem apart from the fact that the .NET XmlSerializer doesn't support version 1.1.
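The case-insensitive regex suggested in option 2 might look like the sketch below. ReferenceStripper and ReplaceSubstituteReferences are hypothetical names. The pattern also accepts leading zeros and the decimal form &#26;, which goes slightly beyond what the answer describes.

```csharp
using System.Text.RegularExpressions;

public static class ReferenceStripper
{
    // Matches references to U+001A only: &#x1A;, &#x1a;, &#x001A;, &#26;, ...
    private static readonly Regex SubstituteReference =
        new Regex(@"&#(?:x0*1a|0*26);", RegexOptions.IgnoreCase);

    // Replaces each matching reference with a single space.
    public static string ReplaceSubstituteReferences(string xml)
    {
        return SubstituteReference.Replace(xml, " ");
    }
}
```

In the question's DeserializeXMLToObject, this call would take the place of xmlText.Replace((char)0x1A, ' ') before the text is handed to the reader.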
If you have existing data where you have serialised a class which contains characters which cannot subsequently be deserialised you can sanitise the data with the following method:
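The original method is not preserved here, so what follows is only a sketch of the idea, not the author's code. XmlSanitizer, Sanitize and IsValidXmlCodePoint are hypothetical names. It assumes the data contains no CDATA sections, where something that looks like a character reference is literal text and should be left alone.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Matches decimal (&#26;) and hexadecimal (&#x1A;) character references.
    private static readonly Regex CharacterReference =
        new Regex(@"&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // The XML 1.0 Char production.
    public static bool IsValidXmlCodePoint(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static string Sanitize(string xml)
    {
        if (string.IsNullOrEmpty(xml)) return xml;

        // Step 1: drop references whose target is not a valid character.
        string text = CharacterReference.Replace(xml, StripIfInvalid);

        // Step 2: drop raw invalid characters, keeping well-formed surrogate pairs.
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                sb.Append(c).Append(text[++i]);
            }
            else if (IsValidXmlCodePoint(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    private static string StripIfInvalid(Match m)
    {
        long cp;
        bool parsed;
        if (m.Groups[1].Success)
        {
            parsed = long.TryParse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier,
                                   CultureInfo.InvariantCulture, out cp);
        }
        else
        {
            parsed = long.TryParse(m.Groups[2].Value, NumberStyles.None,
                                   CultureInfo.InvariantCulture, out cp);
        }
        // Keep the reference only if it parsed and points at a valid character.
        return parsed && IsValidXmlCodePoint(cp) ? m.Value : string.Empty;
    }
}
```

Unlike the single Replace in Porges' option 2, this removes the offending references and characters outright instead of substituting a space.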
The preferable solution is to not allow serialization of invalid characters at the point of serialization, as per point 1 of Porges' answer. This code covers point 2 of Porges' answer (strip out references to this invalid character) and strips out all invalid characters. The above code was written to solve a problem where we had stored serialized data in a database field, so we needed to fix legacy data, and solving the problem at the point of serialization was not an option.