Dealing with invalid XML hexadecimal characters

2020-01-31 01:34发布

I'm trying to send an XML document over the wire but receiving the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---

I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?

I'd like to keep the original characters one way or another.

标签: c# xml .net-3.5
8条回答
我欲成王,谁敢阻挡
2楼-- · 2020-01-31 02:30

The following solution removes any invalid XML characters, but it does so I think about as performantly as it could be done, and in particular, it does not allocate a new StringBuilder as well as a new string, not unless it is already determined that the string has any invalid characters in it. So the hot spot ends up being just a single for loop on the characters, with the check ending up being often no more than two greater than / lesser than numeric comparisons on each char. If none are found, it simply returns the original string. This is particularly helpful when the vast majority of strings are just fine to start with, it's nice to have these as in and out (with no wasted allocs etc) as quick as possible.

-- update --

See below how one can also directly write an XElement that has these invalid characters, though it uses this code --

Some of this code was influenced by Mr. Tom Bogle's solution here. See also on that same thread the helpful information in the post by superlogical. All of these, however, always instantiate a new StringBuilder and string still.

USAGE:

    string xmlStrBack = XML.ToValidXmlCharactersString("any string");

TEST:

    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
        string goodString = "My name is Inigo Montoya!";

        string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
        string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string

        XElement x1 = new XElement("test", back1);
        XElement x2 = new XElement("test", back2);
        XElement x3WithBadString = new XElement("test", badString);

        string xml1 = x1.ToString();
        string xml2 = x2.ToString().Print();

        string xmlShouldFail = x3WithBadString.ToString();
    }

// --- CODE --- (I have these methods in a static utility class called XML)

    /// <summary>
    /// Determines if any invalid XML 1.0 characters exist within the string,
    /// and if so it returns a new string with the invalid chars removed, else 
    /// the same string is returned (with no wasted StringBuilder allocated, etc).
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">The index to begin checking at.</param>
    public static string ToValidXmlCharactersString(string s, int startIndex = 0)
    {
        int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
        if (firstInvalidChar < 0)
            return s;

        startIndex = firstInvalidChar;

        int len = s.Length;
        var sb = new StringBuilder(len);

        if (startIndex > 0)
            sb.Append(s, 0, startIndex);

        for (int i = startIndex; i < len; i++)
            if (IsLegalXmlChar(s[i]))
                sb.Append(s[i]);

        return sb.ToString();
    }

    /// <summary>
    /// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">Start index.</param>
    public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
    {
        if (s != null && s.Length > 0 && startIndex < s.Length) {

            if (startIndex < 0) startIndex = 0;
            int len = s.Length;

            for (int i = startIndex; i < len; i++)
                if (!IsLegalXmlChar(s[i]))
                    return i;
        }
        return -1;
    }

    /// <summary>
    /// Indicates whether a given character is valid according to the XML 1.0 spec.
    /// This code represents an optimized version of Tom Bogle's on SO: 
    /// https://stackoverflow.com/a/13039301/264031.
    /// </summary>
    public static bool IsLegalXmlChar(char c)
    {
        if (c > 31 && c <= 55295)
            return true;
        if (c < 32)
            return c == 9 || c == 10 || c == 13;
        return (c >= 57344 && c <= 65533) || c > 65535;
        // final comparison is useful only for integral comparison, if char c -> int c, useful for utf-32 I suppose
        //c <= 1114111 */ // impossible to get a code point bigger than 1114111 because Char.ConvertToUtf32 would have thrown an exception
    }

======== ======== ========

Write XElement.ToString directly

======== ======== ========

First, the usage of this extension method:

string result = xelem.ToStringIgnoreInvalidChars();

-- Fuller test --

    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'

        XElement x = new XElement("test", badString);

        string xml1 = x.ToStringIgnoreInvalidChars();                               
        //result: <test>My name is Inigo Montoya</test>

        string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
        //result: <test>My name is Inigo Mont&#x1E;oya</test>
    }

--- code ---

    /// <summary>
    /// Writes this XML to string while allowing invalid XML chars to either be
    /// simply removed during the write process, or else encoded into entities, 
    /// instead of having an exception occur, as the standard XmlWriter.Create 
    /// XmlWriter does (which is the default writer used by XElement).
    /// </summary>
    /// <param name="xml">XElement.</param>
    /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
    /// <param name="indent">Indent setting.</param>
    /// <param name="indentChar">Indent char (leave null to use default)</param>
    public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
    {
        if (xml == null) return null;

        StringWriter swriter = new StringWriter();
        using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {

            // -- settings --
            // unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
            writer.Formatting = indent ? Formatting.Indented : Formatting.None;

            if (indentChar != null)
                writer.IndentChar = (char)indentChar;

            // -- write --
            xml.WriteTo(writer); 
        }

        return swriter.ToString();
    }

-- this uses the following XmlTextWritter --

public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
    public bool DeleteInvalidChars { get; set; }

    public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
    {
        DeleteInvalidChars = deleteInvalidChars;
    }

    public override void WriteString(string text)
    {
        if (text != null && DeleteInvalidChars)
            text = XML.ToValidXmlCharactersString(text);
        base.WriteString(text);
    }
}
查看更多
走好不送
3楼-- · 2020-01-31 02:35

Another way to remove incorrect XML chars in C# with using XmlConvert.IsXmlChar Method (Available since .NET Framework 4.0)

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

.Net Fiddle - https://dotnetfiddle.net/v1TNus

For example, the vertical tab symbol (\v) is not valid for XML, it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.

查看更多
登录 后发表回答