Dealing with invalid XML hexadecimal characters

Posted 2020-01-31 01:34

I'm trying to send an XML document over the wire but receiving the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---

I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?

I'd like to keep the original characters one way or another.

Tags: c# xml .net-3.5
8 answers
趁早两清
Reply #2 · 2020-01-31 02:11

I'm on the receiving end of @parapurarajkumar's solution, where the illegal characters are being properly loaded into XmlDocument, but breaking XmlWriter when I'm trying to save the output.

My Context

I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine I pretty-print the XML with XmlWriter.

During a website attack, I noticed that some XML documents weren't parsing, and I was receiving this exception: '.', hexadecimal value 0x00, is an invalid character.

NON-RESOLUTION: I converted the document to a byte[] and scanned it for 0x00 bytes to strip, but found none.

When I scanned the xml document, I found the following:

...
<form>
...
<item name="SomeField">
   <value
     string="C:\boot.ini&#x0;.htm" />
 </item>
...

The nul byte was there, but encoded as the numeric character reference &#x0; rather than as a raw byte!

RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading it into my XmlDocument, because loading it will create the nul byte and it will be difficult to sanitize it from the object. Here's my entire process:

XmlDocument xml = new XmlDocument();
details.Xml = details.Xml.Replace("&#x0;", "[0x00]");  // in my case I wanted to see it, otherwise just replace with ""
xml.LoadXml(details.Xml);

string formattedXml = null;

// I stuff this all in a helper function, but put it in-line for this example
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
    OmitXmlDeclaration = true,
    Indent = true,
    IndentChars = "\t",
    NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
    xml.Save(writer);
    formattedXml = sb.ToString();
}

LESSON LEARNED: if your incoming data is HTML-encoded on entry, sanitize for illegal characters in their encoded form (the numeric character reference, e.g. &#x0;), not just as raw bytes.
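The Replace call above only catches the exact spelling &#x0;. A minimal sketch of a more general pass follows, assuming you want to neutralize every decimal or hexadecimal character reference that resolves to a code point outside the XML 1.0 Char range. The class and method names (CharRefSanitizer, ReplaceIllegalReferences) and the bracketed marker format are hypothetical, not part of any library.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CharRefSanitizer
{
    // Matches decimal (&#0;) and hexadecimal (&#x0;) numeric character references.
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // True if the code point is allowed by the XML 1.0 Char production.
    private static bool IsLegalXmlCodePoint(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Rewrites illegal references such as "&#x0;" to a visible marker such as
    // "[#x0]" and leaves legal references (for example "&#65;") untouched.
    public static string ReplaceIllegalReferences(string xml)
    {
        return NumericRef.Replace(xml, delegate(Match m)
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            NumberStyles style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;

            long cp;
            bool parsed = long.TryParse(digits, style, CultureInfo.InvariantCulture, out cp);
            if (parsed && IsLegalXmlCodePoint(cp))
            {
                return m.Value;
            }

            // Drop the leading '&' and trailing ';' so the parser no longer
            // sees a character reference, but the value stays readable.
            return "[" + m.Value.Substring(1, m.Value.Length - 2) + "]";
        });
    }
}
```

Run it on the raw string before XmlDocument.LoadXml, in the same place as the Replace call. Return "" from the delegate instead of the marker if you would rather drop the reference entirely.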

萌系小妹纸
Reply #3 · 2020-01-31 02:20
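// UTF-8 rather than ASCII: the ASCII encoder replaces every non-ASCII character with '?'
byte[] toEncodeAsBytes
            = System.Text.Encoding.UTF8.GetBytes(toEncode);
      string returnValue
            = System.Convert.ToBase64String(toEncodeAsBytes);

Base64-encoding the string like this is one way of doing it.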
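Base64 only preserves the original characters if the sender and the receiver agree on the byte encoding. Here is a minimal round-trip sketch, assuming UTF-8 on both sides (ASCII would turn every non-ASCII character into '?'). The Base64Text class and its method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Base64Text
{
    // Encodes the text as UTF-8 bytes, then as Base64. The result contains
    // only letters, digits, '+', '/', and '=', all of which are legal in XML.
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Reverses Encode: Base64 back to bytes, then UTF-8 bytes back to text.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

For example, Base64Text.Encode("a\u0002b") yields "YQJi", and Base64Text.Decode("YQJi") returns the original three characters, control character included. The trade-off is that the element content is no longer human-readable, and whoever consumes the XML has to know to decode it.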

地球回转人心会变
Reply #4 · 2020-01-31 02:24

Can't the string be cleaned with System.Net.WebUtility.HtmlDecode()?

家丑人穷心不美
Reply #5 · 2020-01-31 02:25

This works for me:

XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
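
For context, here is a minimal sketch of how those settings might be used, assuming you create the XmlWriter yourself. In the question the writer is created inside the EWS library (EwsServiceXmlWriter in the stack trace), so this may not be applicable there. UncheckedWriterDemo and WriteBody are hypothetical names.

```csharp
using System;
using System.Text;
using System.Xml;

public static class UncheckedWriterDemo
{
    // Writes <Body>...</Body> with character checking switched off, so
    // a character such as U+0002 no longer raises ArgumentException.
    public static string WriteBody(string body)
    {
        StringBuilder sb = new StringBuilder();
        XmlWriterSettings settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            CheckCharacters = false
        };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", body);
        }
        // Read the buffer only after the writer is disposed, so it is flushed.
        return sb.ToString();
    }
}
```

Calling UncheckedWriterDemo.WriteBody("a\u0002b") completes without throwing. Keep in mind that the result is not well-formed XML 1.0, so the receiving side has to be equally lenient, for instance an XmlReader created with XmlReaderSettings.CheckCharacters = false. A strict parser will still reject the document.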
一夜七次
Reply #6 · 2020-01-31 02:25

The following code removes XML invalid characters from a string and returns a new string without them:

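// Requires: using System.Text.RegularExpressions;
public static string CleanInvalidXmlChars(string text)
{
    // From xml spec valid chars:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    // In a .NET regex \x takes exactly two hex digits, so wider values need \uXXXX.
    // .NET strings are UTF-16, so #x10000-#x10FFFF appear as surrogate pairs:
    // keep well-formed pairs and remove only unpaired surrogates.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"   // illegal control chars, FFFE, FFFF
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"        // high surrogate with no low after it
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";      // low surrogate with no high before it
    return Regex.Replace(text, re, "");
}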
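
Removing characters loses information, and the question asks to keep the originals "one way or another". As a variation, here is a sketch that swaps each illegal character for a visible marker such as [0x02], in the spirit of the [0x00] replacement in reply #2. The names XmlCharMarker and MarkInvalidXmlChars and the marker format are assumptions made for illustration.

```csharp
using System;
using System.Text;

public static class XmlCharMarker
{
    // Returns a copy of text in which every character that XML 1.0 forbids
    // is replaced by a marker of the form [0xNN]; everything else is kept.
    public static string MarkInvalidXmlChars(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes #x10000-#x10FFFF: keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r'
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            else
            {
                // Control characters, unpaired surrogates, FFFE and FFFF end up here.
                sb.Append("[0x").Append(((int)c).ToString("X2")).Append(']');
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, MarkInvalidXmlChars("a\u0002b") gives "a[0x02]b". The marker is not reversible if the text can already contain a literal "[0x02]", so use Base64 or another unambiguous scheme when the receiver must reconstruct the exact input.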
爱情/是我丢掉的垃圾
Reply #7 · 2020-01-31 02:29

There is a generic solution that works nicely:

public class XmlTextTransformWriter : System.Xml.XmlTextWriter
{
    public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
    public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
    public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }

    public Func<string, string> TextTransform = s => s;

    public override void WriteString(string text)
    {
        base.WriteString(TextTransform(text));
    }

    public override void WriteCData(string text)
    {
        base.WriteCData(TextTransform(text));
    }

    public override void WriteComment(string text)
    {
        base.WriteComment(TextTransform(text));
    }

    public override void WriteRaw(string data)
    {
        base.WriteRaw(TextTransform(data));
    }

    public override void WriteValue(string value)
    {
        base.WriteValue(TextTransform(value));
    }
}

Once this is in place, you can derive a writer from it that strips invalid characters:

public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
    public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }

    void SetTransform()
    {
        TextTransform = XmlUtil.RemoveInvalidXmlChars;
    }
}

where XmlUtil.RemoveInvalidXmlChars is defined as follows:

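    // Requires: using System.Linq;
    // XmlConvert.IsXmlChar is available from .NET Framework 4.0 onward.
    public static string RemoveInvalidXmlChars(string content)
    {
        // Only allocate a new string when there is something to remove.
        if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
            return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
        else
            return content;
    }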