How do you remove invalid hexadecimal characters f

2019-01-03 12:57发布

Is there any easy/general way to clean an XML based data source prior to using it in an XmlReader so that I can gracefully consume XML data that is non-conformant to the hexadecimal character restrictions placed on XML?

Note:

  • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding at the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
  • The removal of invalid hexadecimal characters should only remove hexadecimal encoded values, as you can often find href values in data that happens to contains a string that would be a string match for a hexadecimal character.

Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but want to be able to consume data sources that have been published which contain invalid hexadecimal characters per the XML specification.

In .NET if you have a Stream that represents the XML data source, and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised due to the inclusion of invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.

13条回答
做个烂人
2楼-- · 2019-01-03 13:42

Use this function to remove invalid xml characters.

public static string CleanInvalidXmlChars(string text)   
{   
       string re = @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]";   
       return Regex.Replace(text, re, "");   
} 
查看更多
放荡不羁爱自由
3楼-- · 2019-01-03 13:44

The above solutions seem to be for removing invalid characters prior to converting to XML.

Use this code to remove invalid XML characters from an XML string. eg. &x1A;

    public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
    {
        string pattern = String.Empty;
        switch( XMLVersion )
        {
            case "1.0":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
                break;
            case "1.1":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
                break;
            default:
                throw new Exception( "Error: Invalid XML Version!" );
        }

        Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
        if( regex.IsMatch( Xml ) )
            Xml = regex.Replace( Xml, String.Empty );
        return Xml;
    }

http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/

查看更多
看我几分像从前
4楼-- · 2019-01-03 13:48
private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    byte[] byteArr = inString.getBytes();
    for ( int i=0; i < byteArr.length; i++ ) {
        byte ch= byteArr[i]; 
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr );
}
查看更多
聊天终结者
5楼-- · 2019-01-03 13:49

As the way to remove invalid XML characters I suggest you to use XmlConvert.IsXmlChar method. It was added since .NET Framework 4 and is presented in Silverlight too. Here is the small sample:

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
查看更多
仙女界的扛把子
6楼-- · 2019-01-03 13:49

Modified answer or original answer by Neolisk above.
Changes: of \0 character is passed, removal is done, rather than a replacement. also, made use of XmlConvert.IsXmlChar(char) method

    /// <summary>
    /// Replaces invalid Xml characters from input file, NOTE: if replacement character is \0, then invalid Xml character is removed, instead of 1-for-1 replacement
    /// </summary>
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Peek(); // peek at the next one

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Read(); // read next one

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount= 0, ch;

            for (int i = 0; i < count && (ch = Read()) != -1; i++)
            {
                readCount++;
                buffer[index + i] = (char)ch;
            }

            return readCount;
        }


        private static bool IsInvalidChar(int ch)
        {
            return !XmlConvert.IsXmlChar((char)ch);
        }
    }
查看更多
叛逆
7楼-- · 2019-01-03 13:52

It may not be perfect (emphasis added since people missing this disclaimer), but what I've done in that case is below. You can adjust to use with a stream.

/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder();
    char ch;

    for (int i = 0; i < inString.Length; i++)
    {

        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        //if using .NET version prior to 4, use above logic
        if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();

}
查看更多
登录 后发表回答