I am starting with a string containing an encoded Unicode character, "&#xfc;". I pass the string to an object that performs some logic and returns another string, in which the original encoded character has been converted to its literal Unicode equivalent, "ü".
I need to get the original encoded character back, but so far I have not been able to.
I have tried the HttpUtility.HtmlEncode() method, but that returns "&#252;", which is not the same.
Can anyone help?
They are pretty much the same, at least for display purposes. HttpUtility.HtmlEncode uses decimal encoding, which has the format &#DECIMAL;, while your original version uses hexadecimal encoding, i.e. the format &#xHEX;. Since fc in hex is 252 in decimal, the two references are equivalent.
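You can verify the equivalence in code (just a quick sanity check):
int fromHex = Convert.ToInt32("fc", 16); // parse the hex digits of &#xfc;
Console.WriteLine(fromHex);              // prints 252, the decimal form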
If you really need the hex-encoded version, then consider parsing out the decimal value and converting it to hex before stuffing it back into the &#xHEX; format. Something like:
string unicode = "ü";
string decimalEncoded = HttpUtility.HtmlEncode(unicode); // "&#252;"
int codePoint = int.Parse(decimalEncoded.Substring(2, decimalEncoded.Length - 3)); // strip "&#" and ";"
string hexEncoded = string.Format("&#x{0:X};", codePoint); // "&#xFC;"
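If the string may contain more than one encoded character, a more general variant (just a sketch; it assumes every entity HtmlEncode emits is a decimal numeric reference) rewrites them all with a regex:
string hexEncoded = System.Text.RegularExpressions.Regex.Replace(
    HttpUtility.HtmlEncode(unicode),
    @"&#(\d+);",
    m => string.Format("&#x{0:X};", int.Parse(m.Groups[1].Value)));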
Or you can try this code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Web;
using System.Configuration;
using System.Globalization;

namespace SimpleCGIEXE
{
    class Program
    {
        static string Uni2Html(string src)
        {
            // UrlEncodeUnicode escapes non-ASCII characters as %uXXXX and
            // reserved ASCII characters as %XX (spaces become '+').
            string temp1 = HttpUtility.UrlEncodeUnicode(src);
            string temp2 = temp1.Replace('+', ' ');
            string res = string.Empty;
            int pos1 = 0, pos2 = 0;
            while (true)
            {
                pos2 = temp2.IndexOf("%", pos1);
                if (pos2 < 0) break;
                if (temp2[pos2 + 1] == 'u')
                {
                    // %uXXXX: rewrite as a hexadecimal numeric character reference
                    res += temp2.Substring(pos1, pos2 - pos1);
                    res += "&#x";
                    res += temp2.Substring(pos2 + 2, 4);
                    res += ";";
                    pos1 = pos2 + 6;
                }
                else
                {
                    // %XX: decode the escaped ASCII character back to its literal form
                    res += temp2.Substring(pos1, pos2 - pos1);
                    string stASCII = temp2.Substring(pos2 + 1, 2);
                    byte[] pdASCII = new byte[1];
                    pdASCII[0] = byte.Parse(stASCII, NumberStyles.AllowHexSpecifier);
                    res += Encoding.ASCII.GetString(pdASCII);
                    pos1 = pos2 + 3;
                }
            }
            res += temp2.Substring(pos1); // copy the unescaped tail
            return res;
        }

        static void Main(string[] args)
        {
            Console.WriteLine("Content-type: text/html;charset=utf-8\r\n");
            string st = "Vietnamese string: Thử một xâu unicode @@ # ~ .^ % !";
            Console.WriteLine(Uni2Html(st) + "<br>");
            st = "A Chinese string: 我爱你 (I love you)";
            Console.WriteLine(Uni2Html(st) + "<br>");
        }
    }
}
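Note that UrlEncodeUnicode is marked obsolete in later framework versions. If you'd rather avoid the URL-encoding round trip entirely, here is a minimal sketch of the same idea (my own variant, not the code above) that walks the string by code point:
static string Uni2HtmlAlt(string src)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < src.Length; i++)
    {
        int cp = char.ConvertToUtf32(src, i); // handles surrogate pairs as well
        if (char.IsSurrogatePair(src, i)) i++; // skip the low surrogate we just consumed
        if (cp > 0x7F) sb.AppendFormat("&#x{0:X};", cp);
        else sb.Append((char)cp);
    }
    return sb.ToString();
}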
I just had to sort this out yesterday.
It's a bit more complicated than just looking at a single character; you need to roll your own HtmlEncode() method. Strings in the .NET world are UTF-16 encoded, while Unicode code points (what an HTML numeric character reference identifies) are 32-bit unsigned integer values. This is mostly an issue if you have to deal with characters outside Unicode's "basic multilingual plane", which occupy two char values in a string.
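For example, here is a quick illustration of the surrogate-pair issue (U+1D11E, MUSICAL SYMBOL G CLEF, lies outside the BMP):
string s = "\U0001D11E";
Console.WriteLine(s.Length);                                // 2 -- one code point, two chars
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1D11E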
This code should do what you want:
using System;
using System.Configuration;
using System.Globalization;
using System.Collections.Generic;
using System.Text;

namespace TestDrive
{
    class Program
    {
        static void Main()
        {
            string src = "foo \U0001D11E bar" ; // contains a supplementary-plane character (U+1D11E)
            string converted = HtmlEncode(src) ;
            Console.WriteLine( converted ) ;
        }

        static string HtmlEncode( string s )
        {
            //
            // In the .Net world, strings are UTF-16 encoded. That means that Unicode codepoints greater
            // than 0xFFFF are encoded in the string as two-character surrogate pairs. So to properly turn
            // them into HTML numeric character references (decimal or hex), we first need to get the
            // UTF-32 encoding.
            //
            uint[] utf32Chars = StringToArrayOfUtf32Chars( s ) ;
            StringBuilder sb = new StringBuilder( 2000 ) ; // set a reasonable initial size for the buffer
            // iterate over the utf-32 encoded characters
            foreach ( uint codePoint in utf32Chars )
            {
                if ( codePoint > 0x0000007F )
                {
                    // if the code point is greater than 0x7F, it gets turned into an HTML numeric character reference
                    sb.AppendFormat( "&#x{0:X};" , codePoint ) ; // hex escape sequence
                    //sb.AppendFormat( "&#{0};" , codePoint ) ; // decimal escape sequence
                }
                else
                {
                    // if less than or equal to 0x7F, it goes into the string as-is,
                    // except for the 5 SGML/XML/HTML reserved characters. You might
                    // want to also escape all the ASCII control characters (those chars
                    // in the range 0x00 - 0x1F).
                    // convert the code unit to a UTF-16 character
                    char ch = Convert.ToChar( codePoint ) ;
                    // do the needful.
                    switch ( ch )
                    {
                        case '"'  : sb.Append( "&quot;" ) ; break ;
                        case '\'' : sb.Append( "&#39;"  ) ; break ;
                        case '&'  : sb.Append( "&amp;"  ) ; break ;
                        case '<'  : sb.Append( "&lt;"   ) ; break ;
                        case '>'  : sb.Append( "&gt;"   ) ; break ;
                        default   : sb.Append( ch )       ; break ;
                    }
                }
            }
            // return the escaped, utf-16 string back to the caller.
            string encoded = sb.ToString() ;
            return encoded ;
        }

        /// <summary>
        /// Convert a UTF-16 encoded .Net string into an array of UTF-32 encoded Unicode chars
        /// </summary>
        /// <param name="s"></param>
        /// <returns></returns>
        private static uint[] StringToArrayOfUtf32Chars( string s )
        {
            byte[] bytes = Encoding.UTF32.GetBytes( s ) ; // little-endian UTF-32 by default
            uint[] utf32Chars = new uint[ bytes.Length / sizeof(uint) ] ;
            for ( int i = 0 , j = 0 ; i < bytes.Length ; i += 4 , ++j )
            {
                utf32Chars[ j ] = BitConverter.ToUInt32( bytes , i ) ;
            }
            return utf32Chars ;
        }
    }
}
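For what it's worth: with the hex format string above, HtmlEncode("ü") yields &#xFC;, which matches your original &#xfc; form up to the case of the hex digits; use {0:x} instead of {0:X} if you need lowercase.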
Hope this helps!