I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.
Consider:
using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value); // byte array, one byte per char here
            char[] array = value.ToCharArray(); // char array
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char letter = array[i];
                byte byteValue = asciiValue[i];  // result of the ASCII encoder
                Int32 int32Value = array[i];     // char widened to a 32-bit integer
                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine(); // keep the console window open
        }
    }
}
Output from program
CHAR BYTE INT32
S 83 83
l 108 108
i 105 105
d 100 100
e 101 101
T 63 8482 <- trademark symbol
1 49 49
½ 63 189 <- fraction
" 63 8221 <- smartquotes
C 67 67
4 52 52
r 63 174 <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as int32, but all show up as 63 when cast as the byte value. What's going on here?
The ASCII.GetBytes conversion replaces every character outside the ASCII range (0-127) with a question mark (code 63). Since your string contains characters outside that range, your asciiValue array holds ? in place of all the interesting symbols like ™. Its Char (Unicode) representation is 8482, which is indeed outside the 0-127 range.
Converting a string to a char array does not modify the values of the characters, so you still have the original Unicode codes (char is essentially a 16-bit unsigned integer, UInt16). Casting it to the longer integer type Int32 does not change the value.
Below are possible conversions of that character into bytes/integers:
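As a minimal sketch of those conversions for the ™ character (U+2122, decimal 8482): the class name CharConversionDemo and the helpers ToAsciiByte and TruncateToByte are hypothetical names made up for this illustration, and the UTF-8 line is an extra comparison that the question did not ask about.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative, not part of any library.
public static class CharConversionDemo
{
    // Byte produced by the ASCII encoder for a single character.
    // Characters above 127 fall back to '?' (63) with the default Encoding.ASCII.
    public static byte ToAsciiByte(char c) => Encoding.ASCII.GetBytes(new[] { c })[0];

    // A plain unchecked cast keeps only the low 8 bits of the 16-bit code.
    public static byte TruncateToByte(char c) => unchecked((byte)c);

    public static void Main()
    {
        char tm = '\u2122'; // trademark sign

        Console.WriteLine(ToAsciiByte(tm));     // 63, not an ASCII character
        Console.WriteLine(TruncateToByte(tm));  // 34 = 8482 % 256
        Console.WriteLine((ushort)tm);          // 8482, char is 16 bits wide
        Console.WriteLine((int)tm);             // 8482, widening keeps the value

        // UTF-8 can represent the character, using three bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(new[] { tm }))); // E2-84-A2
    }
}
```

Note that neither byte result is reversible: the encoder's 63 and the truncated 34 both lose the original code, so if the goal is to find such characters, comparing the char value against 127 is more reliable than inspecting the bytes.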
Details are available in the ASCIIEncoding Class documentation.