I'm calling File.ReadAllText() in a program designed to format some files that I have. Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file. The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
Most likely the file uses a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText Method (String, Encoding) overload.
Code sample:
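A minimal sketch of the difference between the two overloads. It assumes, purely for illustration, that the file is ISO-8859-1; the temp file and its contents are made up so the example can run on its own.

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithEncoding
{
    static void Main()
    {
        // Hypothetical test file: "Acme" followed by a single 0xAE byte,
        // which is how ISO-8859-1 stores the ® character.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x41, 0x63, 0x6D, 0x65, 0xAE });

        // One-parameter overload: decodes as UTF-8, where a lone 0xAE is invalid.
        string wrong = File.ReadAllText(path);

        // Two-parameter overload: decode with the encoding the file really uses.
        string right = File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine((int)wrong[wrong.Length - 1]); // 65533
        Console.WriteLine((int)right[right.Length - 1]); // 174

        File.Delete(path);
    }
}
```

If your files are in a different encoding, swap the name passed to Encoding.GetEncoding accordingly.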
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)

The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
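Here is one rough way to do both steps, as a sketch rather than real detection. GuessEncoding and ReadText are hypothetical helper names, and the fallback to ISO-8859-1 is an assumption you would have to confirm for your own files: the code only checks whether the bytes are valid UTF-8 and otherwise uses the fallback.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: UTF-8 if the bytes decode cleanly as UTF-8,
    // otherwise ISO-8859-1 (an assumed fallback, not a detected encoding).
    public static Encoding GuessEncoding(byte[] bytes)
    {
        // throwOnInvalidBytes: true raises an exception on invalid sequences
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("ISO-8859-1");
        }
    }

    // Hypothetical helper: pass the guessed encoding as the second argument.
    public static string ReadText(string path)
    {
        Encoding encoding = GuessEncoding(File.ReadAllBytes(path));
        return File.ReadAllText(path, encoding);
    }

    static void Main()
    {
        // Made-up sample: "A" followed by ® stored as the single byte 0xAE.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x41, 0xAE });
        Console.WriteLine((int)ReadText(path)[1]); // 174
        File.Delete(path);
    }
}
```

Any byte sequence decodes without error as ISO-8859-1, so this check cannot tell ISO-8859-1 apart from other single-byte encodings. It only rules UTF-8 in or out.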
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are reading is the Unicode replacement character (U+FFFD, decimal 65533): http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default ReadAllText expects UTF-8. It is encountering a byte sequence that does not represent a valid UTF-8 character, so it substitutes the replacement character.
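You can see the substitution without touching a file. This is a small sketch that assumes the ® in your file is stored as the single byte 0xAE (174), as it would be in ISO-8859-1:

```csharp
using System;
using System.Text;

class ReplacementCharDemo
{
    static void Main()
    {
        // In UTF-8, ® (U+00AE) is the two-byte sequence C2 AE.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00AE");
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C2-AE

        // A lone AE byte is a continuation byte with no lead byte,
        // so it is not valid UTF-8 and decodes to U+FFFD.
        string decoded = Encoding.UTF8.GetString(new byte[] { 0xAE });
        Console.WriteLine((int)decoded[0]);     // 65533
        Console.WriteLine(decoded == "\uFFFD"); // True
    }
}
```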