How to convert (transliterate) a string from utf8

2020-01-29 05:05发布

问题:

I have a string object

"with multiple characters and even special characters"

I am trying to use

UTF8Encoding utf8 = new UTF8Encoding();
ASCIIEncoding ascii = new ASCIIEncoding();

objects in order to convert that string to ascii. May I ask someone to bring some light to this simple task, that is hunting my afternoon.

EDIT 1: What we are trying to accomplish is getting rid of special characters like some of the special windows apostrophes. The code that I posted below as an answer will not take care of that. Basically

O'Brian will become O?Brian. where ' is one of the special apostrophes

回答1:

This was in response to your other question, that looks like it's been deleted....the point still stands.

Looks like a classic Unicode to ASCII issue. The trick would be to find where it's happening.

.NET works fine with Unicode, assuming it's told it's Unicode to begin with (or left at the default).

My guess is that your receiving app can't handle it. So, I'd probably use the ASCIIEncoder with an EncoderReplacementFallback with String.Empty:

using System.Text;

string inputString = GetInput();
var encoder = ASCIIEncoding.GetEncoder();
encoder.Fallback = new EncoderReplacementFallback(string.Empty);

byte[] bAsciiString = encoder.GetBytes(inputString);

// Do something with bytes...
// can write to a file as is
File.WriteAllBytes(FILE_NAME, bAsciiString);
// or turn back into a "clean" string
string cleanString = ASCIIEncoding.GetString(bAsciiString); 
// since the offending bytes have been removed, can use default encoding as well
Assert.AreEqual(cleanString, Default.GetString(bAsciiString));

Of course, in the old days, we'd just loop though and remove any chars greater than 127...well, those of us in the US at least. ;)



回答2:

I was able to figure it out. In case someone wants to know below the code that worked for me:

ASCIIEncoding ascii = new ASCIIEncoding();
byte[] byteArray = Encoding.UTF8.GetBytes(sOriginal);
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, byteArray);
string finalString = ascii.GetString(asciiArray);

Let me know if there is a simpler way o doing it.



回答3:

For anyone who likes Extension methods, this one does the trick for us.

using System.Text;

namespace System
{
    public static class StringExtension
    {
        private static readonly ASCIIEncoding asciiEncoding = new ASCIIEncoding();

        public static string ToAscii(this string dirty)
        {
            byte[] bytes = asciiEncoding.GetBytes(dirty);
            string clean = asciiEncoding.GetString(bytes);
            return clean;
        }
    }
}

(System namespace so it's available pretty much automatically for all of our strings.)



回答4:

Based on Mark's answer above (and Geo's comment), I created a two liner version to remove all ASCII exception cases from a string. Provided for people searching for this answer (as I did).

using System.Text;

// Create encoder with a replacing encoder fallback
var encoder = ASCIIEncoding.GetEncoding("us-ascii", 
    new EncoderReplacementFallback(string.Empty), 
    new DecoderExceptionFallback());

string cleanString = encoder.GetString(encoder.GetBytes(dirtyString)); 


回答5:

If you want 8 bit representation of characters that used in many encoding, this may help you.

You must change variable targetEncoding to whatever encoding you want.

Encoding targetEncoding = Encoding.GetEncoding(874); // Your target encoding
Encoding utf8 = Encoding.UTF8;

var stringBytes = utf8.GetBytes(Name);
var stringTargetBytes = Encoding.Convert(utf8, targetEncoding, stringBytes);
var ascii8BitRepresentAsCsString = Encoding.GetEncoding("Latin1").GetString(stringTargetBytes);