How to store UTF-8 bytes from a C# String in a SQL

2019-02-24 13:35发布

问题:

I have an existing SQL Server 2000 database that stores UTF-8 representations of text in a TEXT column. I don't have the option of modifying the type of the column, and must be able to store non-ASCII Unicode data from a C# program into that column.

Here's the code:

sqlcmd.CommandText =
    "INSERT INTO Notes " +
    "(UserID, LocationID, Note) " +
    "VALUES (" +
        Note.UserId.ToString() + ", " +
        Note.LocationID.ToString() + ", " +
        "@note); " +
    "SELECT CAST(SCOPE_IDENTITY() AS BIGINT) ";

SqlParameter noteparam = new SqlParameter( "@note", System.Data.SqlDbType.Text, int.MaxValue );

At this point I've tried a few different ways to get my UTF-8 data into the parameter. For example:

// METHOD ONE
byte[] bytes = (byte[]) Encoding.UTF8.GetBytes( Note.Note );
char[] characters = bytes.Select( b => (char) b ).ToArray();
noteparam.Value = new String( characters );

I've also tried simply

// METHOD TWO
noteparam.Value = Note.Note;

And

// METHOD THREE
byte[] bytes = (byte[]) Encoding.UTF8.GetBytes( Note.Note );
noteparam.Value = bytes;

Continuing, here's the rest of the code:

sqlcmd.Parameters.Add( noteparam );
sqlcmd.Prepare();

try
    {
    Note.RecordId = (Int64) sqlcmd.ExecuteScalar();
    }
catch
    {
    return false;
    }

Method one (get UTF8 bytes into a string) does something strange -- I think it is UTF-8 encoding the string a second time.

Method two stores garbage.

Method three throws an exception in ExecuteScalar() claiming it can't convert the parameter to a String.

Things I already know, so no need telling me:

  • SQL Server 2000 is past/approaching end-of-life
  • TEXT columns are not meant for Unicode text
  • Seriously, SQL Server 2000 is old. You need to upgrade.

Any suggestions?

回答1:

If your database collation is SQL_Latin1_General_CP1 (the default for the U.S. edition of SQL Server 2000), then you can use the following trick to store Unicode text as UTF-8 in a char, varchar, or text column:

byte[] bytes = Encoding.UTF8.GetBytes(Note.Note);
noteparam.Value = Encoding.GetEncoding(1252).GetString(bytes);

Later, when you want to read back the text, reverse the process:

SqlDataReader reader;
// ...
byte[] bytes = Encoding.GetEncoding(1252).GetBytes((string)reader["Note"]);
string note = Encoding.UTF8.GetString(bytes);

If your database collation is not SQL_Latin1_General_CP1, then you will need to replace 1252 with the correct code page.

Note: If you look at the stored text in Enterprise Manager or Query Analyzer, you'll see strange characters in place of non-ASCII text, just as if you opened a UTF-8 document in a text editor that didn't support Unicode.

How it works: When storing Unicode text in a non-Unicode column, SQL Server automatically converts the text from Unicode to the code page specified by the database collation. Any Unicode characters that don't exist in the target code page will be irreversibly mangled, which is why your first two methods didn't work.

But you were on the right track with method one. The missing step is to "protect" the raw UTF-8 bytes by converting them to Unicode using the Windows-1252 code page. Now, when SQL Server performs the automatic conversion from Unicode to Windows-1252, it gets back the original UTF-8 bytes untouched.