Fix html encoded text stored in the database

2019-09-17 07:12发布

问题:

I have a sql server db that has a table which stores a plain text value in a nvarchar column. Unfortunately there was a bug in the C# code that was running Encoder.HtmlEncode() on chinese characters before inserting it into the table . e.g text value of 您好 is being stored in the table as 您好

Is there any way I clean up this data using just T-sql? This database is heavily locked down, so I can't easily run any code against it other than T-sql.

回答1:

From what the problem seems to be, you have an option.

You could create a temp table that will store the HTML entity of the characters. As an example;

CREATE TABLE dbo.TempHost
{
    Entity varchar(255),
    Character nvarchar(255)

}

Then you can actually find the data as csv online (http://www.khngai.com/chinese/charmap/tbluni.php?page=0 or copy and paste to excel), and import it into the table. From there on, all you will need to do is to scan the data and call REPLACE() function and update.



回答2:

This is a fun challenge, and by fun I mean not really fun. T-SQL is quite bad at string manipulation. To make it even better, HTML entities actually encode a Unicode code point, and there is no simple way of converting that to a Unicode character in T-SQL.

Using a lookup table is probably the most viable method, in that it's likely to be more efficient than what I'm going to propose here: use a function to do the entity replacement. Warning: scalar-valued functions perform horribly in T-SQL and string manipulation is none too fast either. Nevertheless, I present this for, um, inspirational purposes:

CREATE FUNCTION dbo._ConvertEntities(@in NVARCHAR(MAX)) RETURNS NVARCHAR(MAX) AS BEGIN
  WHILE 1 = 1 BEGIN;
    DECLARE @entityStart INT = CHARINDEX('&#x', @in);
    IF @entityStart = 0 BREAK;
    DECLARE @entityEnd INT = CHARINDEX(';', @in, @entityStart)
    DECLARE @entity VARCHAR(MAX) = SUBSTRING(@in, @entityStart + LEN('&#x'), @entityEnd - @entityStart - LEN('&#x'));
    IF @entity NOT LIKE '[0-9A-F][0-9A-F][0-9A-F][0-9A-F]' RETURN @in;
    DECLARE @entityChar NCHAR(1) = CONVERT(NCHAR(1), CONVERT(BINARY(2), REVERSE(CONVERT(BINARY(2), @entity, 2))));
    SET @in = STUFF(@in, @entityStart, @entityEnd - @entityStart + 1, @entityChar);
  END;
  RETURN @in;
END;

Aside from performance issues, this function has the major shortcoming that it only works for entities of the form &#x????;, with ???? four hexadecimal digits. It fails quite badly for other entities (like those needing surrogates, those coded as decimal, or special entities like "). I've made it bail out in this case. Although it's fairly easy to extend it to handle single-byte entities, extending it to >4 would be agony.

Realistically, you want to do this in client software using a real programming language. Even if the database is sufficiently locked down that you cannot directly execute queries, you are presumably able to query data if it's not too much, and you can insert data back using generated statements (a lot of them if need be). Terribly slow, but more or less viable.

For completeness, I also mention the option of running CLR code in SQL Server using CLR integration. This requires that the server already allows this or that you can reconfigure it to allow it (improbable if it's "heavily locked down"). The main reason this would be attractive is because it's definitely easier and faster to decode the entities in CLR code, and using CLR integration means you're not using client code (so the data doesn't leave the server). On the other hand, since you need administrative access to the machine to deploy the assembly, this would seem to be a theoretical advantage at best. As far as performance goes, though, it probably can't be beat.



回答3:

You could take advantage of the fact the characters are being stored all start with "&#x" and are eight characters long. You could loop through the table updating cutting out the bad characters using something like the example below.

DECLARE @str VARCHAR(100)
SET @str = 'Hello 頶頴World'

DECLARE @pos int SELECT @pos = CHARINDEX('&#x', @str)

WHILE @pos > 0 
BEGIN
    SET @str = LEFT(@str, @pos -1) + RIGHT(@str, LEN(@str) -@pos - 8)
    SELECT @pos = CHARINDEX('&#x', @str) 
END 

SELECT @str


回答4:

HTML encoding is not the same as XML encoding, but thanks to this question, I've realized there is an embarrassingly simple way of achieving this:

SELECT
  REPLACE(
    CONVERT(NVARCHAR(MAX), 
      CONVERT(XML, 
        REPLACE(REPLACE(_column_, '<', '&lt;'), '"', '&quot;')
      )
    ),
    '&lt;', '<'
  )

Stick this in an UPDATE and you're done. Well, almost -- if the code contains non-XML escaped entities like &eacute;, you'd need to replace these separately. Also, we do need to dance around the issue of XML escaping (hence the &lt; replacing in case there's a < somewhere).

It may still need some refinement, but this sure looks a lot more promising than a scalar-valued function. :-)