Efficient Cleaning of Strings in a Table

2019-05-06 10:06发布

问题:

I'm currently working on a problem where certain characters need to be cleaned from strings that exist in a table. Normally I'd do a simple UPDATE with a replace, but in this case there are 32 different characters that need to be removed.

I've done some looking around and I can't find any great solutions for quickly cleaning strings that already exist in a table.

Things I've looked into:

  1. Doing a series of nested replaces

    This solution is do-able, but for 32 different replaces it would require either some ugly code, or hacky dynamic sql to build a huge series of replaces.

  2. PATINDEX and while loops

    As seen in this answer it is possible to mimic a kind of regex replace, but I'm working with a lot of data so I'm hesitant to even trust the improved solution to run in a reasonable amount of time when the data volume is large.

  3. Recursive CTEs

    I tried a CTE approuch to this problem, but it didn't run terribly fast once the number of rows got large.

For reference:

CREATE TABLE #BadChar(
    id int IDENTITY(1,1),
    badString nvarchar(10),
    replaceString nvarchar(10)

);

INSERT INTO #BadChar(badString, replaceString) SELECT 'A', '^';
INSERT INTO #BadChar(badString, replaceString) SELECT 'B', '}';
INSERT INTO #BadChar(badString, replaceString) SELECT 's', '5';
INSERT INTO #BadChar(badString, replaceString) SELECT '-', ' ';

CREATE TABLE #CleanMe(
    clean_id int IDENTITY(1,1),
    DirtyString nvarchar(20)
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO #CleanMe(DirtyString) SELECT 'AAAAA';
    INSERT INTO #CleanMe(DirtyString) SELECT 'BBBBB';
    INSERT INTO #CleanMe(DirtyString) SELECT 'AB-String-BA';
    SET @i = @i + 1
END;


WITH FixedString (Step, String, cid) AS (
    SELECT 1 AS Step, REPLACE(DirtyString, badString, replaceString), clean_id
    FROM #BadChar, #CleanMe
    WHERE id = 1

    UNION ALL

    SELECT Step + 1, REPLACE(String, badString, replaceString), cid
    FROM FixedString AS T1
    JOIN #BadChar AS T2 ON T1.step + 1 = T2.id
    Join #CleanMe AS T3 on T1.cid = t3.clean_id

)
SELECT String FROM FixedString WHERE step = (SELECT MAX(STEP) FROM FixedString);

DROP TABLE #BadChar;
DROP TABLE #CleanMe;
  1. Use a CLR

    It seems like this is a common solution many people use, but the environment I'm in doesn't make this a very easy one to embark on.

Are there any other ways to go about this I've over looked? Or any improvements upon the methods I've already looked into for this?

回答1:

Leveraging the idea from Alan Burstein's solution, you could do something like this, if you wanted to hard code the bad/replace strings. This would work for bad/replace strings longer than a single character as well.

CREATE FUNCTION [dbo].[CleanStringV1]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) WITH SCHEMABINDING AS 
BEGIN
 SELECT @string = REPLACE
  (
    @string COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
 FROM
 (VALUES
      ('A', '^')
    , ('B', '}')
    , ('s', '5')
    , ('-', ' ')
    ) t(badString, replaceString) 
 RETURN @string;
END;

Or, if you have a table containing the bad/replace strings, then

CREATE FUNCTION [dbo].[CleanStringV2]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) AS 
BEGIN
 SELECT @string = REPLACE
  (
    @string COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
 FROM BadChar
 RETURN @string;
END;

These are case sensitive. You can remove the COLLATE bit if you want case insensitive. I did a few small tests, and these were not much slower than nested REPLACE. The first one with the hardcoded strings was a the faster of the two, and was nearly as fast as nested REPLACE.