What's the best way to identify hidden charact

Posted 2019-01-17 06:55

Question:

When trying to identify erroneous data (often needing manual review and removal), I'd like an easy way of seeing hidden characters, such as TAB, Space, Carriage return and Line feed. Is there a built-in way for this?

In a similar question here on Stack Overflow, regarding Oracle, a DUMP(fieldname) function was suggested, but I don't know if that would make things easier even if a corresponding function existed in SQL Server, since I need to see the characters in their context.

The best idea I could come up with was replacing the expected hidden characters with visible ones, like this:

SELECT REPLACE(REPLACE(REPLACE(REPLACE(myfield, ' ', '˙'), CHAR(13), '[CR]'), CHAR(10), '[LF]'), CHAR(9), '[TAB]') FROM mytable

Is there a better way? I don't like this approach because there might be other, less common hidden characters that I have not accounted for, such as vertical TAB. Turning on "show hidden characters", as you can in almost any text editor, would be a nice feature in SQL Server Query Analyzer, so I half expect that it can be done somehow in SQL Server as well, or at least that someone has a better idea than mine for showing this kind of white space information.

I just noticed that there is a built-in way to see "white space", not in SQL Query Analyzer, but in the part of the interface that once was the SQL Enterprise manager. Right-click a table in SQL Management Studio Object Explorer tree, and select "Edit top 200 rows". In the result white space (at least CR LF) is visible as empty squares.

Answer 1:

You could create a function that addresses all the whitespace possibilities and enable only those that seem appropriate:

SELECT dbo.ShowWhiteSpace(myfield) from mytable

Uncomment only those whitespace cases you want to test for:


CREATE FUNCTION dbo.ShowWhiteSpace (@str varchar(8000))
RETURNS varchar(8000)
AS
BEGIN
     DECLARE @ShowWhiteSpace varchar(8000);
     SET @ShowWhiteSpace = @str
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(32), '[?]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(13), '[CR]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(10), '[LF]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(9),  '[TAB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(1),  '[SOH]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(2),  '[STX]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(3),  '[ETX]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(4),  '[EOT]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(5),  '[ENQ]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(6),  '[ACK]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(7),  '[BEL]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(8),  '[BS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(11), '[VT]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(12), '[FF]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(14), '[SO]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(15), '[SI]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(16), '[DLE]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(17), '[DC1]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(18), '[DC2]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(19), '[DC3]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(20), '[DC4]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(21), '[NAK]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(22), '[SYN]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(23), '[ETB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(24), '[CAN]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(25), '[EM]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(26), '[SUB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(27), '[ESC]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(28), '[FS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(29), '[GS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(30), '[RS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(31), '[US]')
     RETURN(@ShowWhiteSpace)
END
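
As a quick sanity check, the function can be called on a literal string. This is only a sketch: it assumes dbo.ShowWhiteSpace has been created exactly as written above, with just the four uncommented replacements active, and the sample string is my own.

```sql
-- Assumes dbo.ShowWhiteSpace exists exactly as defined above.
-- The sample value contains a TAB, then a CR LF pair, and no spaces.
SELECT dbo.ShowWhiteSpace('a' + CHAR(9) + 'b' + CHAR(13) + CHAR(10)) AS Marked;
-- Expected result: a[TAB]b[CR][LF]
```

Because every marker is longer than the character it replaces, a value close to 8000 characters may be truncated by the varchar(8000) return type.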


Answer 2:

To find them, you can use this

;WITH cte AS
(
   SELECT 0 AS CharCode
   UNION ALL
   SELECT CharCode + 1 FROM cte WHERE CharCode <31
)
SELECT
   *
FROM
   mytable T
     cross join cte
WHERE
   EXISTS (SELECT *
        FROM mytable Tx
        WHERE Tx.PKCol = T.PKCol
             AND
              Tx.MyField LIKE '%' + CHAR(cte.CharCode) + '%'
         )

Replacing the EXISTS with a JOIN will allow you to REPLACE them, but you'll get multiple rows... I can't think of a way around that...
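
If the goal is only to see which control character occurs in which row, and where, a variation on the query above may be enough. The sketch below is an assumption-laden illustration, not the answer's method: the table variable @mytable and its sample rows are hypothetical stand-ins for mytable, the multi-row VALUES syntax needs SQL Server 2008 or later, and the range starts at 1 instead of 0 because NUL can behave unexpectedly in string searches depending on the collation.

```sql
-- Hypothetical sample data standing in for mytable (PKCol, MyField).
DECLARE @mytable TABLE (PKCol int PRIMARY KEY, MyField varchar(100));
INSERT INTO @mytable (PKCol, MyField)
VALUES (1, 'clean'),
       (2, 'a' + CHAR(9) + 'b'),
       (3, 'x' + CHAR(13) + CHAR(10));

-- Generate the control codes 1..31, as in the query above (minus NUL).
;WITH cte AS
(
   SELECT 1 AS CharCode
   UNION ALL
   SELECT CharCode + 1 FROM cte WHERE CharCode < 31
)
SELECT T.PKCol,
       cte.CharCode,
       -- Position of the first occurrence of this control character.
       CHARINDEX(CHAR(cte.CharCode), T.MyField) AS FirstPosition
FROM @mytable T
     CROSS JOIN cte
WHERE CHARINDEX(CHAR(cte.CharCode), T.MyField) > 0
ORDER BY T.PKCol, FirstPosition;
```

With this sample data, row 1 is not returned, row 2 is reported once for code 9 at position 2, and row 3 is reported for code 13 at position 2 and code 10 at position 3. There is still one output row per (row, character) pair, so the multiple-rows limitation remains.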



Answer 3:

The way I did it was by selecting all of the data:

select * from myTable

and then right-clicking on the result set and choosing "Save results as..." to save it as a CSV file.

When I opened the CSV file in Notepad++, I could see the LF characters that were not visible in the SQL Server result set.



Answer 4:

You can always use the DATALENGTH function to determine whether you have extra trailing spaces in text fields. This won't make the characters visible, but it will show you which values have them, because LEN ignores trailing spaces while DATALENGTH counts every byte.

    SELECT DATALENGTH('MyTextData ') AS BinaryLength, LEN('MyTextData ') AS TextLength

This will produce 11 for BinaryLength and 10 for TextLength.

Against a table, your SQL would look like this:

    SELECT * 
    FROM tblA
    WHERE DATALENGTH(MyTextField) > LEN(MyTextField)

This function is usable in all versions of SQL Server beginning with 2005.
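
The limits of this check can be seen with a few literals. This is a minimal sketch: the tab and N'...' examples are my own additions, and the byte counts for the varchar rows assume a single-byte collation.

```sql
-- LEN ignores trailing spaces; DATALENGTH counts bytes.
SELECT DATALENGTH('MyTextData ')          AS BinaryLength,  -- 11
       LEN('MyTextData ')                 AS TextLength;    -- 10

-- A trailing TAB is counted by LEN as well, so the comparison does not flag it.
SELECT DATALENGTH('MyTextData' + CHAR(9)) AS BinaryLength,  -- 11
       LEN('MyTextData' + CHAR(9))        AS TextLength;    -- 11

-- nvarchar uses 2 bytes per character here, so DATALENGTH is doubled.
SELECT DATALENGTH(N'MyTextData ')         AS BinaryLength,  -- 22
       LEN(N'MyTextData ')                AS TextLength;    -- 10
```

For an nvarchar column, the WHERE clause would therefore probably need DATALENGTH(MyTextField) / 2 > LEN(MyTextField), otherwise every non-empty row matches.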



Answer 5:

select myfield, CAST(myfield as varbinary(max)) ...
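
A minimal sketch of what the cast shows, using a literal of my own instead of a real column and assuming varchar data in a single-byte collation. Each byte appears as two hex digits, so 09 is TAB, 0D is CR and 0A is LF.

```sql
-- The literal below is a made-up sample value: a, TAB, b, CR, LF.
SELECT CAST('a' + CHAR(9) + 'b' + CHAR(13) + CHAR(10) AS varbinary(max)) AS Bytes;
-- Result: 0x6109620D0A
```

This exposes every byte, including ones no REPLACE list anticipated, at the cost of having to read hex next to the original text.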


Answer 6:

I faced the same problem with a character that I never managed to match with a WHERE clause: CHARINDEX, LIKE, REPLACE, etc. did not work. So I used a brute-force solution, which is awful and heavy, but it works:

Step 1: Make a copy of the complete data set. Keep track of the original names with a source_id referencing the PK of the source table (and keep this source_id in all the subsequent tables).

Step 2: LTRIM and RTRIM the data, and replace all double spaces, tabs, etc. (basically everything from CHAR(1) to CHAR(32)) with one space. Lowercase the whole set as well.

Step 3: Replace all the special characters that you know (get the list of all the quotes, double quotes, etc.) with something from a-z (I suggest z). Basically, replace everything that is not a standard English character with a z (using nested REPLACE calls in a loop).

Step 4: Split by word into a second copy, where each word is in a separate row. The split is a SUBSTRING based on the position of the space characters. At this point, we will miss the words that contain a hidden space we did not catch earlier.

Step 5: Split each word into a third copy, where each letter is in a separate row (I know it makes a very large table). Keep track of the CHARINDEX of each letter in a separate column.

Step 6: Select everything in the above table which is not LIKE [a-z]. This is the list of the unidentified characters we want to exclude.

From the output of step 6 we have enough data to build a series of substrings of the source that select everything except the unknown characters we want to exclude.

Note 1: there are smart ways to optimize this, depending on the size of the original expression (steps 4, 5 and 6 can be made in one go).

Note 2: this is not very fast, but it is the fastest way I know to get this done for a large data set, because the split of lines into words and words into letters is done with SUBSTRING, which slices the whole table into one-character slices. However, it is quite heavy to build. With a smaller set, it may be enough to parse each record one by one and search for any character that is not in a list of all English characters plus all special characters.
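
Steps 5 and 6 can be sketched for a single value as follows. This is only an illustration under my own assumptions: the variable @s, its sample contents, and the recursive CTE used as a number generator are not from the answer, and the DECLARE-with-initializer syntax needs SQL Server 2008 or later. Whether '[a-z]' also matches uppercase or accented letters depends on the collation.

```sql
-- Hypothetical sample value: a, b, TAB, c, vertical TAB, d.
DECLARE @s varchar(100) = 'ab' + CHAR(9) + 'c' + CHAR(11) + 'd';

-- One row per character position (step 5).
-- DATALENGTH is used instead of LEN so trailing spaces are not skipped.
WITH n AS
(
    SELECT 1 AS pos
    UNION ALL
    SELECT pos + 1 FROM n WHERE pos < DATALENGTH(@s)
)
SELECT pos,
       SUBSTRING(@s, pos, 1)        AS ch,
       ASCII(SUBSTRING(@s, pos, 1)) AS code
FROM n
-- Keep only the characters that are not plain letters (step 6).
WHERE SUBSTRING(@s, pos, 1) NOT LIKE '[a-z]'
-- Lift the default recursion limit of 100 for longer values.
OPTION (MAXRECURSION 0);
```

For this sample value the query returns positions 3 and 5, with codes 9 and 11. The code column identifies the unknown character even when it cannot be typed or displayed.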