remove duplicates from comma or pipeline operator

2019-07-27 05:36发布

问题:

I have been looking into this for a while now and I cannot find a way to remove duplicate strings from a comma-separated as well as pipeline seperated string in SQL Server.

Given the string

test1,test2,test1|test2,test3|test4,test4|test4

does anyone know how would you return test1,test2,test3,test4?

回答1:

Approach

The following approach can be used to de-duplicate a delimited list of values.

  1. Use the REPLACE() function to convert different delimiters into the same delimiter.
  2. Use the REPLACE() function to inject XML closing and opening tags to create an XML fragment
  3. Use the CAST(expr AS XML) function to convert the above fragment into the XML data type
  4. Use OUTER APPLY to apply the table-valued function nodes() to split the XML fragment into its constituent XML tags. This returns each XML tag on a separate row.
  5. Extract just the value from the XML tag using the value() function, and returns the value using the specified data type.
  6. Append a comma after the above-mentioned value.
  7. Note that these values are returned on separate rows. The usage of the DISTINCT keyword now removes duplicate rows (i.e. values).
  8. Use the FOR XML PATH('') clause to concatenate the values across multiple rows into a single row.

Query

Putting the above approach in query form:

SELECT DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)') + ',' 
FROM ( 
        -- This query returns the following in theDataXml column: 
        -- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
        -- i.e. it has turned the original delimited data into an XML fragment 
        SELECT 
          DataTable.DataColumn AS DataRaw 
        , CAST( 
            '<tag>' 
            -- First replace commas with pipes to have only a single delimiter 
            -- Then replace the pipe delimiters with a closing and opening tag 
            + replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>') 
            -- Add a final set of closing tags 
            + '</tag>' 
            AS XML) AS DataXml 
        FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable 
    ) AS x 
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn) 
-- Running the query without the following line will return the data in separate rows 
-- Running the query with the following line returns the rows concatenated, i.e. it returns: 
-- test1,test2,test3,test4, 
FOR XML PATH('') 

Input & Result

Given the input:

test1,test2,test1|test2,test3|test4,test4|test4

The above query will return the result:

test1,test2,test3,test4,

Notice the trailing comma at the end. I'll leave it as an exercise to you to remove that.


EDIT: Count of Duplicates

OP requested in a comment "how do i get t5he count of duplicates as well? in a seperate column".

The simplest way would be to use the above query but remove the last line FOR XML PATH(''). Then, counting all values and distinct values returned by the SELECT expression in the above query (i.e. PivotedTable.PivotedColumn.value('.','nvarchar(max)')). The difference between the count of all values and the count of distinct values is the count of duplicate values.

SELECT 
    COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)'))            AS CountOfAllValues 
  , COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)'))   AS CountOfUniqueValues 
    -- The difference of the previous two counts is the number of duplicate values 
  , COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)')) 
    - COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)')) AS CountOfDuplicateValues 
FROM ( 
        -- This query returns the following in theDataXml column: 
        -- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
        -- i.e. it has turned the original delimited data into an XML fragment 
        SELECT 
          DataTable.DataColumn AS DataRaw 
        , CAST( 
            '<tag>' 
            -- First replace commas with pipes to have only a single delimiter 
            -- Then replace the pipe delimiters with a closing and opening tag 
            + replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>') 
            -- Add a final set of closing tags 
            + '</tag>' 
            AS XML) AS DataXml 
        FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable 
    ) AS x 
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn) 

For the same input shown above, the output of this query is:

CountOfAllValues CountOfUniqueValues CountOfDuplicateValues
---------------- ------------------- ----------------------
8                4                   4


回答2:

Solution to your problem is as given below :

DECLARE @Data_String AS VARCHAR(1000), @Result as varchar(1000)=''
SET @Data_String = 'test1,test2,test1|test2,test3|test4,test4|test4'
SET @Data_String = REPLACE(@Data_String,'|',',')
SELECT @Result=@Result+col+',' from(
SELECT DISTINCT t.c.value('.','varchar(100)') col from(
SELECT cast('<A>'+replace(@Data_String,',','</A><A>')+'</A>' as     xml)col1)data cross apply col1.nodes('/A') as t(c))Data
SELECT LEFT(@Result,LEN(@Result)-1)

Result

test1,test2,test3,test4


回答3:

    DECLARE @string AS VARCHAR(1000) 
    SET @string = 'test1,test2,test1|test2,test3|test4,test4|test4'
    SET @string = REPLACE(@string,'|',',')
    DECLARE @t TABLE (val VARCHAR(MAX)) 

    DECLARE @xml XML
    SET @xml = N'<root><r>' + REPLACE(@string, ',', '</r><r>') +         '</r></root>'
    INSERT INTO @t(val) SELECT r.value('.','VARCHAR(MAX)') as Item FROM         @xml.nodes('//root/r') AS RECORDS(r)
    ;WITH cte
    AS (SELECT ROW_NUMBER() OVER (PARTITION BY val ORDER BY val desc) RN
    FROM  @t)
    DELETE FROM cte
    WHERE  RN > 1


回答4:

Try Following SQL Script :

declare @List nvarchar(max)='test1,test2,test1|test2,test3|test4,test4|test4';
declare @Delimiter CHAR(1) =','
declare @XML AS XML
declare @result varchar(max)
set @List=Replace(@List,'|',',')
--Select @List

SET @XML = CAST(('<X>'+REPLACE(@List,@Delimiter ,'</X><X>')+'</X>') AS XML)
DECLARE @temp TABLE (Data nvarchar(100))
INSERT INTO @temp
SELECT N.value('.', 'nvarchar(100)') AS Data FROM @XML.nodes('X') AS T(N)
--SELECT distinct * FROM @temp

IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp
Select distinct Data into #temp from @temp

SET @result = ''
select @result = @result + Data + ', ' from #temp
select SUBSTRING(@result, 0, LEN(@result))


回答5:

I just tried following script working perfectly :

declare @List VARCHAR(MAX)='test1,test2,test1|test2,test3|test4,test4|test4'
declare @Delim CHAR=','
DECLARE @ParsedList TABLE
(
Item VARCHAR(MAX)
)
DECLARE @list1 VARCHAR(MAX), @Pos INT, @rList VARCHAR(MAX)
set @List=Replace(@List,'|',',')
SET @list = LTRIM(RTRIM(@list)) + @Delim
SET @pos = CHARINDEX(@delim, @list, 1)
WHILE @pos > 0
BEGIN
SET @list1 = LTRIM(RTRIM(LEFT(@list, @pos - 1)))
IF @list1 <> ''
INSERT INTO @ParsedList VALUES (CAST(@list1 AS VARCHAR(MAX)))
SET @list = SUBSTRING(@list, @pos+1, LEN(@list))
SET @pos = CHARINDEX(@delim, @list, 1)
END
SELECT @rlist = COALESCE(@rlist+',','') + item
FROM (SELECT DISTINCT Item FROM @ParsedList) t
Select @rlist