Group by sql query on comma joined column

2019-01-19 08:44发布

问题:

My table structure is like below, "Mail" column can contain multiple email joined by comma

Data(int)

Mail(varchar(200))

[Data] [Mail]


1          m1@gmail.com,m2@hotmail.com

2          m2@hotmail.com,m3@test.com

& I need to generate the report like below, counting each row per each email

[Mail]                   [Count]


m1@gmail.com          1

m2@hotmail.com       2

m3@test.com            1

So what will be the sql(server) query to generate like above? Also I can't change the table structure.

回答1:

String splitting is faster using only CHARINDEX without XML or CTE.

Sample table

create table #tmp ([Data] int, [Mail] varchar(200))
insert #tmp SELECT 1,'m1@gmail.com,m2@hotmail.com,other, longer@test, fifth'
UNION ALL   SELECT 2,'m2@hotmail.com,m3@test.com'
UNION ALL   SELECT 3,'m3@single.com'
UNION ALL   SELECT 4,''
UNION ALL   SELECT 5,null

The query

select single, count(*) [Count]
from
(
    select ltrim(rtrim(substring(t.mail, v.number+1,
        isnull(nullif(charindex(',',t.mail,v.number+1),0)-v.number-1,200)))) single
    from #tmp t
    inner join master..spt_values v on v.type='p'
        and v.number <= len(t.Mail)
        and (substring(t.mail,v.number,1) = ',' or v.number=0)
) X
group by single

The only parts you supply are

  • #tmp: your table name
  • #mail: the column name


回答2:

A SQL Server Solution

WITH T ([Data], [Mail])
     AS (SELECT 1,'m1@gmail.com,m2@hotmail.com' UNION ALL
         SELECT 2,'m2@hotmail.com,m3@test.com')
SELECT address  AS Mail,
       COUNT(*) AS [Count]
FROM   T
       CROSS APPLY (SELECT CAST('<m>' + REPLACE([Mail], ',', '</m><m>') + '</m>'
                                AS XML
                           ) AS x) ca1
       CROSS APPLY (SELECT T.split.value('.', 'varchar(200)') AS address
                    FROM   x.nodes('/m') T(split)) ca
GROUP  BY address  


回答3:

SQL Server Using a recursive cte.

declare @Mail table (ID int, Mail varchar(200))

insert into @Mail values
(1, 'm1@gmail.com,m2@hotmail.com'),
(2, 'm2@hotmail.com,m3@test.com'),
(3, 'm2@hotmail.com')

;with cte1 as
(
  select Mail+',' as Mail
  from @Mail
),
cte2
as
(
  select
    left(Mail, charindex(',', Mail)-1) as Mail1,
    right(Mail, len(Mail)-charindex(',', Mail)) as Mail
  from cte1
  union all
  select
    left(Mail, charindex(',', Mail)-1) as Mail1,
    right(Mail, len(Mail)-charindex(',', Mail)) as Mail
  from cte2
  where charindex(',', Mail) > 1
)
select
  Mail1 as Mail,
  count(*) as [Count]
from cte2
group by Mail1

Edit 1 Same as before but handles the case where there is only one email in Mail



回答4:

The correct thing to do would be to add a related table to store multiple emails. It is virtually always a bad design decision to store things in a comma delimited list as you have found in trying to query it. This generally means you need to create a related table as you have a one-to-many relationship. The task you want to do is trivial if you have properly related tables.

I don't buy I can't change the table structure as an excuse. Unless this is a commercial product that your company doesn't own, you can change the structure, you just need to show management why it is necessary. Someone at your organization can change the database structure, find out who and convince him as to why it needs to change. If it is a commercial database, consider creating a trigger on the tble to populate a realted table that you create every time the email field is inserted updated or deleted. Then at least you only have to go through the splitting process once for each record change rather than every time the query is run.



回答5:

Very similar to Mikael's answer with minor tweaks...
- Have a field with a 'cached' LEN to avoid having to repeatedly count the length
- Use only one UNION each recursion by replacing a 0 CHARINDEX with NULL

These differences will only really show noticably for long lists, and so with several levels of recursion.


The CROSS APPLY business is just to make the SELECT more tidy, rather than repeat the NULLIF(CHARINDEX) loads of times.


WITH
  source (
    Data,
    Mail
  )
AS
(
  SELECT 1,'m1@gmail.com,m2@hotmail.com' UNION ALL
  SELECT 2,'m2@hotmail.com,m3@test.com'
)
,
  split_cte
AS
(
  SELECT
    LEFT (mail, ISNULL(comma - 1, LEN(mail)))     AS "current_mail",
    RIGHT(mail, ISNULL(LEN(mail) - comma, 0))     AS "mail_data",
    ISNULL(LEN(mail) - comma, 0)                  AS "chars"
  FROM
    source
  CROSS APPLY
    (SELECT NULLIF(CHARINDEX(',', mail), 0) AS "comma") AS search

  UNION ALL

  SELECT
    LEFT (mail_data, ISNULL(comma - 1, chars))    AS "current_mail",
    RIGHT(mail_data, ISNULL(chars - comma, 0))    AS "mail_data",
    ISNULL(chars - comma, 0)                      AS "chars"
  FROM
    split_cte
  CROSS APPLY
    (SELECT NULLIF(CHARINDEX(',', mail_data), 0) AS "comma") AS search
  WHERE
    chars > 0
)

SELECT
  current_mail     AS "Mail",
  COUNT(*)         AS "Count"
FROM
  split_cte
GROUP BY
  current_mail


回答6:

This is a supplementary answer to show the performance of the various options:

Fill a sample table with some data

create table tmp1 ([Data] int, [Mail] varchar(200))
insert tmp1 SELECT 1,'m1@gmail.com,m2@hotmail.com,other, longer@test, fifth'
UNION ALL   SELECT 2,'m2@hotmail.com,m3@test.com'
UNION ALL   SELECT 3,'m3@single.com'
UNION ALL   SELECT 4,''
UNION ALL   SELECT 5,null

insert tmp1
select data*10000 + number, mail
from tmp1, master..spt_values v
where v.type='P'

-- total rows: 10245

Test query:

set statistics io on
set statistics time on

dbcc dropcleanbuffers dbcc freeproccache

select single, count(*) [Count]
from
(
    select ltrim(rtrim(substring(t.mail, v.number+1,
        isnull(nullif(charindex(',',t.mail,v.number+1),0)-v.number-1,200)))) single
    from tmp1 t
    inner join master..spt_values v on v.type='p'
        and v.number <= len(t.Mail)
        and (substring(t.mail,v.number,1) = ',' or v.number=0)
) X
group by single

dbcc dropcleanbuffers dbcc freeproccache

;with cte1 as
(
  select Mail+',' as Mail
  from tmp1
),
cte2
as
(
  select
    left(Mail, charindex(',', Mail)-1) as Mail1,
    right(Mail, len(Mail)-charindex(',', Mail)) as Mail
  from cte1
  union all
  select
    left(Mail, charindex(',', Mail)-1) as Mail1,
    right(Mail, len(Mail)-charindex(',', Mail)) as Mail
  from cte2
  where charindex(',', Mail) > 1
)
select
  Mail1 as Mail,
  count(*) as [Count]
from cte2
group by Mail1

dbcc dropcleanbuffers dbcc freeproccache

--SET ANSI_DEFAULTS ON
--SET ANSI_NULLS ON
;
SELECT address  AS Mail,
       COUNT(*) AS [Count]
FROM   tmp1
       CROSS APPLY (SELECT CAST('<m>' + REPLACE([Mail], ',', '</m><m>') + '</m>'
                                AS XML
                           ) AS x) ca1
       CROSS APPLY (SELECT T.split.value('.', 'varchar(200)') AS address
                    FROM   x.nodes('/m') T(split)) ca
GROUP  BY address  

Statistics

Run a few time to get a feel for the averages

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'spt_values'. Scan count 8196, logical reads 26637, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tmp1'. Scan count 3, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 641 ms,  elapsed time = 412 ms.

Table 'Worktable'. Scan count 2, logical reads 103271, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tmp1'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 609 ms,  elapsed time = 614 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tmp1'. Scan count 3, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 2798 ms,  elapsed time = 1421 ms.

Table 'Worktable'. Scan count 2, logical reads 103334, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tmp1'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 734 ms,  elapsed time = 742 ms.

Summary

First (CHARINDEX) : CPU time = 344 ms, elapsed time = 198 ms.
Second (CTE) : CPU time = 594 ms, elapsed time = 613 ms.
Third (XML) : CPU time = 2812 ms, elapsed time = 1418 ms.
Fourth (CTE2) : CPU time = 719 ms, elapsed time = 750 ms.