How to summarize all possible combinations of vari

Posted 2020-06-06 08:11

Question:

I am trying to summarize the count based on all possible combinations of variables. Here is some example data:

Answer 1:

For this sort of query, using some of the built-in aggregate tools is quite straightforward.

First, set up some sample data based on your sample image:

declare @Table1 as table
    ([id] int, [a] int, [b] int, [c] int)
;

INSERT INTO @Table1
    ([id], [a], [b], [c])
VALUES
    (10001, 1, 3, 3),
    (10002, 0, 0, 0),
    (10003, 3, 6, 0),
    (10004, 7, 0, 0),
    (10005, 0, 0, 0)
;

Since you want the count of IDs for each possible combination of non-zero attributes A, B, and C, the first step is to eliminate the zeros and convert the non-zero values to a single value we can summarize on; in this case I'll use the attribute's name. After that it's a simple matter of performing the aggregate, using the CUBE clause in the GROUP BY to generate the combinations. Lastly, the HAVING clause prunes the unwanted summations: mostly that means dropping rows where an attribute is NULL in the data rather than as a result of grouping, and optionally removing the grand summary (the count of all rows).

with t1 as (
select case a when 0 then null else 'a' end a
     , case b when 0 then null else 'b' end b
     , case c when 0 then null else 'c' end c
     , id
  from @Table1
)
select a, b, c, count(id) cnt
  from t1
  group by cube(a,b,c)
  having (a is not null or grouping(a) = 1) -- For each attribute
     and (b is not null or grouping(b) = 1) -- only allow nulls as
     and (c is not null or grouping(c) = 1) -- a result of grouping.
     and grouping_id(a,b,c) <> 7  -- exclude the grand total
  order by grouping_id(a,b,c);

Here are the results:

    a       b       c       cnt
1   a       b       c       1
2   a       b       NULL    2
3   a       NULL    c       1
4   a       NULL    NULL    3
5   NULL    b       c       1
6   NULL    b       NULL    2
7   NULL    NULL    c       1
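
As a cross-check on the table above, the same counts can be reproduced outside the database. The following is a minimal Python sketch, not part of the original answer; the `rows` dictionary and the `count_nonzero_combinations` helper are names made up for illustration. For every non-empty subset of attributes it counts the ids whose values are non-zero in all of them, which is what the CUBE query reports once the HAVING filter has been applied.

```python
from itertools import combinations

# Sample data copied from the INSERT statement above: id -> (a, b, c).
rows = {
    10001: (1, 3, 3),
    10002: (0, 0, 0),
    10003: (3, 6, 0),
    10004: (7, 0, 0),
    10005: (0, 0, 0),
}
names = ("a", "b", "c")


def count_nonzero_combinations(rows, names):
    """Count ids whose values are non-zero for every attribute in each subset."""
    counts = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(range(len(names)), size):
            label = ",".join(names[i] for i in subset)
            counts[label] = sum(
                1
                for values in rows.values()
                if all(values[i] != 0 for i in subset)
            )
    return counts


counts = count_nonzero_combinations(rows, names)
for label, cnt in counts.items():
    print(label, cnt)
```

The subsets come out ordered by size rather than by grouping_id, so the rows appear in a different order than in the SQL result, but the counts should agree.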

And finally my original rextester link: http://rextester.com/YRJ10544

@lad2025 Here's a dynamic version (sorry, my SQL Server skills aren't as strong as my Oracle skills, but it works). Just set the correct values for @table and @col, and it should work as long as all other columns are numeric attributes:

declare @sql varchar(max), @table varchar(30), @col varchar(30);
set @table = 'Table1';
set @col = 'id';
with x(object_id, column_id, name, names, proj, pred, max_col, cnt) 
  as (
    select object_id, column_id, name, cast(name as varchar(max))
     , cast('case '+name+' when 0 then null else '''+name+''' end '+name as varchar(4000))
     , cast('('+name+' is not null or grouping('+name+') = 1)' as varchar(4000))
     , (select max(column_id) from sys.columns m where m.object_id = c.object_id and m.name <> @col)
     , 1
     from sys.columns c
    where object_id = OBJECT_ID(@Table)
      and column_id = (select min(column_id) from sys.columns m where m.object_id = c.object_id and m.name <> @col)
    union all
    select x.object_id, c.column_id, c.name, cast(x.names+', '+c.name as varchar(max))
     , cast(proj+char(13)+char(10)+'     , case '+c.name+' when 0 then null else '''+c.name+''' end '+c.name as varchar(4000))
     , cast(pred+char(13)+char(10)+'   and ('+c.name+' is not null or grouping('+c.name+') = 1)' as varchar(4000))
     , max_col
     , cnt+1
      from x join sys.columns c on c.object_id = x.object_id and c.column_id = x.column_id+1
)
select @sql='with t1 as (
select '+proj+'
     , '+@col+'
  from '+@Table+'
)
select '+names+'
     , count('+@col+') cnt 
  from t1
 group by cube('+names+')
having '+pred+'
   and grouping_id('+names+') <> '+cast(power(2,cnt)-1 as varchar(10))+'
 order by grouping_id('+names+');'
  from x where column_id = max_col;

select @sql sql;
exec (@sql);
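
The constant `power(2,cnt)-1` in the generated HAVING clause is the GROUPING_ID value of the grand-total row, where every column has been aggregated away. A small Python sketch of that bit arithmetic follows; the `grouping_id` function here is an illustrative stand-in written for this note, not a SQL Server API, and it assumes the usual layout in which the first argument is the most significant bit.

```python
def grouping_id(columns, grouped_out):
    """Mimic GROUPING_ID: one bit per column, first column most significant.

    A bit is 1 when the column is aggregated away (not part of the grouping).
    """
    value = 0
    for column in columns:
        value = (value << 1) | (1 if column in grouped_out else 0)
    return value


columns = ["a", "b", "c"]

# Every column aggregated away: this is the grand-total row.
grand_total = grouping_id(columns, set(columns))
print(grand_total)            # same value as the next line
print(2 ** len(columns) - 1)  # what power(2,cnt)-1 evaluates to for 3 columns

# Only c aggregated away: the (a, b, NULL) row of the static example.
print(grouping_id(columns, {"c"}))
```

With a fourth attribute column the excluded value would be 15 instead of 7, which is why the dynamic version computes it from the column count.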

Rextester



Answer 2:

Poshan:

As Robert stated, SUMMARY can be used to count combinations. A second SUMMARY can count the computed types. One difficulty is ignoring the combinations that involve a zero value. If the zeros can be converted to missing values, the processing is much cleaner. Presuming the zeros have been converted to missing, this code counts the distinct combinations:

proc summary noprint data=have;
  class v2-v4 s1;
  output out=counts_eachCombo;
run;

proc summary noprint data=counts_eachCombo(rename=_type_=combo_type);
  class combo_type;
  output out=counts_eachClassType;
run;

You can see how the use of a CLASS variable in a combination determines the _TYPE_ value, and the class variables can be of mixed type (numeric, character).
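
To make the relationship between _TYPE_ and the class variables concrete, here is a rough Python sketch that decodes a _TYPE_ value back into the variables taking part in that combination. It assumes the usual layout in which the rightmost CLASS variable is the lowest-order bit; `class_vars_for_type` is a hypothetical helper written for this illustration, not a SAS function.

```python
def class_vars_for_type(type_value, class_vars):
    """Return the CLASS variables that are active for a given _TYPE_ value.

    Assumes the last CLASS variable maps to bit 1, the one before it to
    bit 2, and so on.
    """
    n = len(class_vars)
    return [
        name
        for position, name in enumerate(class_vars)
        if type_value & (1 << (n - 1 - position))
    ]


# Same order as the statement: class v2-v4 s1;
class_vars = ["v2", "v3", "v4", "s1"]
for type_value in range(1, 2 ** len(class_vars)):
    print(type_value, class_vars_for_type(type_value, class_vars))
```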

A different 'home-grown' approach that does not use SUMMARY can use a DATA step with LEXCOMB to compute each combination, and PROC SQL with SELECT ... INTO ... SEPARATED BY to generate a SQL statement that counts each combination distinctly.

Note: The following code contains macro varListEval for resolving a SAS variable list to individual variable names.

%macro makeHave(n=,m=,maxval=&m*4,prob0=0.25);

  data have;
    do id = 1 to &n;
      array v v1-v&m;
      do over v;
        if ranuni(123) < &prob0 then v = 0; else v = ceil(&maxval*ranuni(123));
      end;
      s1 = byte(65+5*ranuni(123));
      output;
    end;
  run;

%mend;

%makeHave (n=100,m=5,maxval=15)

%macro varListEval (data=, var=);
  %* resolve a SAS variable list to individual variable names;
  %local dsid dsid2 i name num;
  %let dsid = %sysfunc(open(&data));
  %if &dsid %then %do;
    %let dsid2 = %sysfunc(open(&data(keep=&var)));
    %if &dsid2 %then %do;
      %do i = 1 %to %sysfunc(attrn(&dsid,nvar));
        %let name = %sysfunc(varname(&dsid,&i));
        %let num = %sysfunc(varnum(&dsid2,&name));
        %if &num %then "&NAME";
      %end;
      %let dsid2 = %sysfunc(close(&dsid2));
    %end;
    %let dsid = %sysfunc(close(&dsid));
  %end;
  %else
    %put %sysfunc(sysmsg());
%mend;

%macro combosUCounts(data=, var=);
  %local vars n;
  %let vars = %varListEval(data=&data, var=&var);

  %let n = %eval(1 + %sysfunc(count(&vars,%str(" "))));

  * compute combination selectors and criteria;
  data combos;
    array _names (&n) $32 (&vars);
    array _combos (&n) $32;
    array _comboCriterias (&n) $200;

    length _selector $32000;
    length _criteria $32000;

    if 0 then set &data; %* prep PDV for vname;

    do _k = 1 to &n;
      do _j = 1 to comb(&n,_k);
        _rc = lexcomb(_j,_k, of _names[*]);
        do _p = 1 to _k;
          _combos(_p) = _names(_p);
          if vtypex(_names(_p)) = 'C' 
            then _comboCriterias(_p) = trim(_names(_p)) || " is not null and " || trim(_names(_p)) || " ne ''";
            else _comboCriterias(_p) = trim(_names(_p)) || " is not null and " || trim(_names(_p)) || " ne 0";
        end;
        _selector = catx(",", of _combos:);
        _criteria = catx(" and ", of _comboCriterias:);
        output;
      end;
    end;

    stop;
  run;

  %local union;

  proc sql noprint;
    * generate SQL statement that uses combination selectors and criteria;
    select "select "
    || quote(trim(_selector))
    || " as combo" 
    || ", "
    || "count(*) as uCount from (select distinct "
    || trim(_selector)
    || " from &data where "
    || trim(_criteria)
    || ")"
    into :union separated by " UNION "
    from combos
    ;

    * perform the generated SQL statement;
    create table comboCounts as
    &union;

    /* %put union=%superq(union); */
  quit;
%mend;

options mprint nosymbolgen;
%combosUCounts(data=have, var=v2-v4);
%combosUCounts(data=have, var=v2-v4 s1);

%put NOTE: Done;
/*
data _null_;
put %varListEval(data=have, var=v2-v4) ;
run;
*/
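
The macro above works by generating SQL text: one SELECT per combination, joined with UNION. For readers who do not use SAS, the same idea can be sketched in Python. This is only an illustration of the string-building step; `build_union_query` and its parameters are hypothetical names, and the generated text uses single-quoted literals rather than the output of the SAS `quote` function.

```python
from itertools import combinations


def build_union_query(table, numeric_vars, char_vars=()):
    """Build one SELECT per variable combination and join them with UNION."""
    all_vars = list(numeric_vars) + list(char_vars)
    selects = []
    for size in range(1, len(all_vars) + 1):
        for combo in combinations(all_vars, size):
            selector = ",".join(combo)
            # Character variables are compared to '', numeric ones to 0.
            criteria = " and ".join(
                f"{name} is not null and {name} ne ''"
                if name in char_vars
                else f"{name} is not null and {name} ne 0"
                for name in combo
            )
            selects.append(
                f"select '{selector}' as combo, count(*) as uCount "
                f"from (select distinct {selector} from {table} where {criteria})"
            )
    return " UNION ".join(selects)


query = build_union_query("have", ["v2", "v3", "v4"])
print(query)
```

Three variables give seven SELECT statements; adding `s1` as a character variable would give fifteen.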


Answer 3:

Naive approach SQL Server version (I've assumed that we always have 3 columns so there will be 2^3-1 rows):

SELECT 'A' AS combination, COUNT(DISTINCT CASE WHEN a > 0 THEN a ELSE NULL END) AS cnt FROM t
UNION ALL 
SELECT 'B', COUNT(DISTINCT CASE WHEN b > 0 THEN b ELSE NULL END) FROM t
UNION ALL 
SELECT 'C', COUNT(DISTINCT CASE WHEN c > 0 THEN c ELSE NULL END) FROM t
UNION ALL
SELECT 'A,B', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                     + ',' + CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END) FROM t
UNION ALL
SELECT 'A,C', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                     + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END) FROM t
UNION ALL
SELECT 'B,C', COUNT(DISTINCT CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END 
                     + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END) FROM t
UNION ALL
SELECT 'A,B,C', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                     + ',' + CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END
                     + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END ) FROM t
ORDER BY combination 

Rextester Demo


EDIT:

Same as above but more concise:

WITH cte AS (
    SELECT ID
          ,CAST(NULLIF(a,0) AS VARCHAR(10)) a
          ,CAST(NULLIF(b,0) AS VARCHAR(10)) b
          ,CAST(NULLIF(c,0) AS VARCHAR(10)) c 
    FROM t
)
SELECT 'A' AS combination, COUNT(DISTINCT a) AS cnt FROM cte UNION ALL 
SELECT 'B', COUNT(DISTINCT b) FROM cte UNION ALL 
SELECT 'C', COUNT(DISTINCT c) FROM cte UNION ALL
SELECT 'A,B', COUNT(DISTINCT a + ',' + b) FROM cte UNION ALL
SELECT 'A,C', COUNT(DISTINCT a + ',' + c) FROM cte UNION ALL
SELECT 'B,C', COUNT(DISTINCT b + ',' + c) FROM cte UNION ALL
SELECT 'A,B,C', COUNT(DISTINCT a + ',' + b + ',' + c ) FROM cte ;

Rextester Demo
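
The concise version leans on two SQL behaviours: concatenating a string with NULL yields NULL (under SQL Server's default CONCAT_NULL_YIELDS_NULL setting), and COUNT(DISTINCT ...) skips NULLs. The Python sketch below imitates both with `None`; the helper names `nullif_zero`, `concat` and `count_distinct` are invented for this illustration, and the rows are the sample data used earlier on this page.

```python
# (a, b, c) values from the sample table.
rows = [(1, 3, 3), (0, 0, 0), (3, 6, 0), (7, 0, 0), (0, 0, 0)]


def nullif_zero(value):
    """Like CAST(NULLIF(x, 0) AS VARCHAR): zero becomes None, else a string."""
    return None if value == 0 else str(value)


def concat(*parts):
    """Like a + ',' + b in T-SQL: any None operand makes the result None."""
    return None if any(p is None for p in parts) else ",".join(parts)


def count_distinct(values):
    """Like COUNT(DISTINCT expr): None values are ignored."""
    return len({v for v in values if v is not None})


cnt_a = count_distinct(nullif_zero(a) for a, b, c in rows)
cnt_ab = count_distinct(concat(nullif_zero(a), nullif_zero(b)) for a, b, c in rows)
cnt_abc = count_distinct(
    concat(nullif_zero(a), nullif_zero(b), nullif_zero(c)) for a, b, c in rows
)
print(cnt_a, cnt_ab, cnt_abc)
```

Note that this counts distinct value combinations, so two ids that share the same non-zero values would be counted once.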


EDIT 2

Using UNPIVOT:

WITH cte AS (SELECT ID
               ,CAST(IIF(a!=0,1,NULL) AS VARCHAR(10)) a
               ,CAST(IIF(b!=0,1,NULL) AS VARCHAR(10)) b
               ,CAST(IIF(c!=0,1,NULL) AS VARCHAR(10)) c 
            FROM t)
SELECT combination, [count]
FROM (SELECT  a=COUNT(a), b=COUNT(b), c=COUNT(c)
           , ab=COUNT(a+b), ac=COUNT(a+c), bc=COUNT(b+c), abc=COUNT(a+b+c)
      FROM cte) s
UNPIVOT ([count] FOR combination IN (a,b,c,ab,ac,bc,abc))AS unpvt

Rextester Demo


EDIT FINAL APPROACH

I appreciate your approach. I have more than 3 variables in my actual dataset; do you think we can generate all possible combinations programmatically rather than hard-coding them? Maybe your second approach will cover that:

SQL is a bit clumsy for this kind of operation, but I want to show that it is possible.

CREATE TABLE t(id INT, a INT, b INT, c INT);

INSERT INTO t
SELECT 10001,1,3,3 UNION
SELECT 10002,0,0,0 UNION
SELECT 10003,3,6,0 UNION
SELECT 10004,7,0,0 UNION
SELECT 10005,0,0,0;

DECLARE @Sample AS TABLE 
(
    item_id     tinyint IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
    item        nvarchar(500) NOT NULL,
    bit_value   AS  CONVERT ( integer, POWER(2, item_id - 1) )
                PERSISTED UNIQUE CLUSTERED
);    

INSERT INTO @Sample
SELECT name
FROM sys.columns
WHERE object_id = OBJECT_ID('t')
  AND name != 'id';

DECLARE @max integer = POWER(2, ( SELECT COUNT(*) FROM @Sample AS s)) - 1;
DECLARE @cols NVARCHAR(MAX);
DECLARE @cols_casted NVARCHAR(MAX);
DECLARE @cols_count NVARCHAR(MAX);


;WITH
  Pass0 as (select 1 as C union all select 1), --2 rows
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows
  Tally as (select row_number() over(order by C) as n from Pass4)
, cte AS (SELECT
    combination =
        STUFF
        (
            (
                SELECT ',' + s.item 
                FROM @Sample AS s
                WHERE
                    n.n & s.bit_value = s.bit_value
                ORDER BY
                    s.bit_value
                FOR XML 
                    PATH (''),
                    TYPE                    
            ).value('(./text())[1]', 'varchar(8000)'), 1, 1, ''
        )
FROM Tally AS N
WHERE N.n BETWEEN 1 AND @max
)
SELECT @cols = STRING_AGG(QUOTENAME(combination),',')
      ,@cols_count = STRING_AGG(FORMATMESSAGE('[%s]=COUNT(DISTINCT %s)'
                    ,combination,REPLACE(combination, ',', ' + '','' +') ),',')
FROM cte;

SELECT 
  @cols_casted = STRING_AGG(FORMATMESSAGE('CAST(NULLIF(%s,0) AS VARCHAR(10)) %s'
                 ,name, name), ',')
FROM sys.columns
WHERE object_id = OBJECT_ID('t')
  AND name != 'id';

DECLARE @sql NVARCHAR(MAX);

SET @sql =
'SELECT combination, [count]
FROM (SELECT  <cols_count>
      FROM (SELECT ID, <cols_casted> FROM t )cte) s
UNPIVOT ([count] FOR combination IN (<cols>))AS unpvt';

SET @sql = REPLACE(@sql, '<cols_casted>', @cols_casted);
SET @sql = REPLACE(@sql, '<cols_count>', @cols_count);
SET @sql = REPLACE(@sql, '<cols>', @cols);

SELECT @sql;
EXEC (@sql);

DBFiddle Demo

DBFiddle Demo with 4 variables
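
The core of the final approach is the predicate `n.n & s.bit_value = s.bit_value`: each column gets a power-of-two bit value, and every integer from 1 to 2^k - 1 selects the columns whose bits it contains. A short Python sketch of just that enumeration step (the function name `combinations_by_bitmask` is made up for this note):

```python
def combinations_by_bitmask(items):
    """List every non-empty combination of items, one per integer 1..2^k - 1."""
    # Item i gets bit value 2^i, mirroring POWER(2, item_id - 1) in @Sample.
    bit_values = [1 << i for i in range(len(items))]
    result = []
    for n in range(1, 2 ** len(items)):
        chosen = [
            item
            for item, bit in zip(items, bit_values)
            if (n & bit) == bit
        ]
        result.append(",".join(chosen))
    return result


for combination in combinations_by_bitmask(["a", "b", "c"]):
    print(combination)
```

Because the number of combinations doubles with every extra column, the generated query grows quickly; four variables already produce fifteen counted expressions.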