Conditional Probability Table in SAS

2019-09-03 15:22发布

I am working in SAS trying to create a conditional probability table.

The current structure of the table is: 5 columns x 10 rows --> the value in each cell is binary. Current Data Table

col1    col2    col3    col4    col5
1   0   1   0   0
0   0   0   1   1
0   0   0   0   0
1   0   0   0   0
1   0   0   0   1
0   1   0   0   0
0   1   0   1   0
1   1   1   1   0
1   0   1   0   1
1   0   1   0   0

I would like to create a table with the conditional probability for every column vs every other column. Ideal Output

--- col1    col2    col3    col4    col5
col1    1.0 0.3 1.0 0.3 0.7
col2    0.2 1.0 0.3 0.7 0.0
col3    0.7 0.3 1.0 0.3 0.3
col4    0.2 0.7 0.3 1.0 0.3
col5    0.3 0.0 0.3 0.3 1.0

This is a much simpler version of the actual problem I am working on (100s of rows & millions of columns, so I'd ideally have a solution which could adjust based on the size of the table).

I've been working with the array and do loop, but haven't been able to get very far.

My current code looks like this (not close to complete):

data ideal_output;
    set binary_table;
    array obs(10,5);
    array output(5,5);
    do i=1 to 5;
        do j=1 to 5;
            do k=1 to 10;
                do l=1 to 10;
        output(m,n) = sum(obs(k,i)*obs(l,j))/sum(obs(k,i));
    end;end;end;end;
run;

2条回答
在下西门庆
2楼-- · 2019-09-03 16:09

You can probably do something equivalent with a summarization proc. It will be a bit messy as you'll have to do some transposing probably and get rid of the '0' rows, but this will start you off perhaps?

proc tabulate data=have out=want;
  class col1-col5;
  tables (col1-col5),(col1-col5)*colpctn/printmiss misstext='0';
run;

data want_fortran;
  set want;
  if sum(of col1-col5) = 2;
run;

Then you can use which columns of col1-col5 are populated to generate column/row names and transpose the dataset.

查看更多
神经病院院长
3楼-- · 2019-09-03 16:24

You have the right sort of idea - the tricky part is loading all your variables into the appropriate arrays. If your full dataset is too large to fit into memory you may need to process one subset of it at a time.

data have;
/*Set length 3 for binary vars to save a bit of memory later*/
length col1-col5 3;
input col1-col5;
cards;
1   0   1   0   0
0   0   0   1   1
0   0   0   0   0
1   0   0   0   0
1   0   0   0   1
0   1   0   0   0
0   1   0   1   0
1   1   1   1   0
1   0   1   0   1
1   0   1   0   0
;
run;

%let NCOLS = 5;
%let NOBS = 10;

data want;
    if 0 then set have;
    array obs[&NOBS,&NCOLS];
    array p[&NCOLS];
    array col[&NCOLS];

    /*Use a DOW-loop to populate the 2-d array*/
    do _n_ = 1 by 1 until (eof);
        set have end = eof;
        do i = 1 to &NCOLS;
            obs[_n_,i] = col[i];
        end;
    end;

    do i=1 to &NCOLS;
        do j=1 to &NCOLS;
            x = 0;
            y = 0;
            do k=1 to &NOBS;
                x + obs[k,i]*obs[k,j];
                y + obs[k,j];
            end;
            p[j] = x / y;
        end;
        output;
    end;
    keep p1-p5; 
run;
查看更多
登录 后发表回答