Creating dummy variables from multiple strings in the same row

Posted 2019-09-22 04:18

I have a dataset that looks like this (note that the products within each purchase are separated by a single blank):

Client_ID      Purchase
121212         "Orange_Juice Lettuce"
121212         "Banana Bread "
230102         "Banana Apple"
230102         "Chicken"
121212         "Chicken Bread"
301450         "Grapes Lettuce"
...            ...

Now I want to know which products each client has bought, using one dummy variable per item:

Client_ID    Apple    Banana    Bread    Chicken    Grapes    Lettuce    Orange_Juice
121212       0        1         1        1          0         1          1  
230102       1        1         0        1          0         0          0
301450       0        0         0        0          1         1          0
...          ...      ...       ...      ...        ...       ...        ...

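I asked a similar question a few weeks ago, but in that case I did not have several items in the same row, as I do here, so I am really lost. I tried splitting the items into multiple columns, but that did not work well, because each purchase can contain a different number of items (up to a few dozen, as far as I know).

Any ideas on how to proceed? Thanks in advance!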

Answer 1:

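Here is a flexible solution using PROC FREQ and PROC TRANSPOSE. The SPARSE option gives you the zeros. I assume you only want 1 or 0, hence the NODUPKEY sort; remove NODUPKEY (or remove the sort entirely) if you want a 2 for Bread for the first ID.

First create a vertical dataset with one record per ID/product (splitting Purchase into individual products); then run PROC FREQ on that dataset, so that you have a dataset with 1/0 for each client/product combination; then transpose, using product as the ID and count as the VAR.

If there are products that you want guaranteed to show up as a column of zeros even when nobody has bought them, add a row to the initial table (or anywhere before the PROC FREQ) with a dummy client ID and all possible products, then delete the dummy client ID after the transpose.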

data test;
input @1 Client_ID  6.   @16 Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
;;;;
run;

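/* Vertical dataset: one record per client/product */
data vert;
set test;
format product $20.;
/* Pull out the 1st, 2nd, ... word of Purchase until SCAN returns blank */
do _x = 1 by 1 until (missing(product));
  product=scan(purchase,_x);
  if not missing(product) then output;
end;
run;
/* Keep one record per client/product pair so each count is 0 or 1 */
proc sort data=vert nodupkey;
by client_id product;
run;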

proc freq data=vert;
tables client_id*product/sparse out=prods;
run;

proc transpose data=prods out=horiz;
by client_id;
id product;
var count;
run;
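
The dummy-client idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the dataset names all_products, vert_all, prods_all and horiz_all are made up for this example, client ID 0 is assumed never to occur in the real data, and Pear stands in for a product that nobody has bought. It reuses the vert dataset built above.

```sas
/* Hypothetical master list: every product that must appear as a column. */
/* client_id = 0 is a placeholder assumed not to be a real client.       */
data all_products;
  length product $20;
  input product $;
  client_id = 0;
datalines;
Apple
Banana
Bread
Chicken
Grapes
Lettuce
Orange_Juice
Pear
;
run;

/* Append the placeholder rows to the vertical dataset */
data vert_all;
  set vert(keep=client_id product) all_products;
run;

proc sort data=vert_all nodupkey;
  by client_id product;
run;

/* SPARSE now also yields zero counts for products only the placeholder has */
proc freq data=vert_all noprint;
  tables client_id*product / sparse out=prods_all;
run;

proc transpose data=prods_all out=horiz_all(drop=_name_ _label_);
  by client_id;
  id product;
  var count;
run;

/* Remove the placeholder client once the columns exist */
data horiz_all;
  set horiz_all;
  where client_id ne 0;
run;
```

With this in place, horiz_all should carry a Pear column that is 0 for every real client, which the plain version cannot produce because PROC FREQ only sees products that occur in the data.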


Answer 2:

Here is a DATA step programming solution:

proc sort data=have;
   by client_id;
run;
data want(keep=client_id apple banana bread chicken grapes lettuce orange_juice);
   set have;
      by client_id;
   retain apple banana bread chicken grapes lettuce orange_juice;
   if first.client_id then do;
      apple = 0;
      banana = 0;
      bread = 0 ;
      chicken = 0;
      grapes = 0;
      lettuce = 0;
      orange_juice = 0;
      end;
   length item $20;
   _x = 1;
   item = scan(purchase,_x);
   do while(item ne ' ');
      select(item);
         when('Apple') apple = 1;
         when('Banana') banana = 1;
         when('Bread') bread = 1;
         when('Chicken') chicken = 1;
         when('Grapes') grapes = 1;
         when('Lettuce') lettuce = 1;
         when('Orange_Juice') orange_juice = 1;
         otherwise;
         end;
      _x = _x + 1;
      item = scan(purchase,_x);
      end;
   if last.client_id then output;
run;

Edit: I had missed the part of the question about multiple items in each PURCHASE value. Thanks, Joe!



Answer 3:

This is also a workable solution, one that lets the SAS DATA step do some of the dummy variable coding for you.

data test;
input Client_ID 6. Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
 ;;;;
 run;

filename tmp temp;
 data _null_;
 set test end = done;
 file tmp;
 length product $25 prodlist $1000;
 retain prodlist;
 do i = 1 to countw( purchase, " " );
      product = scan( purchase, i, " " );
      prodlist = ifc( indexw( prodlist, product )=0, catx( ' ', prodlist, product ), prodlist );
 end;
 if done then do; 
    prodlinit=prxchange("s/ /=0; /",-1,compbl(prodlist)); 
    put 'array prods(*) ' prodlist ';'  / prodlinit;
 end;
 run;

 data new;
  set test;
   %inc tmp/source2;
   do i = 1 to dim( prods );
     if indexw(purchase,vname(prods(i))) > 0 then prods(i) = 1;
   end; 
  run;

proc print;
run;


Source: Creating dummy variables from multiple strings in the same row
Tags: sas