I have a dataset that looks like this (note that the products within each purchase are separated by a single blank):
Client_ID Purchase
121212 "Orange_Juice Lettuce"
121212 "Banana Bread "
230102 "Banana Apple"
230102 "Chicken"
121212 "Chicken Bread"
301450 "Grapes Lettuce"
... ...
Now I would like to know which products each person bought, using one dummy variable per item:
Client_ID Apple Banana Bread Chicken Grapes Lettuce Orange_Juice
121212 0 1 1 1 0 1 1
230102 1 1 0 1 0 0 0
301450 0 0 0 0 1 1 0
... ... ... ... ... ... ... ...
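I asked a similar question a few weeks ago, but in that case I did not have several items on the same row, as I do here, so I am really lost. I tried splitting the items into multiple columns, but that did not work well because each purchase can contain a different number of items (up to a few dozen, as far as I know).
Any ideas on how to proceed? Thanks in advance!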
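Here is a flexible solution using PROC FREQ and PROC TRANSPOSE. The SPARSE option gets you the zeros. I assume you only want 1 or 0, hence the NODUPKEY sort; remove NODUPKEY (or remove the sort entirely) if you want a 2 for Bread on the first ID.
First create a vertical dataset with one record per ID/product (splitting Purchase into its products); then run PROC FREQ on that dataset, so that you have a dataset with a 1/0 count for every client/product combination; then transpose it, using product as the ID variable and count as the VAR.
If there are products that you want guaranteed to show up as zero even when nobody bought them, add a row to the initial table (or anywhere before the PROC FREQ) with a dummy client ID and all possible products, then delete the dummy client ID after the transpose.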
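data test;
    /* Purchase starts in column 8 in the data lines as shown here;
       adjust the column pointer if your data are aligned differently. */
    input @1 Client_ID 6. @8 Purchase $50.;
datalines;
121212 Orange_Juice Lettuce
121212 Banana Bread
230102 Banana Apple
230102 Chicken
121212 Chicken Bread
301450 Grapes Lettuce
;;;;
run;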
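/* One record per client/product: split Purchase into its words */
data vert;
    set test;
    format product $20.;
    do _x = 1 by 1 until (missing(product));
        product = scan(purchase, _x);
        if not missing(product) then output;
    end;
run;

/* Keep one record per client/product so the counts are 1 or 0 */
proc sort data=vert nodupkey;
    by client_id product;
run;

/* SPARSE also writes the zero-count combinations to PRODS */
proc freq data=vert;
    tables client_id*product / sparse out=prods;
run;

/* One row per client, one column per product */
proc transpose data=prods out=horiz;
    by client_id;
    id product;
    var count;
run;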
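The dummy-client trick mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the ID 999999, the product Kiwi (standing in for a product nobody bought), and the dataset names vert_all, prods_all, and horiz_all are all hypothetical, and VERT is the one-record-per-ID/product dataset built in the code above.

```sas
/* Sketch only: 999999 is a hypothetical dummy ID and 'Kiwi' a hypothetical
   product that nobody bought; VERT is the one-row-per-ID/product dataset. */
data vert_all;
    set vert(keep=client_id product) end=last;
    output;
    if last then do;
        /* Append one record per product for the dummy client */
        client_id = 999999;
        do product = 'Apple', 'Banana', 'Bread', 'Chicken', 'Grapes',
                     'Kiwi', 'Lettuce', 'Orange_Juice';
            output;
        end;
    end;
run;

proc sort data=vert_all nodupkey;
    by client_id product;
run;

proc freq data=vert_all noprint;
    tables client_id*product / sparse out=prods_all;
run;

/* Drop the dummy client while transposing */
proc transpose data=prods_all(where=(client_id ne 999999)) out=horiz_all;
    by client_id;
    id product;
    var count;
run;
```

Because SPARSE writes every client/product combination, the real clients should each get a Kiwi column containing 0 even though only the dummy client ever "bought" it.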
Here is a DATA step programming solution:
proc sort data=have;
    by client_id;
run;

data want(keep=client_id apple banana bread chicken grapes lettuce orange_juice);
    set have;
    by client_id;
    retain apple banana bread chicken grapes lettuce orange_juice;

    /* Reset the flags at the start of each client */
    if first.client_id then do;
        apple = 0;
        banana = 0;
        bread = 0;
        chicken = 0;
        grapes = 0;
        lettuce = 0;
        orange_juice = 0;
    end;

    /* Walk through the words in Purchase and set the matching flag */
    length item $20;
    _x = 1;
    item = scan(purchase, _x);
    do while (item ne ' ');
        select (item);
            when ('Apple')        apple = 1;
            when ('Banana')       banana = 1;
            when ('Bread')        bread = 1;
            when ('Chicken')      chicken = 1;
            when ('Grapes')       grapes = 1;
            when ('Lettuce')      lettuce = 1;
            when ('Orange_Juice') orange_juice = 1;
            otherwise;
        end;
        _x = _x + 1;
        item = scan(purchase, _x);
    end;

    /* Write one row per client */
    if last.client_id then output;
run;
Edit: I missed the part of the question about multiple items in each PURCHASE variable. Thanks, Joe!
This is also a workable solution; it lets a SAS DATA step write some of the dummy-variable code for you.
data test;
    input Client_ID 6. Purchase $50.;
datalines;
121212 Orange_Juice Lettuce
121212 Banana Bread
230102 Banana Apple
230102 Chicken
121212 Chicken Bread
301450 Grapes Lettuce
;;;;
run;

filename tmp temp;

/* Collect the distinct products and write an ARRAY statement plus
   "product=0;" initializations to the temporary file */
data _null_;
    set test end = done;
    file tmp;
    length product $25 prodlist $1000;
    retain prodlist;
    do i = 1 to countw( purchase, " " );
        product = scan( purchase, i, " " );
        prodlist = ifc( indexw( prodlist, product )=0, catx( ' ', prodlist, product ), prodlist );
    end;
    if done then do;
        prodlinit=prxchange("s/ /=0; /",-1,compbl(prodlist));
        put 'array prods(*) ' prodlist ';' / prodlinit;
    end;
run;

/* Include the generated code, then flag each product found in Purchase */
data new;
    set test;
    %inc tmp/source2;
    do i = 1 to dim( prods );
        if indexw(purchase,vname(prods(i))) > 0 then prods(i) = 1;
    end;
run;

proc print;
run;
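As written, the step above yields one row per purchase record rather than one row per client. If one row per client is wanted, as in the question, a minimal sketch would be to take the maximum of each dummy within Client_ID. The output dataset name want is an assumption, and the VAR list assumes the seven products from the sample data; with real data it would have to match whatever products were found.

```sas
/* Sketch only: collapse NEW to one row per client by taking the maximum
   of each 0/1 flag. WANT is a hypothetical output name, and the VAR list
   assumes the seven products in the sample data. */
proc summary data=new nway;
    class client_id;
    var Orange_Juice Lettuce Banana Bread Apple Chicken Grapes;
    output out=want(drop=_type_ _freq_) max=;
run;
```

Using MAX keeps each flag at 1 if the client bought the product on any row; SUM in its place would give the number of purchase rows that mention the product.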