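I'm trying to build an automatic grouping. The goal is to choose the grouping with the smallest variance.
In other words, I want to find natural numbers x and y such that, with
GROUP 1: 1977 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994
the sum of (variance of Response in Group 1, variance of Response in Group 2, variance of Response in Group 3) is minimized.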
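Stated as a formula, the goal might be written as follows. The symbols $r_i$ (the Response of observation $i$), $t_i$ (its Year) and $G_1, G_2, G_3$ are notation introduced here for illustration, and the lower bound 1977 is taken from the earliest year in the data below.

```latex
\min_{\substack{x,\,y \in \mathbb{N} \\ 1977 \le x < y < 1994}}
\; \sum_{k=1}^{3} \operatorname{Var}\bigl(\{\, r_i : t_i \in G_k \,\}\bigr),
\qquad
G_1 = [1977,\, x],\quad G_2 = [x+1,\, y],\quad G_3 = [y+1,\, 1994]
```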
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16.14857235
1993 -13.05797186
1993 -13.80941206
1992 -3.521394503
1992 -1.102526302
1992 -0.137573583
1992 2.669238665
1992 -9.540489193
1992 -19.27474303
1992 -3.527077011
1991 1.676464068
1991 -2.238822314
1991 4.663079037
1991 -5.346920963
1990 -8.543723186
1990 0.507460641
1990 0.995302284
1990 0.464194011
1989 4.728791571
1989 5.578685423
1988 2.771297564
1988 7.109159247
1987 15.96059456
1987 2.985292226
1986 -4.301136971
1985 5.854674875
1985 5.797294021
1984 4.393329025
1983 -6.622580905
1982 0.268500302
1977 12.23062252
;
run;
My idea is to use two nested do loops:
1st do loop (1st iteration):      Group 1   1977 - 1977   1977 - 1977   1977 - 1977   …   1977 - 1977
2nd do loop:                      Group 2   1978 - 1978   1978 - 1979   1978 - 1980   …   1978 - 1993
Else:                             Group 3   1979 - 1994   1980 - 1994   1981 - 1994   …   1994 - 1994
1st do loop (2nd iteration):      Group 1   1977 - 1978   1977 - 1978   1977 - 1978   …   1977 - 1978
2nd do loop:                      Group 2   1979 - 1979   1979 - 1980   1979 - 1981   …   1979 - 1993
Else:                             Group 3   1980 - 1994   1981 - 1994   1982 - 1994   …   1994 - 1994
...
1st do loop ((n-1)th iteration):  Group 1   1977 - 1991   1977 - 1991
2nd do loop:                      Group 2   1992 - 1992   1992 - 1993
Else:                             Group 3   1993 - 1994   1994 - 1994
1st do loop (nth iteration):      Group 1   1977 - 1992
2nd do loop:                      Group 2   1993 - 1993
Else:                             Group 3   1994 - 1994
Then I would pick the grouping whose three groups give the smallest sum of within-group variances of Response.
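The nested loops described above could be sketched as a DATA step that only enumerates the candidate cut points. The data set name `candidates` and the hard-coded year bounds are assumptions for illustration, and scoring each candidate is left out.

```sas
/* Sketch only: list every candidate pair of cut points (x, y).  */
/* Group 1 = 1977..x, Group 2 = x+1..y, Group 3 = y+1..1994.     */
data candidates;
    do x = 1977 to 1992;        /* outer loop: last year of Group 1 */
        do y = x + 1 to 1993;   /* inner loop: last year of Group 2 */
            output;             /* one row per candidate grouping   */
        end;
    end;
run;
```

Each row of `candidates` would then still need the three within-group variances computed and summed before the smallest total can be picked.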
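Here is a manual, exhaustive approach. It should solve your problem as stated, but it is not a good way to tackle the problem if you want more groups or have larger data, because the number of candidate groupings grows quickly with both.
I'm sure there is a more sensible approach using the built-in procs, but I haven't come up with one.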
/* Get the year bounds */
proc sql noprint;
    select min(year), max(year)
    into :yMin, :yMax
    from maindat;
quit;
/* Get all the boundaries */
data cutoffs;
    do min = &yMin. to &yMax.;
        do max = min + 1 to &yMax. + 1;
            output;
        end;
    end;
run;
proc sql;
    /* Calculate all the variances; each interval is [min, max) */
    create table vars as
    select
        a.*,
        var(b.Response) as var
    from cutoffs as a
    left join maindat as b
        on a.min <= b.year < a.max
    group by a.min, a.max;
    /* Get the sum of the variances for each set of 3 groups */
    create table want as
    select
        a.min as a,
        b.min as b,
        c.min as c,
        c.max as d, /* exclusive upper bound of the last group */
        sum(a.var, b.var, c.var) as sumVar
    from vars as a
    left join vars as b
        on a.max = b.min
    left join vars as c
        on b.max = c.min
    /* The last group must end at yMax + 1 so that yMax itself is included. */
    /* The var conditions drop groups whose variance is missing or zero. */
    where a.min = &yMin. and c.max = &yMax. + 1 and a.var and b.var and c.var
    order by a.min, b.min, c.min;
    /* Output your answer (combine with previous step if you don't want the list) */
    select *
    from want
    where sumVar in (select min(sumVar) from want);
quit;
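SRSwift's answer is probably the best one for the problem as you've posed it. The difficulty with standard algorithms here is that your function (the variance of Response) does not seem to have a single local/global minimum but several local minima, so a search with relatively little room to adjust to the density of the data does not work very well. This kind of thing is easy if you have lots of 'years' to work with, because you can jump around five or ten years at a time rather than one (to avoid local minima); with only a couple dozen years, that is impractical.
This is a core machine learning application, clustering, and it has multiple solutions. Your particular problem seems to lend itself to the simplest one, which I learned in a course a few years ago and find quite easy to implement if you think of it in a few pieces.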
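- Define the function you want to minimize, say minim_f.
- Define a function, say modif_f, that takes your data and moves the centroid of one cluster (or whatever defines the cluster) by a small amount in one direction. (The centroid and the direction should be parameters.)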
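Then call minim_f and modif_f alternately: call minim_f and record its value, call modif_f with a set of parameters, then call minim_f again to see whether things improved. If so, keep going in that direction. If not, revert to the values from the previous iteration and try a different modification in modif_f. Keep going until you find a local minimum, which is hopefully the global minimum.
The exact mechanics vary; in particular, you may adjust one centroid or several at a time, and you have to work out the right way to keep adjusting until no further adjustment helps.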
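To make one evaluate/modify/compare step concrete, here is a minimal sketch, assuming the clusters are defined by two cut years as in the question. The macro `%minim_f`, its parameters `x=` and `y=`, and the global macro variable `total_var` are hypothetical names introduced for illustration; they are not part of the example program in this answer, and the starting cut points are arbitrary.

```sas
/* Hypothetical objective: sum of the three within-group variances   */
/* for Group 1 = Year <= x, Group 2 = x < Year <= y, Group 3 = rest. */
/* Caveat: a group with fewer than two observations has a missing    */
/* variance, which SUM() ignores, so such splits need extra care.    */
%macro minim_f(x=, y=);
    %global total_var;
    proc sql noprint;
        select sum(v) format=best32. into :total_var trimmed
        from (select var(Response) as v
              from (select Response,
                           case when Year <= &x. then 1
                                when Year <= &y. then 2
                                else 3
                           end as grp
                    from maindat)
              group by grp);
    quit;
%mend minim_f;

/* One step of the search: evaluate, modify one cut point, re-evaluate. */
%minim_f(x=1986, y=1991);
%let before = &total_var.;
%minim_f(x=1987, y=1991);   /* candidate move: shift the first cut by one year */
%put NOTE: before=&before. after=&total_var. improved=%sysevalf(&total_var. < &before.);
```

If `improved` comes back as 1, the move would be kept and tried again in the same direction; otherwise the previous cut points would be restored and a different move tried.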
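I wrote a small example of this for your data; it arrives at the same answer as SRSwift's, although PROC MEANS does not calculate the variance in quite the same way as SRSwift's program. I'm not a statistician and won't say which is right, but they clearly work similarly enough that it doesn't matter. Mine is a very simple implementation and would benefit greatly from improvement, but hopefully it explains the basic concept.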
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16.14857235
1993 -13.05797186
1993 -13.80941206
1992 -3.521394503
1992 -1.102526302
1992 -0.137573583
1992 2.669238665
1992 -9.540489193
1992 -19.27474303
1992 -3.527077011
1991 1.676464068
1991 -2.238822314
1991 4.663079037
1991 -5.346920963
1990 -8.543723186
1990 0.507460641
1990 0.995302284
1990 0.464194011
1989 4.728791571
1989 5.578685423
1988 2.771297564
1988 7.109159247
1987 15.96059456
1987 2.985292226
1986 -4.301136971
1985 5.854674875
1985 5.797294021
1984 4.393329025
1983 -6.622580905
1982 0.268500302
1977 12.23062252
;
run;
proc sort data=maindat;
    by year;
run;
proc freq data=maindat; * Start us off with a frequency table by year.;
    tables year/out=yearfreq outcum;
run;
data initial_clusters; * Guess that the best starting point is 1/3 of the years for each cluster.;
    set yearfreq;
    cluster = floor(cum_pct/33.334)+1;
run;
data cluster_years; * Merge on the clusters;
    merge maindat initial_clusters(keep=year cluster);
    by year;
run;
proc means data=cluster_years; * And get that starting variance.;
    class cluster;
    types cluster;
    var response;
    output out=cluster_var var=;
run;
data cluster_var_tot; * Create our starting 'cumulative' file of variances;
    set cluster_var end=eof;
    total_var+response;
    iter=1;
    if eof then output;
    keep total_var iter;
run;
data current_clusters; * And initialize the current cluster estimate to the initial clusters;
    set initial_clusters;
run;
* Here is our recursive cluster-testing macro.;
%macro try_cluster(cluster_adj=, cluster_new=,iter=1);
    /* Here I include both MODIF_F and MINIM_F, largely because variable scoping is irritating if I separate them. */
    /* But you can easily swap out the MINIM_F portion if needed to a different minimization function. */
    /* This is MODIF_F, basically */
    data adjusted_clusters;
        set current_clusters;
        by cluster;
        %if &cluster_adj. < &cluster_new. %then %do;
            if last.cluster
        %end;
        %else %do;
            if first.cluster
        %end;
        and cluster=&cluster_adj. then cluster=&cluster_new.;
    run;
    data cluster_years;
        merge maindat adjusted_clusters(keep=year cluster);
        by year;
    run;
    /* end MODIF_F */
    /* This would be MINIM_F if it were a function of its own */
    proc means data=cluster_years noprint; *Calculate variance by cluster;
        class cluster;
        types cluster;
        var response;
        output out=cluster_var var=;
    run;
    data cluster_var_tot;
        set cluster_var_tot cluster_var indsname=dsn end=eof;
        retain last_var last_iter;
        if dsn='WORK.CLUSTER_VAR_TOT' then do; *Keep the old cluster variances for history;
            output;
            last_var=total_var;
            last_iter=_n_;
        end;
        else do; *Sum up the variance for this iteration;
            total_var+response;
            iter=last_iter+1;
            if eof then do;
                if last_var > total_var then smaller=1; *If it is smaller...;
                else smaller=0;
                call symputx('smaller',smaller,'l'); *save smaller to a macro variable;
                if smaller=1 then output; *... then output it.;
            end;
        end;
        keep total_var iter;
    run;
    /* End MINIM_F */
    %if &smaller=1 %then %do; *If this iteration was better, then keep iterating, otherwise stop;
        data current_clusters;
            set adjusted_clusters; *replace old clusters with better clusters;
        run;
        %if &iter<10 %then %try_cluster(cluster_adj=&cluster_adj.,cluster_new=&cluster_new.,iter=&iter.+1);
    %end;
%mend try_cluster;
* Let us try a few changes;
%try_cluster(cluster_adj=1,cluster_new=2,iter=1);
%try_cluster(cluster_adj=2,cluster_new=1,iter=1);
%try_cluster(cluster_adj=3,cluster_new=2,iter=1);
* That was just an example (that happens to work for this data);
* This part would be greatly enhanced by some iteration testing and/or data-appropriate modifications;
* Now merge back on the 'current' clusters, since the current cluster_years is actually one worse;
data cluster_years;
    merge maindat current_clusters(keep=year cluster);
    by year;
run;
* And get the variance just as a verification.;
proc means data=cluster_years;
    class cluster;
    types cluster;
    var response;
    output out=cluster_var var=;
run;