So I tried to build an automated grouping. The goal is to select the grouping that has the lowest total variance.
In other words, I want to find natural numbers x and y defining the following groups,
GROUP 1: 1977 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994
such that the sum of variance(Response in Group 1), variance(Response in Group 2), and variance(Response in Group 3) is minimized.
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16.14857235
1993 -13.05797186
1993 -13.80941206
1992 -3.521394503
1992 -1.102526302
1992 -0.137573583
1992 2.669238665
1992 -9.540489193
1992 -19.27474303
1992 -3.527077011
1991 1.676464068
1991 -2.238822314
1991 4.663079037
1991 -5.346920963
1990 -8.543723186
1990 0.507460641
1990 0.995302284
1990 0.464194011
1989 4.728791571
1989 5.578685423
1988 2.771297564
1988 7.109159247
1987 15.96059456
1987 2.985292226
1986 -4.301136971
1985 5.854674875
1985 5.797294021
1984 4.393329025
1983 -6.622580905
1982 0.268500302
1977 12.23062252
;
run;
My idea is to have two nested DO loops:
1st DO loop, 1st iteration: Group 1 = 1977 - 1977 (fixed while the 2nd loop runs)
2nd DO loop: Group 2 = 1978 - 1978, 1978 - 1979, 1978 - 1980, … , 1978 - 1993
Else: Group 3 = 1979 - 1994, 1980 - 1994, 1981 - 1994, … , 1994 - 1994
1st DO loop, 2nd iteration: Group 1 = 1977 - 1978
2nd DO loop: Group 2 = 1979 - 1979, 1979 - 1980, 1979 - 1981, … , 1979 - 1993
Else: Group 3 = 1980 - 1994, 1981 - 1994, 1982 - 1994, … , 1994 - 1994
...
1st DO loop, (n-1)th iteration: Group 1 = 1977 - 1991
2nd DO loop: Group 2 = 1992 - 1992, 1992 - 1993
Else: Group 3 = 1993 - 1994, 1994 - 1994
1st DO loop, nth iteration: Group 1 = 1977 - 1992
2nd DO loop: Group 2 = 1993 - 1993
Else: Group 3 = 1994 - 1994
Then I'll just select the grouping that gives the smallest sum of the three within-group variances.
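For one candidate pair (x, y), the quantity to minimize could be computed along these lines; a minimal sketch, where the cut years 1986 and 1991 and the data set names scored and groupvar are just placeholders:
data scored;
set maindat;
if year <= 1986 then group = 1; /* Group 1: 1977 - x, with x = 1986 as a placeholder */
else if year <= 1991 then group = 2; /* Group 2: x+1 - y, with y = 1991 as a placeholder */
else group = 3; /* Group 3: y+1 - 1994 */
run;
proc means data=scored noprint; /* Within-group variances of Response */
class group;
types group;
var response;
output out=groupvar var=var_response;
run;
proc sql; /* The objective: the sum of the three within-group variances */
select sum(var_response) as sumVar
from groupvar;
quit;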
Here is a manual, exhaustive approach. It should solve your problem as stated, but it is not a good way to approach the problem if you want more groups or have larger data.
I'm sure there is a more sensible approach using one of the procs, but nothing springs to mind immediately.
/* Get the year bounds */
proc sql noprint;
select min(year), max(year)
into :yMin, :yMax
from maindat;
quit;
/* Get all the boundaries */
data cutoffs;
do min = &yMin. to &yMax.;
do max = min + 1 to &yMax. + 1;
output;
end;
end;
run;
proc sql;
/* Calculate all the variances */
create table vars as
select
a.*,
var(b.Response) as var
from cutoffs as a
left join maindat as b
on a.min <= b.year < a.max
group by a.min, a.max;
/* Get the sum of the variances for each set of 3 groups */
create table want as
select
a.min as a,
b.min as b,
c.min as c,
c.max as d,
sum(a.var, b.var, c.var) as sumVar
from vars as a
left join vars as b
on a.max = b.min
left join vars as c
on b.max = c.min
/* c.max is an exclusive bound, so the last group must end at &yMax. + 1 to include &yMax. */
where a.min = &yMin. and c.max = &yMax. + 1 and a.var and b.var and c.var
order by a.min, b.min, c.min;
/* Output your answer (combine with previous step if you don't want the list) */
select *
from want
where sumVar in (select min(sumVar) from want);
quit;
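To carry the winning split back onto the data, you could pull the group boundaries into macro variables and assign the groups; a small follow-on sketch, assuming a single winning row (cut2, cut3, and grouped are names introduced here for illustration):
proc sql noprint;
select b, c /* b = first year of group 2, c = first year of group 3 */
into :cut2, :cut3
from want
where sumVar in (select min(sumVar) from want);
quit;
data grouped;
set maindat;
if year < &cut2. then group = 1;
else if year < &cut3. then group = 2;
else group = 3;
run;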
SRSwift's answer may be the best one for the problem as you describe it. The difficulty with the standard clustering approach here is that your objective (the variance of Response) does not appear to have a single local/global minimum; it has multiple local minima, and with so few data points the algorithm has little room to adjust. This is easy to work around when you have a lot of 'years': instead of moving a boundary one year at a time you can move it by five or ten years to jump over local minima. With only a couple dozen years, though, that is impractical.
This is a core machine learning task, clustering, and it has a number of solutions. Your particular problem lends itself to one of the simplest, which I learned in a course a few years ago and find very easy to implement if you think of it in a few pieces:
- Define a function that you want to minimize, say, minim_f.
- Define a function that takes your data and shifts one cluster's centroid (or whatever defines a cluster) by a small amount in one direction, say, modif_f. (The centroid and the direction should be parameters.)
Then you call minim_f and modif_f alternately: call minim_f and record its value, call modif_f with one set of parameters, then call minim_f again and see whether the value improved. If so, keep going in that direction. If not, revert to the values from the previous iteration and try a different modification in modif_f. Keep going until you have found a local minimum, which is hopefully the global minimum.
The exact mechanics vary; in particular, you might adjust one or more centroids at once, and you have to decide how to keep adjusting until no further adjustment improves the objective.
I wrote a small example of this for your data; it arrives at the same answer as SRSwift's, although the variance that PROC MEANS calculates is not identical to the one from SRSwift's program. I am not a statistician and won't say which is right, but they clearly behave similarly enough that it does not matter here. Mine is a very simple implementation and would benefit greatly from improvement, but hopefully it explains the basic concepts.
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16.14857235
1993 -13.05797186
1993 -13.80941206
1992 -3.521394503
1992 -1.102526302
1992 -0.137573583
1992 2.669238665
1992 -9.540489193
1992 -19.27474303
1992 -3.527077011
1991 1.676464068
1991 -2.238822314
1991 4.663079037
1991 -5.346920963
1990 -8.543723186
1990 0.507460641
1990 0.995302284
1990 0.464194011
1989 4.728791571
1989 5.578685423
1988 2.771297564
1988 7.109159247
1987 15.96059456
1987 2.985292226
1986 -4.301136971
1985 5.854674875
1985 5.797294021
1984 4.393329025
1983 -6.622580905
1982 0.268500302
1977 12.23062252
;
run;
proc sort data=maindat;
by year;
run;
proc freq data=maindat; * Start us off with a frequency table by year.;
tables year/out=yearfreq outcum;
run;
data initial_clusters; * Guess that the best starting point is 1/3 of the years for each cluster.;
set yearfreq;
cluster = floor(cum_pct/33.334)+1;
run;
data cluster_years; * Merge on the clusters;
merge maindat initial_clusters(keep=year cluster);
by year;
run;
proc means data=cluster_years; * And get that starting variance.;
class cluster;
types cluster;
var response;
output out=cluster_var var=;
run;
data cluster_var_tot; * Create our starting 'cumulative' file of variances;
set cluster_var end=eof;
total_var+response;
iter=1;
if eof then output;
keep total_var iter;
run;
data current_clusters; * And initialize the current cluster estimate to the initial clusters;
set initial_clusters;
run;
* Here is our recursive cluster-testing macro.;
%macro try_cluster(cluster_adj=, cluster_new=,iter=1);
/* Here I include both MODIF_F and MINIM_F, largely because variable scoping is irritating if I separate them. */
/* But you can easily swap out the MINIM_F portion if needed to a different minimization function. */
/* This is MODIF_F, basically */
data adjusted_clusters;
set current_clusters;
by cluster;
%if &cluster_adj. < &cluster_new. %then %do;
if last.cluster
%end;
%else %do;
if first.cluster
%end;
and cluster=&cluster_adj. then cluster=&cluster_new.;
run;
data cluster_years;
merge maindat adjusted_clusters(keep=year cluster);
by year;
run;
/* end MODIF_F */
/* This would be MINIM_F if it were a function of its own */
proc means data=cluster_years noprint; *Calculate variance by cluster;
class cluster;
types cluster;
var response;
output out=cluster_var var=;
run;
data cluster_var_tot;
set cluster_var_tot cluster_var indsname=dsn end=eof;
retain last_var last_iter;
if dsn='WORK.CLUSTER_VAR_TOT' then do; *Keep the old cluster variances for history;
output;
last_var=total_var;
last_iter=_n_;
end;
else do; *Sum up the variance for this iteration;
total_var+response;
iter=last_iter+1;
if eof then do;
if last_var > total_var then smaller=1; *If it is smaller...;
else smaller=0;
call symputx('smaller',smaller,'l'); *save smaller to a macro variable;
if smaller=1 then output; *... then output it.;
end;
end;
keep total_var iter;
run;
/* End MINIM_F */
%if &smaller=1 %then %do; *If this iteration was better, then keep iterating, otherwise stop;
data current_clusters;
set adjusted_clusters; *replace old clusters with better clusters;
run;
%if &iter<10 %then %try_cluster(cluster_adj=&cluster_adj.,cluster_new=&cluster_new.,iter=&iter.+1);
%end;
%mend try_cluster;
* Let us try a few changes;
%try_cluster(cluster_adj=1,cluster_new=2,iter=1);
%try_cluster(cluster_adj=2,cluster_new=1,iter=1);
%try_cluster(cluster_adj=3,cluster_new=2,iter=1);
* That was just an example (that happens to work for this data);
* This part would be greatly enhanced by some iteration testing and/or data-appropriate modifications;
* Now merge back on the 'current' clusters, since the current cluster_years is actually one worse;
data cluster_years;
merge maindat current_clusters(keep=year cluster);
by year;
run;
* And get the variance just as a verification.;
proc means data=cluster_years;
class cluster;
types cluster;
var response;
output out=cluster_var var=;
run;
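Rather than hand-picking the boundary moves as above, you could cycle through every adjacent boundary move a few times and then rerun the final merge and PROC MEANS step; a sketch only, reusing %try_cluster and the data sets defined above (sweep_clusters is a hypothetical wrapper name):
%macro sweep_clusters(passes=3);
%local pass;
%do pass = 1 %to &passes.;
%* Try shifting each cluster boundary in both directions;
%try_cluster(cluster_adj=1, cluster_new=2, iter=1); %* move the last year of cluster 1 into cluster 2;
%try_cluster(cluster_adj=2, cluster_new=1, iter=1); %* move the first year of cluster 2 into cluster 1;
%try_cluster(cluster_adj=2, cluster_new=3, iter=1); %* move the last year of cluster 2 into cluster 3;
%try_cluster(cluster_adj=3, cluster_new=2, iter=1); %* move the first year of cluster 3 into cluster 2;
%end;
%mend sweep_clusters;
%sweep_clusters(passes=3);
Each call already stops as soon as a move fails to reduce the total variance, so the sweep simply gives every boundary a chance to shift in both directions on each pass.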