How to Randomly Assign to Groups of Different Size

2019-08-17 01:09发布

问题:

Say I have a dataset and I want to assign observations to different groups, the size of groups determined by the data. For example, suppose that this is the data:

sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
*Create global with regions
global region NE NCntrl South West
*Count the number in each region
bys reg (pop): gen reg_N=_N
tab reg

There are four reg groups, all of different sizes. Now, I want to randomly assign observations to the four groups. This is accomplished below by generating a random number and then assigning observations to one of the groups based on the random number.

*Generate random number
set seed 1
gen random = runiform()
sort random
*Assign observations to number based on random sorting
egen reg_rand = seq(), from(1) to (4)
*Map number to region
gen reg_new = ""
global count 1
foreach i in $region {
    replace reg_new = "`i'" if reg_rand==$count
    global count = $count + 1
}
bys reg_new: gen reg_new_N = _N
tab reg_new

This is not what I want, though. Instead of using the seq() command, which creates groups of equal sizes (assuming N divided by number of groups is a whole number), I would like to randomly assign based on the size of the original groups. In this case, that is equivalent to reg_N. For example, there would be 12 observations that have a reg_new value of NCntrl.

I might have one solution similar to https://stats.idre.ucla.edu/stata/faq/how-can-i-randomly-assign-observations-to-groups-in-stata/. The idea would be to save the results of tab reg into a macro or matrix, and then use a loop and replace to cycle through the observations, which are sorted by a random number. Assume that there are many, many more groups than the four in this toy example. Is there a more reasonable way to accomplish this?

回答1:

It looks like you want to shuffle around the values stored in a group variable across observations. You can do this by reducing the data to the group variable, sorting on a variable that contains random values and then using an unmatched merge to associate the random group identifiers to the original observations. Assuming that the data example is stored in a file called "data_example.dta" and is currently loaded into memory, this would look like:

set seed 234
keep reg
rename reg reg_new
gen double u = runiform()
sort u reg_new
merge 1:1 _n  using "data_example.dta", nogen

tab reg reg_new