Count observations within dynamic range

2019-02-27 16:52发布

问题:

Consider the following example:

input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end

gen date = mdy(month,day,year) 
format date %td
drop day month year

For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.

Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.

EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principal, it is possible that there are two observations within a group that have identical dates. EDIT End

I have looked into Stata tip 51: Events in intervals. It seems to work out however my dataset is huge (> 1 mio observations) such that it is really really inefficient - especially because I do not care about all treatment = 0 observations.

I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.

Note: My "inefficient" code looks as follows

gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
        count if inrange(date-date[`i'],1,365) & group == group[`i']
        replace count = r(N) in `i'
}
sort group date

回答1:

I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.

sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0

I can't promise this will be faster than a loop but I suspect that it will be.



回答2:

The question is not completely clear.

Consider the following data with two different results, num2 and num3:

  +-----------------------------------------+
  |     date2   group   treat   num2   num3 |
  |-----------------------------------------|
  | 01feb2000       1       1      3      2 |
  | 01jun2000       1       0      .      . |
  | 01sep2000       1       0      .      . |
  | 01nov2000       1       1      0      0 |
  | 01may2002       1       0      .      . |
  | 01jan2010       1       1      1      1 |
  | 01may2010       1       0      .      . |
  |-----------------------------------------|
  | 01jan2001       2       1      0      0 |
  | 01mar2002       2       1      0      0 |
  +-----------------------------------------+

The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.

The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.

num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:

clear
set more off

input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end

list

gen date2 = date(date,"DMY")
format date2 %td
drop date count num

order date
list, sepby(group)


*----- what you want -----

gen num2 = .
isid group date, sort

forvalues j = 1/`=_N' {

    count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j'] 

    replace num2 = r(N) in `j'
}

replace num2 = . if !treat

list, sepby(group)

num3 is computed with code similar in spirit (and results) as that posted by @jfeigenbaum:

<snip>

*----- what you want -----

isid group date, sort
by group: gen indicat = sum(treat)

sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat

list, sepby(group)

Even more than two interpretations are possible for your problem, but I'll leave it at that.

(Note that I have changed your example data to include cases that probably make the problem more realistic.)



标签: stata