I created a toy example of my code below.
In this toy example I would like to create a measure of all higher prices minus lower prices within a self-created reference group. So within each reference group, I would like to take each individual and subtract its price value from all higher price values from other individuals in the same group. I do not want to have negative differences. Then I would like to sum all these differences. In creating this code I found some help here:
http://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/
However, the code didn't work perfectly for me, because my dataset is quite large (several 100K obs) and the examples on the website and my code only work until the numlist maximum of 1600 in Stata. (I am using version 12). The toy example with the auto dataset works, due to small size of the dataset.
I would like to ask if someone has an idea how to code this more efficiently, so that I can get around the numlist restriction. I thought about summing the differences directly without saving them in intermediate variables, but that also blow up the numlist restriction.
clear all
sysuse auto
ren headroom refgroup
bysort refgroup : egen pricerank = rank(price)
qui: su pricerank, meanonly
gen test = `r(max)'
su test
foreach i of num 1/`r(max)' {
qui: bys refgroup: gen intermediate`i' = price[_n+`i'] -price if price[_n+`i'] > price
}
egen price_diff = rowmax(intermediate*)
drop intermediate*
If I understand this correctly, this isn't even a problem that requires explicit loops. The sum of all higher prices is just the difference between two cumulative sums. You might need to think through what you want to do if prices are tied.
. clear
. set obs 10
obs was 0, now 10
. gen group = _n > 5
. set seed 2803
. gen price = ceil(1000 * runiform())
. bysort group (price) : gen sumhigherprices = sum(price)
. by group : replace sumhigherprices = sumhigherprices[_N] - sumhigherprices
(10 real changes made)
. list
+--------------------------+
| group price sumhig~s |
|--------------------------|
1. | 0 218 1448 |
2. | 0 264 1184 |
3. | 0 301 883 |
4. | 0 335 548 |
5. | 0 548 0 |
|--------------------------|
6. | 1 125 3027 |
7. | 1 213 2814 |
8. | 1 828 1986 |
9. | 1 988 998 |
10. | 1 998 0 |
+--------------------------+
Edit: For what the OP needs, there is an extra line
. by group : replace sumhigherprices = sumhigherprices - (_N - _n) * price
If I understand the wording of the problem correctly, maybe this can help. It uses joinby
(new observations are created and depending on the size of the original database, you may or not hit the Stata hard-limit on number of observations). The code reproduces the results that would follow from the code of the original post. This is a second attempt. The code before this final edit did not provide the sought-after results. The wording of the problem was somewhat difficult for me to understand.
clear all
set more off
* Load data
sysuse auto
* Delete unnecessary vars
ren headroom refgroup
keep refgroup price
* Generate id´s based on rankings (sort)
bysort refgroup (price): gen id = _n
* Pretty list
order refgroup id
sort refgroup id price
list, sepby(refgroup)
* joinby procedure
tempfile main
save "`main'"
rename (price id) =0
joinby refgroup using "`main'"
list, sepby(refgroup)
* Do not compare with itself and drop duplicates
drop if id0 >= id
* Compute differences and max
gen dif = abs(price0 - price)
collapse (max) dif, by(refgroup id0)
list, sepby(refgroup)