I need to run a regression on panel data that has three dimensions (Year * Company * Country). For example:
============================================
year | comp | count | value.x | value.y
------+------+-------+----------+-----------
2000 | A | USA | 1029.0 | 239481
------+------+-------+----------+-----------
2000 | A | CAN | 2341.4 | 129333
------+------+-------+----------+-----------
2000 | B | USA | 2847.7 | 187319
------+------+-------+----------+-----------
2000 | B | CAN | 4820.5 | 392039
------+------+-------+----------+-----------
2001 | A | USA | 7289.9 | 429481
------+------+-------+----------+-----------
2001 | A | CAN | 5067.3 | 589143
------+------+-------+----------+-----------
2001 | B | USA | 7847.8 | 958234
------+------+-------+----------+-----------
2001 | B | CAN | 9820.0 | 1029385
============================================
However, the R package plm does not seem able to cope with more than two dimensions.
I have tried
result <- plm(value.y ~ value.x, data = dataname, index = c("comp","count","year"))
and it returns an error:
Error in pdata.frame(data, index) :
'index' can be of length 2 at the most (one individual and one time index)
How do you run regressions when the panel data (individual * time) has more than 1 dimension within "individual"?
In case anyone encounters the same situation, I'll put my solution here:
plm does not seem to handle this situation directly, so what I did was add dummies. If the categorical variable you build the dummies from has too many categories to do this by hand, you can try this:
makedummy <- function(colnum, data, interaction = FALSE, interaction_varnum)
{
  ## Append dummy columns (or dummy * x columns) for the categorical column `colnum`
  char0 <- colnames(data)[colnum]
  char1 <- "dummy"
  tmp <- unique(data[, colnum])
  valname <- paste(char0, char1, tmp, sep = ".")
  valname_int <- paste(char0, char1, "int", tmp, sep = ".")
  ## The last category is skipped and serves as the reference level
  for (i in seq_len(length(tmp) - 1))
  {
    index <- data[, colnum] == tmp[i]
    if (!interaction)
    {
      tmp_dummy <- ifelse(index, 1, 0)
    }
    if (interaction)
    {
      ## Value of the chosen variable inside the category, 0 elsewhere
      tmp_dummy <- ifelse(index, data[, interaction_varnum], 0)
    }
    tmp_dummy <- data.frame(tmp_dummy)
    if (!interaction)
    {
      colnames(tmp_dummy) <- valname[i]
    }
    if (interaction)
    {
      colnames(tmp_dummy) <- valname_int[i]
    }
    data <- cbind(data, tmp_dummy)
  }
  return(data)
}
For example:
## Create fake data
fakedata <- matrix(rnorm(300),nrow = 100)
cate <- LETTERS[sample(seq(1,10),100, replace = TRUE)]
fakedata <- cbind.data.frame(cate,fakedata)
## Try this
fakedata <- makedummy(1,fakedata)
## To add dummy * x columns and check whether the categories change the coefficient on x, try this
fakedata <- makedummy(1,fakedata,interaction = TRUE,interaction_varnum = 2)
This is a bit verbose since I haven't polished it; any advice is welcome. Now you can run OLS on your data.
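As a shorter alternative to building the columns by hand, base R can generate the same kind of dummy and dummy * x columns from a factor through the formula interface. A minimal sketch, assuming a data frame df with a categorical column cate and numeric columns x and y (all of these names and the simulated data are made up for illustration):

```r
set.seed(1)
df <- data.frame(
  cate = factor(sample(LETTERS[1:4], 100, replace = TRUE), levels = LETTERS[1:4]),
  x    = rnorm(100),
  y    = rnorm(100)
)

## Design matrix: intercept, 3 dummies (reference level "A" dropped),
## x, and 3 dummy * x interaction columns
X <- model.matrix(~ cate * x, data = df)
colnames(X)

## lm() builds the same columns internally from the formula
fit <- lm(y ~ cate * x, data = df)
summary(fit)
```

The reference level here is the first factor level rather than the last unique value, so the individual dummy coefficients are measured against a different baseline than with makedummy, but the fitted model is the same.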
I think you want to use lm() instead of plm(). This blog post discusses what you're after: https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/
For your example I'd imagine it would look something like the following:
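A sketch of what such an lm() call might look like, assuming the three dimensions simply enter as factor dummies. The data frame name dataname comes from the question and the rows are the eight shown there; the exact formula is an assumption, not necessarily what this answer had in mind:

```r
## The eight rows from the question
dataname <- data.frame(
  year    = rep(c(2000, 2001), each = 4),
  comp    = rep(c("A", "A", "B", "B"), times = 2),
  count   = rep(c("USA", "CAN"), times = 4),
  value.x = c(1029.0, 2341.4, 2847.7, 4820.5, 7289.9, 5067.3, 7847.8, 9820.0),
  value.y = c(239481, 129333, 187319, 392039, 429481, 589143, 958234, 1029385)
)

## Company, country and year fixed effects entered as factor dummies
result <- lm(value.y ~ value.x + factor(comp) + factor(count) + factor(year),
             data = dataname)
summary(result)
```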
I think you can also do:
And then estimate
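One way to read this suggestion, sketched under the assumption that it means merging company and country into a single individual identifier before calling plm. The id column name and the use of interaction() are assumptions, not code from the answer:

```r
library(plm)

## The eight rows from the question
dataname <- data.frame(
  year    = rep(c(2000, 2001), each = 4),
  comp    = rep(c("A", "A", "B", "B"), times = 2),
  count   = rep(c("USA", "CAN"), times = 4),
  value.x = c(1029.0, 2341.4, 2847.7, 4820.5, 7289.9, 5067.3, 7847.8, 9820.0),
  value.y = c(239481, 129333, 187319, 392039, 429481, 589143, 958234, 1029385)
)

## Collapse company and country into one individual identifier
dataname$id <- interaction(dataname$comp, dataname$count, drop = TRUE)

## Each (id, year) pair is now unique, so a two-element index is enough
result <- plm(value.y ~ value.x, data = dataname,
              index = c("id", "year"), model = "within")
summary(result)
```

This treats every company-country pair as its own individual, which is not the same model as separate company and country effects.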
This question is much like these:
If you do not want to create a new dummy, you can use the group_indices function from the dplyr package. Although it does not support mutate, the following approach is straightforward:
The id variable will be your first panel dimension, so you need to set the plm index argument to index = c("id", "year").
For alternatives, you can take a look at this question: R create ID within a group.
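The group-id idea above can be sketched as follows. Note that this uses cur_group_id(), which dplyr 1.0.0 and later provide and which does work inside mutate(), rather than group_indices itself; the id column name follows the text and the rows are the index columns from the question:

```r
library(dplyr)

## Index columns of the eight rows from the question
dataname <- data.frame(
  year  = rep(c(2000, 2001), each = 4),
  comp  = rep(c("A", "A", "B", "B"), times = 2),
  count = rep(c("USA", "CAN"), times = 4)
)

## One integer id per company-country combination
dataname <- dataname %>%
  group_by(comp, count) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()
```

The resulting id column can then be passed to plm through index = c("id", "year").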
If you want to control for another dimension in a within model, simply add a dummy for it:
plm(value.y ~ value.x + count, data = dataname, index = c("comp","year"))
Alternatively (especially for high-dimensional data), look at the lfe package, which can 'absorb' the additional dimension so the summary output is not polluted by the dummy variable.
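A minimal sketch of the lfe approach, assuming all three dimensions are absorbed as fixed effects. The choice of which factors to absorb is an assumption for illustration, and the rows are the eight from the question:

```r
library(lfe)

## The eight rows from the question
dataname <- data.frame(
  year    = rep(c(2000, 2001), each = 4),
  comp    = rep(c("A", "A", "B", "B"), times = 2),
  count   = rep(c("USA", "CAN"), times = 4),
  value.x = c(1029.0, 2341.4, 2847.7, 4820.5, 7289.9, 5067.3, 7847.8, 9820.0),
  value.y = c(239481, 129333, 187319, 392039, 429481, 589143, 958234, 1029385)
)

## Make the grouping variables explicit factors
fe_cols <- c("comp", "count", "year")
dataname[fe_cols] <- lapply(dataname[fe_cols], factor)

## Factors after the `|` are projected out instead of being estimated as dummies
result <- felm(value.y ~ value.x | comp + count + year, data = dataname)
summary(result)
```

The coefficient on value.x should match the one from lm() with the same factors added as dummies; only the reporting differs.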