I am trying to write a function that allows me to find multiple first occurences of an event in a given year. Events happen to different firms at different moments in time. So an event might happen for the first time in 1980 to firm c and afterwards in 1981 to firm b. In that case, all i need to find is firm c_1980 and the associated value in the matrix.
If however, an event does NOT happen UNTIL it happens to firm a in 1986 and to firm e in 1986 as well, then I need to find as outcome both a_1986 and e_1986 with their respective values in the matrix.
My (2500 * 800) matrix has 2500 different events on the vertical axis and 800 different year_firm combinations on the horizontal one. All values are between 0 and 10 (in the real matrix, in the example between 0 and 2) with the vast majority being zeros.
Example data:
av<-matrix(rep(0:2),10,40)
av[1:7,]=0 ; av[9,3:14]=0
av[,c(22,38)]=1
colnames(av)<-paste(c("a","b","c","d","e"),rep(1980:1987, each=5),sep="_")
col.av<-colnames(av)
rownames(av)<-paste("X",1:10,sep="")
row.av<-rownames(av)
The main formula I have been using gives the positions in the matrix of the first occurence:
first<-max.col(av>0,"first")
This works fine to find the first occurence. However, as the data show, sometimes that are multiple occurences in the same year (e.g. event in row 8 occurs in 1980 for firms a, b, d, and e -> given that this is the first year in which row 8 becomes non-zero I would need to find 4 different values as output).
My code so far is basically a patch-all solution
avdum1<-matrix(cbind(seq(1:nrow(av)),first),nrow=nrow(av),ncol=2)
avdum2<-matrix(cbind(row.av,first),nrow=nrow(av),ncol=2)
Using avdum1 and avdum2 as well as the original row and column names I can then devise a matrix that gives me the first occurence in the original matrix, together with the exact value of the first occurence (1 or 2) as well as the firm-year combination
firsttime<-matrix(cbind(row.av,col.av[first],av[avdum1]),nrow=nrow(av),ncol=3)
So far so good. Now, to find other first occurences in the same year, what I do is
av[avdum1]<-0
This places the original first occurences at zero and then I run through the entire process again, to then expand the firsttime matrix, split up the column names in years and firm names a,b,c,d,e, compare the years and see whether the second first occurence happened in the same year. If it did, I have to rerun the entire process a third time and so on (my real dataset has 40 firms).
This becomes pretty cumbersome so I'm wondering if there is a smarter way to do it? Maybe a localized search once a positive event has been spotted based on the relative position of that event in the matrix?
The final desired outcome:
(if you copy the example data you can ignore the initial warning when producing the matrix)
for rows 1 to 7 , the result would be b_1984 with value 1 for row 8, the result should be a_1980 with 1, b_1980 with 2, d_1980 with 1 and e_1980 with 2 for row 9, a_1980 with 2 for row 10, b_1980 with 1, c_1980 with 2, and e_1980 with 1
Hopefully this clarifies some of the previous questions/comments
Any suggestions would be very welcome!
I gave it a shot, although I followed a bit different path than yours. Perhaps, there could be a way to manipulate your data as is to give the result (and maybe, even, fast) but I preferred to use a "long" format instead. A long format can, also, be fastly manipulated with packages like "data.table" and "dplyr".
Firstly, I transformed your
av
to a long format of the following format:From here on, I guess there would be many different and more efficient approaches, but I could -only- come up with the following:
For each 3rd dimension you could search for 1)which values ([row, column]) are above 0 and 2) which of them are in the minimum column available (i.e. the event occured in an earlier year). An implementation of this could be the following function:
And apply the function to each 3rd dimension; i.e. to each event: