R: Searching & isolating multiple first occurences

I am trying to write a function that allows me to find multiple first occurences of an event in a given year. Events happen to different firms at different moments in time. So an event might happen for the first time in 1980 to firm c and afterwards in 1981 to firm b. In that case, all i need to find is firm c_1980 and the associated value in the matrix.

If however, an event does NOT happen UNTIL it happens to firm a in 1986 and to firm e in 1986 as well, then I need to find as outcome both a_1986 and e_1986 with their respective values in the matrix.

My (2500 * 800) matrix has 2500 different events on the vertical axis and 800 different year_firm combinations on the horizontal one. All values are between 0 and 10 (in the real matrix, in the example between 0 and 2) with the vast majority being zeros.

Example data:

av<-matrix(rep(0:2),10,40)
av[1:7,]=0 ; av[9,3:14]=0
av[,c(22,38)]=1
colnames(av)<-paste(c("a","b","c","d","e"),rep(1980:1987, each=5),sep="_")
col.av<-colnames(av)
rownames(av)<-paste("X",1:10,sep="")
row.av<-rownames(av)

The main formula I have been using gives the positions in the matrix of the first occurence:

first<-max.col(av>0,"first")

This works fine to find the first occurence. However, as the data show, sometimes that are multiple occurences in the same year (e.g. event in row 8 occurs in 1980 for firms a, b, d, and e -> given that this is the first year in which row 8 becomes non-zero I would need to find 4 different values as output).

My code so far is basically a patch-all solution

avdum1<-matrix(cbind(seq(1:nrow(av)),first),nrow=nrow(av),ncol=2)
avdum2<-matrix(cbind(row.av,first),nrow=nrow(av),ncol=2)

Using avdum1 and avdum2 as well as the original row and column names I can then devise a matrix that gives me the first occurence in the original matrix, together with the exact value of the first occurence (1 or 2) as well as the firm-year combination

firsttime<-matrix(cbind(row.av,col.av[first],av[avdum1]),nrow=nrow(av),ncol=3)

So far so good. Now, to find other first occurences in the same year, what I do is

av[avdum1]<-0

This places the original first occurences at zero and then I run through the entire process again, to then expand the firsttime matrix, split up the column names in years and firm names a,b,c,d,e, compare the years and see whether the second first occurence happened in the same year. If it did, I have to rerun the entire process a third time and so on (my real dataset has 40 firms).

This becomes pretty cumbersome so I'm wondering if there is a smarter way to do it? Maybe a localized search once a positive event has been spotted based on the relative position of that event in the matrix?

The final desired outcome:

(if you copy the example data you can ignore the initial warning when producing the matrix)

for rows 1 to 7 , the result would be b_1984 with value 1 for row 8, the result should be a_1980 with 1, b_1980 with 2, d_1980 with 1 and e_1980 with 2 for row 9, a_1980 with 2 for row 10, b_1980 with 1, c_1980 with 2, and e_1980 with 1

Hopefully this clarifies some of the previous questions/comments

Any suggestions would be very welcome!

标签： r search matrix time-series

1条回答

唯我独甜

2楼-- · 2019-08-20 03:55

I gave it a shot, although I followed a bit different path than yours. Perhaps, there could be a way to manipulate your data as is to give the result (and maybe, even, fast) but I preferred to use a "long" format instead. A long format can, also, be fastly manipulated with packages like "data.table" and "dplyr".

Firstly, I transformed your av to a long format of the following format:

#turn to long format
long_DF = as.data.frame(as.table(av), responseName = "value")

#tidy up
tmp = do.call(rbind.data.frame, strsplit(as.character(long_DF[[2]]), "_"))
long_DF$firm = tmp[, 1] ; long_DF$year = tmp[, 2] 
long_DF$event = long_DF[[1]] ; long_DF = long_DF[-(1:2)]

long_DF[c(1,4,5,8,15,16,20), ]
#   value firm year event
#1      0    a 1980    X1
#4      0    a 1980    X4
#5      0    a 1980    X5
#8      1    a 1980    X8
#15     0    b 1980    X5
#16     0    b 1980    X6
#20     1    b 1980   X10

From here on, I guess there would be many different and more efficient approaches, but I could -only- come up with the following:

#3D array
res = xtabs(value ~ firm + year + event, long_DF)

res[, , 3, drop = F]
#, , event = X3
#
#    year
#firm 1980 1981 1982 1983 1984 1985 1986 1987
#   a    0    0    0    0    0    0    0    0
#   b    0    0    0    0    1    0    0    0
#   c    0    0    0    0    0    0    0    1
#   d    0    0    0    0    0    0    0    0
#   e    0    0    0    0    0    0    0    0

For each 3rd dimension you could search for 1)which values ([row, column]) are above 0 and 2) which of them are in the minimum column available (i.e. the event occured in an earlier year). An implementation of this could be the following function:

#function to apply to each 3rd dimension
f1 = function(x) {
      wh = which(x > 0, arr.ind = T)
      wh2 = which(wh[, "col"] == min(wh[, "col"]))
      wh3 = wh[wh2, , drop = F]
      cbind.data.frame(firm = rownames(x)[wh3[, 1]],
                       year = colnames(x)[wh3[, 2]], 
                       val = x[wh3])
}

And apply the function to each 3rd dimension; i.e. to each event:

ret = apply(res, 3, f1)
#ret
ans = cbind.data.frame(event = rep(names(ret), lapply(ret, nrow)), 
                       do.call(rbind.data.frame, apply(res, 3, f1)))
ans
#      event firm year val
#X1       X1    b 1984   1
#X2       X2    b 1984   1
#X3       X3    b 1984   1
#X4       X4    b 1984   1
#X5       X5    b 1984   1
#X6       X6    b 1984   1
#X7       X7    b 1984   1
#X8.1     X8    a 1980   1
#X8.2     X8    b 1980   2
#X8.3     X8    d 1980   1
#X8.4     X8    e 1980   2
#X9       X9    a 1980   2
#X10.1   X10    b 1980   1
#X10.2   X10    c 1980   2
#X10.3   X10    e 1980   1

0人赞添加讨论(0) 举报

R: Searching & isolating multiple first occurences

The final desired outcome:

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间