I have a data set that looks like this:
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
Here, job2 denotes a dummy variable indicating whether a person was a Manager during that year or not. I want to do two things to this data set. First, I only want to keep the rows up to and including the year the person became Boss for the first time. Second, I would like to see the cumulative years a person worked as a Manager and store this information in the variable cumu_job2. Thus I would like to have:
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Manager 1 2
1 Jane 1983 Manager 1 3
1 Jane 1984 Manager 1 4
1 Jane 1985 Manager 1 5
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Worker 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Boss 0 0
I have changed my example and included the Worker position because this better reflects what I want to do with the original data set. The answers in this thread only work when there are only Manager and Boss rows in the data set, so any suggestions for making this work would be great. I'd be very grateful!
Contributed by Matthew Dowle:
Explanation (.SD)
Older versions:
You have two different split-apply-combine operations here: one to get the cumulative jobs, and the other to get the first row of Boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind. The main thing to note is the by=id piece, which means the other expressions are evaluated for each id grouping in the data, which is what you correctly noted was missing from your attempt.
Note this assumes the table is sorted by year within each id, but if it isn't, that's easy enough to fix.
Alternatively you could also achieve the same with:
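As a rough sketch of how the row-number idea could look in data.table (this is not the original snippet): it assumes the question's data is in a data.table called dt, already sorted by year within each id. The names idx, res and first_boss are illustrative, and both the handling of ids that never reach Boss and the cumsum(job2) * job2 reset to zero on non-Manager rows are my own reading of the desired output.

```r
library(data.table)

# Example data rebuilt from the question's table
dt <- data.table(
  id   = rep(1:2, each = 8),
  name = rep(c("Jane", "Bob"), each = 8),
  year = c(1980:1987, 1985:1992),
  job  = c("Worker", rep("Manager", 5), "Boss", "Boss",
           "Worker", "Worker", "Manager", rep("Boss", 5))
)
dt[, job2 := as.integer(job == "Manager")]

# Per id, collect the global row numbers (.I) up to and including the
# first Boss row; keep every row if the id never becomes Boss (assumption)
idx <- dt[, {
  first_boss <- match("Boss", job)
  if (is.na(first_boss)) .I else .I[seq_len(first_boss)]
}, by = id]$V1

# Subset on those row numbers, then take the cumulative sum within each id;
# multiplying by job2 sets the count back to 0 on non-Manager rows
res <- dt[idx]
res[, cumu_job2 := cumsum(job2) * job2, by = id]
```

An unnamed result in j is returned in a column called V1, which is why the row numbers are pulled out with $V1.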
The idea is to get the row numbers where the condition matches (with .I, an internal variable), then subset dt on those row numbers (the $V1 part), and then perform the cumulative sum.
Here is a base solution using within and ave. We assume that the input is DF and that the data is sorted as in the question.
REVISION: Now uses within.
I think this does what you want, although the data must be sorted as you have presented it.
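For the base R route with within and ave, a minimal sketch might look like the following. It assumes the input is a data frame named DF sorted by year within each id. The helper column name boss_before and the result name out are hypothetical, and the multiplication by job2 (to show 0 on non-Manager rows) is inferred from the desired output rather than taken from any of the answers.

```r
# Example data rebuilt from the question's table
DF <- data.frame(
  id   = rep(1:2, each = 8),
  name = rep(c("Jane", "Bob"), each = 8),
  year = c(1980:1987, 1985:1992),
  job  = c("Worker", rep("Manager", 5), "Boss", "Boss",
           "Worker", "Worker", "Manager", rep("Boss", 5)),
  stringsAsFactors = FALSE
)
DF$job2 <- as.integer(DF$job == "Manager")

# For each row, count the Boss rows that come strictly before it within the id
boss_before <- ave(as.integer(DF$job == "Boss"), DF$id,
                   FUN = function(b) cumsum(b) - b)

# Keep everything up to and including the first Boss row
out <- DF[boss_before == 0, ]
rownames(out) <- NULL

# Running count of Manager years per id, set to 0 on non-Manager rows
out <- within(out, cumu_job2 <- ave(job2, id, FUN = cumsum) * job2)
```

Because ave applies the function within each id separately, no explicit loop over people is needed.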
Here is the succinct dplyr solution for the same problem.
NOTE: Make sure that stringsAsFactors = FALSE while reading in the data.
Output:
Explanation
cumu_job2 column.
@BrodieG's is way better:
The Data
#The code:
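A minimal sketch of the data and of a dplyr pipeline in the spirit of the dplyr answer mentioned earlier (not the original snippet). It assumes the data is sorted by year within each id. The object names DF and res are illustrative, and the filter condition and the cumsum(job2) * job2 reset are my own reading of the desired output.

```r
library(dplyr)

# Example data rebuilt from the question's table; strings kept as character
DF <- data.frame(
  id   = rep(1:2, each = 8),
  name = rep(c("Jane", "Bob"), each = 8),
  year = c(1980:1987, 1985:1992),
  job  = c("Worker", rep("Manager", 5), "Boss", "Boss",
           "Worker", "Worker", "Manager", rep("Boss", 5)),
  stringsAsFactors = FALSE
)
DF$job2 <- as.integer(DF$job == "Manager")

res <- DF %>%
  group_by(id) %>%
  # keep rows with no earlier Boss row in the same id (includes the first Boss row)
  filter(cumsum(job == "Boss") - (job == "Boss") == 0) %>%
  # running count of Manager years, set to 0 on non-Manager rows
  mutate(cumu_job2 = cumsum(job2) * job2) %>%
  ungroup()
```

Grouping by id plays the same role here as by=id in data.table: both the filter and the cumulative sum restart for each person.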