I'm trying to look into a large data set that denotes a person's career history in a firm. I want to see the maximum number of years a person worked as a Manager
, under the condition that this person was in a Sales
category prior to becoming a Boss
(regardless of how many years prior this was). The data looks like the following: job2
is a dummy variable indicating whether the person was a Manager
, cumu_job2
denotes the cumulative years a person was in a Manager
position (only sequential cumulation is considered).
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Sales 0 0
1 Jane 1983 Sales 0 0
1 Jane 1984 Manager 1 1
1 Jane 1985 Manager 1 2
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Sales 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Manager 1 2
2 Bob 1989 Boss 0 0
By extracting the maximum years a person worked, under the condition that the person had history of working in Sales
I would want the data to have another column that denotes this information:
id name year job job2 cumu_job2 cumu_max
1 Jane 1983 Sales 0 0
1 Jane 1986 Boss 0 2
2 Bob 1986 Sales 0 0
2 Bob 1989 Boss 0 2
So I believe this requires two steps - I first need to only extract the case when person move from Sales
to Boss
, and then store the maximum value for each person in new vector cumu_max
based on cumu_job2
.
This is a complex process, so any suggestions would be very much appreciated...!
I have considered why the answer below using dplyr
does not work, and here is what I think - the example showed that all people became boss only after becoming Manager, but I also have data points that looks like for Kevin
:
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Sales 0 0
1 Jane 1983 Sales 0 0
1 Jane 1984 Manager 1 1
1 Jane 1985 Manager 1 2
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Sales 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Manager 1 2
2 Bob 1989 Boss 0 0
3 Kevin 1991 Manager 1 1
3 Kevin 1992 Manager 1 2
3 Kevin 1993 Sales 0 0
4 Kevin 1994 Boss 0 0
So in the end, I would want
id name year job cumu_job2 cumu_max
1 Jane 1983 Sales 0 0
1 Jane 1986 Boss 0 2
2 Bob 1986 Sales 0 0
2 Bob 1989 Boss 0 2
3 Kevin 1993 Sales 0 2
3 Kevin 1994 Boss 0 2
The dplyr solution only spits out the ones who went from Sales - Manager - Boss without taking into account the possibility of Manager - Sales - Boss (which is more observed in my data set).
This may not cover all cases in your actual data but does (mostly) what you are looking for. Note that I added Jill who should be excluded according to your conditions.