R: extract maximum value in vector under certain c

2019-07-31 09:27发布

I'm trying to look into a large data set that denotes a person's career history in a firm. I want to see the maximum number of years a person worked as a Manager, under the condition that this person was in a Sales category prior to becoming a Boss (regardless of how many years prior this was). The data looks like the following: job2 is a dummy variable indicating whether the person was a Manager, cumu_job2 denotes the cumulative years a person was in a Manager position (only sequential cumulation is considered).

  id    name    year    job    job2 cumu_job2
    1   Jane    1980    Worker  0   0
    1   Jane    1981    Manager 1   1
    1   Jane    1982    Sales   0   0
    1   Jane    1983    Sales   0   0
    1   Jane    1984    Manager 1   1
    1   Jane    1985    Manager 1   2
    1   Jane    1986    Boss    0   0
    2   Bob     1985    Worker  0   0
    2   Bob     1986    Sales   0   0
    2   Bob     1987    Manager 1   1
    2   Bob     1988    Manager 1   2
    2   Bob     1989    Boss    0   0

By extracting the maximum years a person worked, under the condition that the person had history of working in Sales I would want the data to have another column that denotes this information:

id  name    year    job    job2    cumu_job2 cumu_max 
        1   Jane    1983    Sales       0       0
        1   Jane    1986    Boss        0       2
        2   Bob     1986    Sales       0       0 
        2   Bob     1989    Boss        0       2

So I believe this requires two steps - I first need to only extract the case when person move from Sales to Boss, and then store the maximum value for each person in new vector cumu_max based on cumu_job2.

This is a complex process, so any suggestions would be very much appreciated...!

I have considered why the answer below using dplyr does not work, and here is what I think - the example showed that all people became boss only after becoming Manager, but I also have data points that looks like for Kevin:

id  name    year    job    job2 cumu_job2
        1   Jane    1980    Worker  0   0
        1   Jane    1981    Manager 1   1
        1   Jane    1982    Sales   0   0
        1   Jane    1983    Sales   0   0
        1   Jane    1984    Manager 1   1
        1   Jane    1985    Manager 1   2
        1   Jane    1986    Boss    0   0
        2   Bob     1985    Worker  0   0
        2   Bob     1986    Sales   0   0
        2   Bob     1987    Manager 1   1
        2   Bob     1988    Manager 1   2
        2   Bob     1989    Boss    0   0
        3   Kevin   1991    Manager 1   1
        3   Kevin   1992    Manager 1   2
        3   Kevin   1993    Sales   0   0
        4   Kevin   1994    Boss    0   0

So in the end, I would want

 id name    year    job    cumu_job2 cumu_max 
  1 Jane    1983    Sales       0       0
  1 Jane    1986    Boss        0       2
  2 Bob     1986    Sales       0       0 
  2 Bob     1989    Boss        0       2
  3 Kevin   1993    Sales       0       2
  3 Kevin   1994    Boss        0       2

The dplyr solution only spits out the ones who went from Sales - Manager - Boss without taking into account the possibility of Manager - Sales - Boss (which is more observed in my data set).

1条回答
别忘想泡老子
2楼-- · 2019-07-31 10:11

This may not cover all cases in your actual data but does (mostly) what you are looking for. Note that I added Jill who should be excluded according to your conditions.

require(dplyr)
dat <- read.table(header = TRUE, text = "id    name    year    job    job2 cumu_job2
1   Jane    1980    Worker  0   0
1   Jane    1981    Manager 1   1
1   Jane    1982    Sales   0   0
1   Jane    1983    Sales   0   0
1   Jane    1984    Manager 1   1
1   Jane    1985    Manager 1   2
1   Jane    1986    Boss    0   0
2   Bob     1985    Worker  0   0
2   Bob     1986    Sales   0   0
2   Bob     1987    Manager 1   1
2   Bob     1988    Manager 1   2
2   Bob     1989    Boss    0   0
3   Jill    1989    Worker  0   0
3   Jill    1990    Boss    0   0")

dat %.%
  group_by(id) %.%
  mutate(
    all_jobs = sum(unique(job) %in% c("Sales","Manager","Boss")),
    cumu_max = max(cumu_job2)
  ) %.%
  filter(all_jobs == 3, job %in% c("Sales","Boss"))

Source: local data frame [5 x 8]
Groups: id

  id name year   job job2 cumu_job2 all_jobs cumu_max
1  1 Jane 1982 Sales    0         0        3        2
2  1 Jane 1983 Sales    0         0        3        2
3  1 Jane 1986  Boss    0         0        3        2
4  2  Bob 1986 Sales    0         0        3        2
5  2  Bob 1989  Boss    0         0        3        2
查看更多
登录 后发表回答