How to segment a data.frame according to certain a

2019-02-28 05:21发布

问题:

This question already has an answer here:

  • Consecutive group number in R 3 answers

I am currently dealing with a car data. We recorded the speed of the car every 5 minutes, and it contains a lot of zero values. My question is, how to segment the data by a zero values and give each non-zero section a ordered number in R? Let's take a sample data as example:

sample <- data.frame(
  id = 1:15, 
  speed = c(50,0, 0, 30, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40))

I want to add a new column that gives each non-zero section a number (starting from 1), while a consecutive number of k zero speeds (or more) is numbered as 0. Specifically for this sample data, let's say k equals 2, then my desired result should be like this dataframe:

sample_new <- data.frame(
  id = 1:15, 
  speed = c(50,0, 0, 0, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40), 
  number = c(1, 0, 0, 0, 2, 2, 0 ,0, 3, 3, 3, 3, 3, 3, 3))

which prints as

   id speed number
1   1    50      1
2   2     0      0
3   3     0      0
4   4     0      0
5   5    50      2
6   6    40      2
7   7     0      0
8   8     0      0
9   9    25      3
10 10    30      3
11 11    50      3
12 12     0      3** <- here is the difference
13 13    30      3
14 14    50      3
15 15    40      3

There are more than 1 million rows in my data, so I hope that the solution could be acceptable in speed.

The reason for setting a threshold "k" is that, some drivers just leave their GPS open even if they lock the car and go to sleep. But in other occasion where the interval is less than k, they just stopped because of the crossroad light. I want to focus on the longtime stops and just ignore the short time stops.

Hope my question makes sense to you.Thank you.

回答1:

As processing speed is a concern for the production data set of more than 1 M rows, I suggest to use data.table.

It's quite easy to identify the groups of subsequent non-zero entries:

library(data.table)
setDT(sample)[, number := rleid(speed > 0 ) * (speed > 0)][]
    id speed number
 1:  1    50      1
 2:  2     0      0
 3:  3     0      0
 4:  4    30      3
 5:  5    50      3
 6:  6    40      3
 7:  7     0      0
 8:  8     0      0
 9:  9    25      5
10: 10    30      5
11: 11    50      5
12: 12     0      0
13: 13    30      7
14: 14    50      7
15: 15    40      7

The group numbers are different but aren't numbered consecutively. If this is a requirement it will get tricky:

setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][]
    id speed number
 1:  1    50      1
 2:  2     0     NA
 3:  3     0     NA
 4:  4    30      2
 5:  5    50      2
 6:  6    40      2
 7:  7     0     NA
 8:  8     0     NA
 9:  9    25      3
10: 10    30      3
11: 11    50      3
12: 12     0     NA
13: 13    30      4
14: 14    50      4
15: 15    40      4

If really required, the NAs can be replaced by 0 with

setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][
  is.na(number), number := 0][]

There is an alternative approach

setDT(sample)[, number := {
  tmp <- speed > 0
  cumsum(tmp  - shift(tmp, fill = 0, type = "lag") > 0) * tmp
}][]
    id speed number
 1:  1    50      1
 2:  2     0      0
 3:  3     0      0
 4:  4    30      2
 5:  5    50      2
 6:  6    40      2
 7:  7     0      0
 8:  8     0      0
 9:  9    25      3
10: 10    30      3
11: 11    50      3
12: 12     0      0
13: 13    30      4
14: 14    50      4
15: 15    40      4


标签: r loops order