I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help
and some other posts, I have tried using the duplicated
function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
Not sure why I get the NA rows at the end (wasn't happening with a similar table I was testing), but works only partially on the values.
I have also tried using the data.table
package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please, forgive if cross-posting. I tried some of the suggestions but none worked for eliminating only those that are consecutive.
I would appreciate any help.
Thanks
You just need to check in there is no duplicate following a number, i.e x[i+1] != x[i] and note the last value will always be present.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x
variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [
.
EDIT for fun here's a microbenchmark
of the answers given so far with rle
being mine, consec
being what I think is the most fundamentally direct answer, given by @James, and would be the answer I would "accept", and dp
being the dplyr
answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle
performs better than I thought it would.
A cheap solution with dplyr
that I could think of:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table
solution. The trick is to create a shifted version of x
with the shift
function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")] # edited to add closing )
This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x
in order for correct handling of the last value.