R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.
Duplicated rows in a data frame can be obtained with dplyr by grouping on all columns and keeping only the groups that contain more than one row. To exclude certain columns from the comparison, group_by_at(vars(-var1, -var2)) could be used instead to group the data. If the row indices, and not just the data, are actually needed, you could add them as a column first.
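A minimal sketch of this grouping approach; the data frame df and its columns a and b are made-up example data, not from the original answer:

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2, 3),
                 b = c("x", "x", "y", "y", "z"))

# Group on every column and keep rows whose group occurs more than once
dups <- df %>%
  group_by_at(vars(everything())) %>%
  filter(n() > 1) %>%
  ungroup()

# If the original row indices are needed, add them before grouping,
# and group on everything except the index column
dups_with_idx <- df %>%
  mutate(idx = row_number()) %>%
  group_by_at(vars(-idx)) %>%
  filter(n() > 1) %>%
  ungroup()
```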
You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

I've had the same question, and if I'm not mistaken, this is also an answer. I don't know which one is faster, though; the dataset I'm currently using isn't big enough to run tests that produce significant time gaps.
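The assemble/unique/%in% idea above can be sketched on a small made-up vector:

```r
x <- c(1, 2, 3, 3, 3)

# Values flagged as repeats of an earlier element, with repeats collapsed
dup_vals <- unique(x[duplicated(x)])

# TRUE for every occurrence of a duplicated value, first occurrences included
x %in% dup_vals
# FALSE FALSE TRUE TRUE TRUE
```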
If you are interested in which rows are duplicated for certain columns, you can use a plyr approach:
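A sketch of the plyr idea, assuming made-up example columns a and b as the columns of interest:

```r
library(plyr)

df <- data.frame(a = c(1, 1, 2, 3),
                 b = c("x", "x", "y", "z"))

# Split by the chosen columns; return each group only if it occurs
# more than once (returning NULL drops the group from the result)
dup_rows <- ddply(df, .(a, b), function(d) if (nrow(d) > 1) d else NULL)
```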
Adding a count variable with dplyr also works for finding duplicate rows (considering all columns). The benefit of these approaches is that you can specify how many occurrences to use as a cutoff.
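A minimal sketch of the count-variable approach; df and the cutoff of 3 are illustrative assumptions:

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 1, 2),
                 b = c("x", "x", "x", "y"))

# Attach, to every row, how many times that row occurs in the data
counted <- df %>%
  group_by(across(everything())) %>%
  mutate(n = n()) %>%
  ungroup()

# The cutoff is adjustable: e.g. keep rows occurring at least 3 times
filter(counted, n >= 3)
```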
duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.

Some late edit: You didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums
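In place of the original illustration, a minimal sketch of the call-duplicated-twice idea on made-up data:

```r
vec <- c("a", "b", "c", "c", "c")

# Scan forward and backward, then combine: every member of a
# duplicated set is flagged, including the first occurrence
all_dups <- duplicated(vec) | duplicated(vec, fromLast = TRUE)
all_dups
# FALSE FALSE TRUE TRUE TRUE

# The same works row-wise on a data frame
df <- data.frame(x = c(1, 2, 3, 3, 3))
df[duplicated(df) | duplicated(df, fromLast = TRUE), , drop = FALSE]
```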