Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of a dataset that are duplicates in terms of a given set of variables:
clear *
set obs 16
g f1 = _n
expand 104
bys f1: g f2 = _n
expand 2
bys f1 f2: g f3 = _n
expand 41
bys f1 f2 f3: g f4 = _n
des // describe the dataset in memory
preserve
sample 10 // draw a 10% random sample
tempfile sampledata
save `sampledata', replace
restore
// append the duplicate rows to the data
append using `sampledata'
sort f1-f4
duplicates tag f1-f4, generate(dupvar)
browse if dupvar == 1 // check that all duplicate rows have been tagged
Edit
Here is what Stata produces (added on @Arun's request):
f1 f2 f3 f4 dupvar
1 1 1 1 0
1 1 1 2 0
1 1 1 3 1
1 1 1 3 1
1 1 1 4 0
1 1 1 5 0
1 1 1 6 0
1 1 1 7 0
1 1 1 8 1
1 1 1 8 1
Note that for (f1, f2, f3, f4) = (1, 1, 1, 3) there are two rows, and both of those are marked dupvar = 1. The same holds for the two rows with (f1, f2, f3, f4) = (1, 1, 1, 8).
R:
The base function duplicated tags only the second duplicate onwards. So, I wrote a function to replicate the Stata functionality in R, using ddply.
library(plyr)  # provides ddply

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))
# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])
# dummy data
dfDup$data = rnorm(nrow(dfDup))
# function: use ddply to tag all duplicate rows in the data
fnDupTag = function(dfX, indexVars) {
  dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    # a group with more than one row is a set of duplicates
    if(nrow(x) > 1) x$dup = 1 else x$dup = 0
    return(x)
  })
  return(dfDupTag)
}
# test the function
indexVars = paste0('f', 1:4)
dfTemp = fnDupTag(dfDup, indexVars)
But as in the linked question, performance is a huge issue. Another possible solution is
dfDup$dup = duplicated(dfDup[, indexVars]) |
duplicated(dfDup[, indexVars], fromLast = TRUE)
# sort by all index variables (eval(parse(text = indexVars)) would only evaluate the last one, f4)
dfDupSorted = dfDup[do.call(order, dfDup[indexVars]), ]
I have a few questions:
1. Is it possible to make the ddply version faster?
2. Is the second version using duplicated correct? Does it also work for more than two copies of a duplicated row?
3. How would I do this using data.table? Would that be faster?
I'll answer your third question here. (I think the first question is more or less answered in your other post.)
## Assuming DT is your data.table
DT[, dupvar := 1L*(.N > 1L), by=c(indexVars)]
:= adds a new column dupvar by reference (and is therefore very fast because no copies are made). .N is a special variable within data.table that provides the number of observations that belong to each group (here, for every combination of f1, f2, f3, f4).
Take your time and go through ?data.table (and run the examples there) to understand the usage. It'll save you a lot of time later on.
So, basically, we group by indexVars and check whether .N > 1L, which is TRUE for every group that contains more than one row. We multiply by 1L to return an integer instead of a logical value.
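To see the grouping logic on a small case, here is a minimal sketch on made-up data (the toy table and the extra column ndup are illustrative assumptions, not part of the question's data). It also compares the result with the base R duplicated(...) | duplicated(..., fromLast = TRUE) idiom from the question, using a key that occurs three times. As far as I understand Stata's duplicates tag, it stores the number of surplus copies rather than a 0/1 flag, so .N - 1L would be the closer analogue when a key can occur more than twice.

```r
library(data.table)

# Toy data: the key (1, "a") occurs three times, all other keys once
DT <- data.table(f1 = c(1, 1, 1, 2, 2, 3),
                 f2 = c("a", "a", "a", "b", "c", "c"),
                 data = 1:6)
indexVars <- c("f1", "f2")

# 0/1 flag: 1 for every row whose key occurs more than once
DT[, dupvar := 1L * (.N > 1L), by = indexVars]

# Hypothetical extra column: number of surplus copies per key
DT[, ndup := .N - 1L, by = indexVars]

# Base R equivalent of the 0/1 flag, computed on the key columns only
keys <- as.data.frame(DT)[, indexVars]
baseTag <- as.integer(duplicated(keys) | duplicated(keys, fromLast = TRUE))

print(DT)
print(identical(DT$dupvar, baseTag))
```

Both approaches flag all three copies of the repeated key, which suggests the duplicated-based version in the question is fine for more than two copies as well.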
If required, you can also sort it by the by-columns using setkey.
From the next version on (currently implemented in v1.9.3, the development version), there's also an exported function setorder that just sorts the data.table by reference, without setting keys. It can also sort in ascending or descending order. (Note that setkey always sorts in ascending order.)
That is, in the next version you can do:
setorder(DT, f1, f2, f3, f4)
## or equivalently
setorderv(DT, c("f1", "f2", "f3", "f4"))
In addition, the usage DT[order(...)] is also optimised internally to use data.table's fast ordering. That is, DT[order(...)] is detected internally and changed to DT[forder(DT, ...)], which is much faster than base's order. So, if you don't want to change it by reference, and want to assign the sorted data.table to another variable, you can just do:
DT_sorted <- DT[order(f1, f2, f3, f4)] ## internally optimised for speed
## but still copies!
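A small sketch of the difference between the two, on made-up data (the toy table is an assumption for illustration, and it needs a data.table version that exports setorder):

```r
library(data.table)

DT <- data.table(f1 = c(2L, 1L, 2L, 1L),
                 f2 = c(1L, 2L, 2L, 1L))

# DT[order(...)] returns a sorted copy and leaves DT itself untouched
DT_sorted <- DT[order(f1, f2)]
unchanged <- identical(DT$f1, c(2L, 1L, 2L, 1L))

# setorder() reorders DT in place; a minus sign requests descending order
setorder(DT, f1, -f2)

print(DT_sorted)
print(DT)
```

The copy is the price of keeping the original row order around; when the unsorted table is no longer needed, sorting by reference avoids it.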
HTH
I don't really have an answer to your three questions, but I can save you some time. I also split time between Stata and R and often miss Stata's duplicates commands. But if you subset and then merge with all=TRUE, you can save a lot of time.
Here's an example.
# my more Stata-ish approach
system.time({
  dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
  dupes$dup <- 1
  dfTemp2 <- merge(dfDup, dupes, all=TRUE)
  dfTemp2$dup <- ifelse(is.na(dfTemp2$dup), 0, dfTemp2$dup)
})
This is quite a bit faster.
> system.time({
+ fnDupTag = function(dfX, indexVars) {
+ dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
+ if(nrow(x) > 1) x .... [TRUNCATED]
user system elapsed
118.75 0.22 120.11
> # my more Stata-ish approach
> system.time({
+ dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
+ dupes$dup <- 1
+ dfTemp2 <- merge(dfDup, .... [TRUNCATED]
user system elapsed
0.63 0.00 0.63
With identical results (subject to all.equal's precision; the one difference it reports below is in an attribute of the data frame, not in the data columns).
> # compare
> dfTemp <- dfTemp[with(dfTemp, order(f1, f2, f3, f4, data)), ]
> dfTemp2 <- dfTemp2[with(dfTemp2, order(f1, f2, f3, f4, data)), ]
> all.equal(dfTemp, dfTemp2)
[1] "Attributes: < Component 2: Mean relative difference: 1.529748e-05 >"
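One caveat about the subset-then-merge approach: in the example data every duplicated key occurs exactly twice, so dupes holds each key once. If a key can occur three or more times, dupes would hold that key several times and merge would match those rows more than once. A minimal sketch of a guard against that, on made-up data (the toy data frame and the names keys, naive and tagged are assumptions for illustration), is to wrap the subset in unique():

```r
# Toy data: key (1, "a") occurs three times, key (3, "c") twice
df <- data.frame(f1 = c(1, 1, 1, 2, 3, 3),
                 f2 = c("a", "a", "a", "b", "c", "c"),
                 data = 1:6)
keys <- c("f1", "f2")

# Without unique(), the key (1, "a") appears twice in the lookup table
naive <- df[duplicated(df[, keys]), keys]
naive$dup <- 1

# With unique(), each duplicated key appears exactly once
dupes <- unique(df[duplicated(df[, keys]), keys])
dupes$dup <- 1

tagged <- merge(df, dupes, all = TRUE)
tagged$dup <- ifelse(is.na(tagged$dup), 0, tagged$dup)

print(nrow(merge(df, naive, all = TRUE)))  # more rows than df
print(nrow(tagged))                        # same number of rows as df
print(tagged)
```

With the guard in place, the merged result keeps one row per original row and flags every copy of a repeated key.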