I want to count the distinct values of var2, grouped by var1, in an .xdf file. I tried something like this:
myFun <- function(dataList) {
    UniqueLevel <<- unique(c(UniqueLevel, dataList$var2))
    SumUniqueLevel <<- length(UniqueLevel)
    return(NULL)
}
rxSummary(formula = ~ var1,
          data = "DefModelo2.xdf",
          transformFunc = myFun,
          transformObjects = list(UniqueLevel = NULL),
          removeZeroCounts = F)
Thank you in advance
EDIT:
Using RevoPemaR is probably the faster way.
One other option is to use rxCrossTabs. This gives you a cross-tabulation of the two factors, and you can count the nonzero entries in each row to get the number of unique values of one factor for each level of the other.
censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
censusXtabAge <- rxCrossTabs(~ F(age):F(wkswork1), data = censusWorkers,
                             removeZeroCounts = FALSE, returnXtabs = TRUE)
apply(censusXtabAge != 0, MARGIN = 1, sum)
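To see why counting nonzero cells works, here is a minimal in-memory sketch of the same idea in base R. It uses table() in place of rxCrossTabs, and the data frame df with its var1/var2 values is made up purely for illustration:

```r
# Toy data (hypothetical): group "a" has two distinct var2 values, "b" has one
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b")),
  var2 = factor(c("x", "y", "x", "z", "z"))
)

# Rows are levels of var1, columns are levels of var2, cells are counts
xt <- table(df$var1, df$var2)

# A nonzero cell means that (var1, var2) pair occurs at least once,
# so the number of nonzero cells per row is the distinct count per group
distinct_by_var1 <- apply(xt != 0, MARGIN = 1, sum)
print(distinct_by_var1)
```

The cross-tabulation has one cell per combination of levels, so this approach is best suited to factors with a moderate number of levels.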
Split by var1, and then for each group, count up the unique values of var2. This assumes that var1 and var2 are factors; if they're not, you'll have to run rxFactors first.
xdflst <- rxSplit(xdf, splitByVars="var1", varsToKeep=c("var1", "var2"))
out <- rxExec(function(grp) {
    # each split file holds a single level of var1
    var1 <- head(grp, 1)$var1
    var2 <- rxDataStep(grp, varsToKeep="var2")$var2
    # one row per group: the group's level and its distinct count
    data.frame(var1, distinct=length(unique(var2)))
},
grp=rxElemArg(xdflst))
do.call(rbind, out)
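For reference, the split-then-count logic looks like this on a small in-memory data frame in plain base R. This is only a sketch of the pattern, not a replacement for the xdf version; df and its values are hypothetical:

```r
# Toy data (hypothetical), with both columns stored as factors
df <- data.frame(
  var1 = c("a", "a", "a", "b", "b"),
  var2 = c("x", "y", "x", "z", "z"),
  stringsAsFactors = TRUE
)

# Split var2 into one vector per level of var1
groups <- split(df$var2, df$var1)

# Count the unique var2 values within each group
distinct <- vapply(groups, function(v) length(unique(v)), integer(1))

# One row per group, mirroring the shape of do.call(rbind, out)
result <- data.frame(var1 = names(distinct), distinct = unname(distinct))
print(result)
```

Running this on a sample of your data is a quick way to sanity-check the output of the xdf pipeline.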
Or you could get my dplyrXdf package and use a dplyr group_by/summarise pipeline (which basically does all the above, including converting to factors if necessary):
xdf %>% group_by(var1) %>%
    summarise(distinct=n_distinct(var2),
              .rxArgs=list(varsToKeep=c("var1", "var2")))