Count distinct in a rxSummary

Posted 2019-09-09 15:35

Question:

I want to count the distinct values of var2, grouped by var1, in an .xdf file.

I tried something like this:

myFun <- function(dataList) {
    UniqueLevel <<- unique(c(UniqueLevel, dataList$var2))
    SumUniqueLevel <<- length(UniqueLevel)
    return(NULL)
}

rxSummary(formula = ~ var1,
          data = "DefModelo2.xdf",
          transformFunc = myFun,
          transformObjects = list(UniqueLevel = NULL),
          removeZeroCounts = FALSE)

Thank you in advance

EDIT:

Using RevoPemaR is probably the faster way.

Answer 1:

Another option is to use rxCrossTabs. This gives you a cross-tabulation of the two factors, and you can count the non-zero entries in each row to get the number of distinct values of one factor within each level of the other.

censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
censusXtabAge <- rxCrossTabs(~ F(age):F(wkswork1), data = censusWorkers, 
                             removeZeroCounts = FALSE, returnXtabs = TRUE)
apply(censusXtabAge != 0, MARGIN = 1, sum)
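
The same counting idea can be checked on a small in-memory data frame, with base R's table standing in for rxCrossTabs. This is only a sketch: the data frame df and its values are made up for illustration and are not taken from the sample data.

```r
# Hypothetical toy data standing in for the .xdf file
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b", "c")),
  var2 = factor(c("x", "y", "x", "z", "z", "x"))
)

# Cross-tabulate var1 (rows) against var2 (columns)
xtab <- table(df$var1, df$var2)

# A non-zero cell means that var2 level occurs within that var1 group,
# so counting the non-zero cells in each row gives the distinct count per group
distinctByVar1 <- apply(xtab != 0, MARGIN = 1, sum)
distinctByVar1
```

Note that the cross-tabulation has one cell per combination of levels, so this approach is best suited to factors with a moderate number of levels.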


Answer 2:

Split by var1, and then for each group, count the unique values of var2. This assumes that var1 and var2 are factors; if they're not, you'll have to run rxFactors first.

xdflst <- rxSplit(xdf, splitByVars="var1", varsToKeep=c("var1", "var2"))

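out <- rxExec(function(grp) {
        # each split file holds a single level of var1, so the first row identifies the group
        var1 <- head(grp, 1)$var1
        # read the var2 column for this group into memory
        var2 <- rxDataStep(grp, varsToKeep="var2")$var2
        # one row per group: the var1 level and its number of distinct var2 values
        data.frame(var1, distinct=length(unique(var2)))
    },
    grp=rxElemArg(xdflst))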

do.call(rbind, out)
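
For comparison, here is a sketch of the same split, count, and combine pattern on an in-memory data frame using only base R. The data frame df is hypothetical, and split and lapply merely stand in for rxSplit and rxExec; the sketch does not touch any .xdf file.

```r
# Hypothetical toy data standing in for the .xdf file
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b", "c")),
  var2 = factor(c("x", "y", "x", "z", "z", "x"))
)

# Split into one data frame per level of var1
groups <- split(df, df$var1)

# For each group, record the var1 level and the number of distinct var2 values
out <- lapply(groups, function(grp) {
  data.frame(var1 = grp$var1[1], distinct = length(unique(grp$var2)))
})

# Combine the one-row results into a single data frame
result <- do.call(rbind, out)
result
```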

Or you could get my dplyrXdf package and use a dplyr group_by/summarise pipeline (which basically does all the above, including converting to factors if necessary):

xdf %>% group_by(var1) %>%
    summarise(distinct=n_distinct(var2),
              .rxArgs=list(varsToKeep=c("var1", "var2")))