Count distinct in a rxSummary

Posted 2019-09-09 15:48

I want to count the distinct values of var2, grouped by var1, in an .xdf file.

I tried something like this:

myFun <- function(dataList) {
    UniqueLevel <<- unique(c(UniqueLevel, dataList$var2))
    SumUniqueLevel <<- length(UniqueLevel)
    return(NULL)
}

rxSummary(formula = ~ var1,
          data = "DefModelo2.xdf",
          transformFunc = myFun,
          transformObjects = list(UniqueLevel = NULL),
          removeZeroCounts = FALSE)

Thank you in advance

EDIT:

Using RevoPemaR is probably the faster way.

2 answers
疯言疯语
#2 · 2019-09-09 16:06

Split by var1, and then for each group, count up the unique values of var2. This assumes that var1 and var2 are factors; if they're not, you'll have to run rxFactors first.

xdflst <- rxSplit(xdf, splitByVars="var1", varsToKeep=c("var1", "var2"))

out <- rxExec(function(grp) {
        var1 <- head(grp, 1)$var1
        var2 <- rxDataStep(grp, varsToKeep="var2")$var2
        data.frame(var1, distinct=length(unique(var2)))
    },
    grp=rxElemArg(xdflst))

do.call(rbind, out)
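
For anyone who wants to check the logic without RevoScaleR, here is a minimal base R sketch of the same split-then-count idea on a small in-memory data frame. The data frame `df` and its values are made up for illustration; only the names var1 and var2 come from the question. It is not a replacement for the rxSplit/rxExec version, which is what you need when the data does not fit in memory.

```r
# Hypothetical in-memory stand-in for the xdf data (values are invented).
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b", "c")),
  var2 = factor(c("x", "y", "x", "z", "z", "x"))
)

# Split var2 into one vector per level of var1,
# then count the distinct values within each group.
groups <- split(df$var2, df$var1)
distinct <- vapply(groups, function(v) length(unique(v)), integer(1))

# One row per var1 level, mirroring the shape of do.call(rbind, out).
result <- data.frame(var1 = names(distinct), distinct = unname(distinct))
print(result)
```

Each group contributes exactly one row, which is also what the rxExec function above should return per split file.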

Or you could get my dplyrXdf package and use a dplyr group_by/summarise pipeline (which basically does all the above, including converting to factors if necessary):

xdf %>% group_by(var1) %>%
    summarise(distinct=n_distinct(var2),
              .rxArgs=list(varsToKeep=c("var1", "var2")))
Juvenile、少年°
#3 · 2019-09-09 16:24

One other option is to use rxCrossTabs. This gives you a cross-tabulation of the two factors, and you can count the nonzero entries in each row to get the number of distinct values of the second factor for each level of the first.

censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
censusXtabAge <- rxCrossTabs(~ F(age):F(wkswork1), data = censusWorkers, 
                             removeZeroCounts = FALSE, returnXtabs = TRUE)
apply(censusXtabAge != 0, MARGIN = 1, sum)
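
The same counting trick can be tried in base R with `table`, which may help when checking the result on a small sample. This is only a sketch on an invented in-memory data frame (`df` and its values are assumptions, not taken from the sample data above):

```r
# Hypothetical in-memory data (values are invented for illustration).
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b", "c")),
  var2 = factor(c("x", "y", "x", "z", "z", "x"))
)

# Cross-tabulate the two factors; combinations that never occur get a 0 cell.
xtab <- table(df$var1, df$var2)

# A nonzero cell means that var2 value occurs within that var1 level,
# so the number of nonzero cells per row is the distinct count.
distinct_by_var1 <- apply(xtab != 0, MARGIN = 1, sum)
print(distinct_by_var1)
```

Note that the cross-tabulation has one cell per pair of levels, so this approach can get large if both factors have many levels.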