-->

R - overcome “cut” ignoring values outside of rang

2019-08-28 05:44发布

问题:

I am comparing two years worth of daily soil moisture (SM) measurements. In one year, SM ranged from 0 to 0.6. In the other year, which had more rain, SM ranged from 0 to 0.8. Amongst the data, I also have some NA's, where the SM sensor did not work for some reason. Let's re-create something similar:

library(data.table)
set.seed(24)
dt1 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"), 
                  sm=sample(c(NA, runif(10, min=0, max=0.6)), 365, replace = TRUE))

dt2 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"), 
                  sm=sample(c(NA, runif(10, min=0, max=0.8)), 365, replace = TRUE))

I am trying to compare both datasets based on the proportion of values between classes of SM in each month. The classes I am interested in are seq(0, 0.8, by=0.2). I also need to count the proportion of failed measurements (NA) per month.

I managed to do that by using akrun's helpful answer here: R - Calculate percentage of occurrences in data.table by month

tmp1 <- dt1[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
                                    by=.(month=month(date),
                                         grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                                                 labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]

tmp2 <- dt2[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
                                    by=.(month=month(date),
                                         grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                                                 labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]

However, the output is not exactly what I expect. Since values in dt1 range only from 0 to 0.6, there is no 0.6-0.8 category at all in the resulting data table tmp1.

It looks like cut ignores the last category (0.6-0.8) because there is no SM measurement within that range. This makes my comparison really inconvenient, because I don't have the same groups in the resulting data tables tmp1 and tmp2.

Does anybody know how to fix this, i.e. how to "force" cut to consider values outside the break range? I need all SM categories in both tmp1 and tmp2, even if their count is 0.

Just as a reference, this issue does not happen if we use table, which always shows all categories even if their count is zero:

t1 <- runif(10, 0, 0.6)
t2 <- runif(10, 0, 0.8)

table(cut(t1, breaks=seq(0, 0.8, by=0.2)))

  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
        5         3         2         0 
table(cut(t2, breaks=seq(0, 0.8, by=0.2)))

  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
        1         3         2         4 

Any input appreciated.

回答1:

Use CJ to count all levels, even those that don't appear in the table:

f = function(d){

    # create month column
    d[, month := month(date)]

    # roll to make cut-group column
    mdt = data.table(sm = c(NA, seq(0, .8, by=.2)))
    d[, lb := mdt[.SD, on=.(sm), roll=TRUE, x.sm]]

    # join with CJ to ensure all levels are present
    res = d[CJ(month = month, lb = mdt$sm, unique = TRUE), on=.(month, lb), .N, by=.EACHI]

    # rescale to monthly pct
    res[, pct := N/sum(N), by=month][]

}

# try it
f(dt1)
f(dt2)

You could also do this with cut. The important thing is how you're tabulating results, not how you're grouping them...