Cumulative count of unique values in R

2019-02-04 12:45发布

问题:

A simplified version of my data set would look like:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

I would like to make a new data set where, for each value of "depth", I would have the cumulative number of unique values, starting from the top. e.g.

depth cumsum
 1      2
 2      2
 3      3

Any ideas as to how to do this? I am relatively new to R.

回答1:

I find this a perfect case of using factor and setting levels carefully. I'll use data.table here with this idea. Make sure your value column is character (not an absolute requirement).

  • step 1: Get your data.frame converted to data.table by taking just unique rows.

    require(data.table)
    dt <- as.data.table(unique(df))
    setkey(dt, "depth") # just to be sure before factoring "value"
    
  • step 2: Convert value to a factor and coerce to numeric. Make sure to set the levels yourself (it is important).

    dt[, id := as.numeric(factor(value, levels = unique(value)))]
    
  • step 3: Set key column to depth for subsetting and just pick the last value

     setkey(dt, "depth", "id")
     dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
    
    #    depth id
    # 1:     1  2
    # 2:     2  2
    # 3:     3  3
    
  • step 4: Since all values in the rows with increasing depth should have at least the value of the previous row, you should use cummax to get the final output.

    dt.out[, id := cummax(id)]
    

Edit: The above code was for illustrative purposes. In reality you don't need a 3rd column at all. This is how I'd write the final code.

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]

Here's a more tricky example and the output from the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6


回答2:

Here is another attempt:

numvals <- cummax(as.numeric(factor(mydf$value)))
aggregate(numvals, list(depth=mydf$depth), max)

Which gives:

  depth x
1     1 2
2     2 2
3     3 3

It seems to work with @Arun's example, too:

  depth x
1     1 2
2     2 4
3     3 4
4     4 5
5     5 6
6     6 6


回答3:

This can be written in a relatively clean fashion with a single SQL statement using the sqldf package. Assume DF is the original data frame:

library(sqldf)

sqldf("select b.depth, count(distinct a.value) as cumsum
    from DF a join DF b 
    on a.depth <= b.depth
    group by b.depth"
)


回答4:

A good first step would be to create a column of TRUE or FALSE, where it is TRUE for the first of each value and FALSE for later appearances of that value. This can be done easily using duplicated:

mydata$first.appearance = !duplicated(mydata$value)

Reshaping the data is best done using aggregate. In this case, it says to sum over the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result will look like:

  depth first.appearance
1     1  2
2     2  0
3     3  1

This is still not a cumulative sum, though. For that you can use the cumsum function (and then get rid of your old column):

newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

So to recap:

mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

Output:

  depth cumsum
1     1      2
2     2      2
3     3      3


回答5:

Here is another solution using lapply(). With unique(df$depth) make vector of unique depth values and then for each such value subset only those value values where depth is equal or less than particular depth value. Then calculate length of unique value values. This length value is stored in cumsum, then depth=x will give value of particular depth level. With do.call(rbind,...) make it as one data frame.

do.call(rbind,lapply(unique(df$depth), 
               function(x)
             data.frame(depth=x,cumsum=length(unique(df$value[df$depth<=x])))))
  depth cumsum
1     1      2
2     2      2
3     3      3


回答6:

A dplyr attempt.

df %>%
  #group_by(group)%>% if you have a third variable and you want to achieve the same results for each group

  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>% # add group variable for more layers
  summarise(cum_unique_entries = last(cum_unique_entries))