A simplified version of my data set would look like:
depth value
1 a
1 b
2 a
2 b
2 b
3 c
I would like to make a new data set where, for each value of "depth", I would have the cumulative number of unique values, starting from the top. e.g.
depth cumsum
1 2
2 2
3 3
Any ideas as to how to do this? I am relatively new to R.
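For reference, the example above can be built reproducibly like this (assuming the values are plain character strings):

```r
# Example data as described in the question
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)
```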
I find this a perfect case for using factor and setting its levels carefully. I'll use data.table here with this idea. Make sure your "value" column is character (not an absolute requirement).
Step 1: Convert your data.frame to a data.table, keeping only the unique rows.
require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth") # just to be sure before factoring "value"
Step 2: Convert "value" to a factor and coerce it to numeric. Make sure to set the levels yourself (it is important).
dt[, id := as.numeric(factor(value, levels = unique(value)))]
Step 3: Set the key columns to "depth" and "id" for subsetting, and pick just the last value within each depth.
setkey(dt, "depth", "id")
dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
# depth id
# 1: 1 2
# 2: 2 2
# 3: 3 3
Step 4: Since each row's count should be at least that of the previous depth, use cummax to get the final output.
dt.out[, id := cummax(id)]
Edit: The above code was for illustrative purposes. In reality you don't need a third column at all; this is how I'd write the final code.
require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]
Here's a more tricky example and the output from the code:
df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6),
value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L),
.Label = c("a", "b", "c", "d", "f", "g"), class = "factor")),
.Names = c("depth", "value"), row.names = c(NA, -11L),
class = "data.frame")
# depth value
# 1: 1 2
# 2: 2 4
# 3: 3 4
# 4: 4 5
# 5: 5 6
# 6: 6 6
Here is another attempt, using factor coding plus cummax and aggregate:
numvals <- cummax(as.numeric(factor(mydf$value)))
aggregate(numvals, list(depth=mydf$depth), max)
Which gives:
depth x
1 1 2
2 2 2
3 3 3
It seems to work with @Arun's example, too:
depth x
1 1 2
2 2 4
3 3 4
4 4 5
5 5 6
6 6 6
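One caveat: as.numeric(factor(...)) numbers values alphabetically, not in order of first appearance, so cummax can overcount when a later-appearing value sorts earlier alphabetically. A minimal sketch where that assumption breaks (hypothetical data, not from the question):

```r
# "b" appears before "a"; alphabetical coding gives b = 2, a = 1
mydf <- data.frame(depth = c(1, 2), value = c("b", "a"),
                   stringsAsFactors = FALSE)
numvals <- cummax(as.numeric(factor(mydf$value)))  # c(2, 2)
aggregate(numvals, list(depth = mydf$depth), max)
# reports 2 at depth 1, though only "b" has appeared by then
```

It happens to work on the question's data and @Arun's example because the values there first appear in alphabetical order.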
This can be written relatively cleanly with a single SQL statement using the sqldf package. Assume DF is the original data frame:
library(sqldf)
sqldf("select b.depth, count(distinct a.value) as cumsum
from DF a join DF b
on a.depth <= b.depth
group by b.depth"
)
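For comparison, the same self-join idea can be sketched in base R with merge() (my rephrasing, not part of the sqldf answer): a cross join of DF with itself, filtered on depth, then a distinct count per depth.

```r
# Example data as in the question (assumed; "DF" as used above)
DF <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

# merge with by = NULL gives the Cartesian product with .x/.y suffixes;
# keep pairs where depth.x <= depth.y, then count distinct values per depth.y
pairs <- merge(DF, DF, by = NULL)
pairs <- pairs[pairs$depth.x <= pairs$depth.y, ]
aggregate(value.x ~ depth.y, data = pairs,
          FUN = function(v) length(unique(v)))
```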
A good first step would be to create a column of TRUE or FALSE, where it is TRUE for the first appearance of each value and FALSE for later appearances. This can be done easily using duplicated:
mydata$first.appearance = !duplicated(mydata$value)
Reshaping the data is best done using aggregate. In this case, it sums the first.appearance column within each subset of depth:
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
The result will look like:
depth first.appearance
1 1 2
2 2 0
3 3 1
This is still not a cumulative sum, though. For that you can use the cumsum function (and then drop the old column):
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL
So to recap:
mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL
Output:
depth cumsum
1 1 2
2 2 2
3 3 3
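The same three steps can also be collapsed into one line with tapply (equivalent, just more compact):

```r
# Question's data (assumed as "mydata" above)
mydata <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                     value = c("a", "b", "a", "b", "b", "c"),
                     stringsAsFactors = FALSE)

# first appearances summed per depth, then a running total
cumsum(tapply(!duplicated(mydata$value), mydata$depth, sum))
# named vector: depths 1, 2, 3 -> counts 2, 2, 3
```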
Here is another solution using lapply(). unique(df$depth) gives the vector of unique depth values; for each such value x, subset the entries of value where depth is less than or equal to x, and count the distinct ones with length(unique(...)). That count becomes the cumsum column, and depth=x records the corresponding depth level. Finally, do.call(rbind, ...) binds the pieces into one data frame.
do.call(rbind,lapply(unique(df$depth),
function(x)
data.frame(depth=x,cumsum=length(unique(df$value[df$depth<=x])))))
depth cumsum
1 1 2
2 2 2
3 3 3
A dplyr attempt:
df %>%
#group_by(group)%>% if you have a third variable and you want to achieve the same results for each group
mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
group_by(depth) %>% # add group variable for more layers
summarise(cum_unique_entries = last(cum_unique_entries))
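A self-contained run of this pipe on the question's data (df assumed as defined there) would look like:

```r
library(dplyr)

# Question's data
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

df %>%
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))
# depth 1 -> 2, depth 2 -> 2, depth 3 -> 3
```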