I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.
dir.rle <- rle(df$dir)
I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.
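ndx <- 1
for (i in seq_along(dir.rle$lengths)) {
  l <- dir.rle$lengths[i] - 1   # offset from the run's first row to its last row
  s <- ndx                      # first row of this run
  e <- ndx + l                  # last row of this run
  tmp[s:e, ]$cumval <- cumsum(df[s:e, ]$val)
  ndx <- e + 1                  # the next run starts right after this one
}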
The run lengths of dir define the start, s, and end, e, for each run. The above code works but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.
Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one.
This can be broken down into a two step problem. First, if we create an indexing column based off of the rle, then we can use that to group by and run the cumsum. The group by can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
Add a 'group' column to the data frame. Something like:
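A minimal sketch of such a group column, not necessarily the code originally posted here. The names df, dir and val follow the question; the sample values and the helper names grp_rle and grp_diff are made up for illustration. It builds the index once from the rle run lengths and once from a diff()/cumsum() counter, and the two should agree:

```r
# Hypothetical sample data; only the column names come from the question.
df <- data.frame(dir = c(1, 1, 0, 0, 0, -1, 1, 1),
                 val = c(2, 3, 1, 4, 5, 7, 6, 8))

# Option 1: repeat each run's index by that run's length.
dir.rle <- rle(df$dir)
grp_rle <- rep(seq_along(dir.rle$lengths), times = dir.rle$lengths)

# Option 2: bump a counter wherever diff() sees the value change.
# The leading 1 gives the first row its own starting group.
grp_diff <- cumsum(c(1, diff(df$dir) != 0))

# Either vector can serve as the grouping column.
df$group <- grp_rle
```

Both options give each run of identical dir values its own id, so whichever reads more naturally can be used.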
then use tapply to sum within groups:
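A rough sketch of that step, assuming the goal is the per-run cumulative sum from the question rather than a single total per group. The sample values and the names runs and cumval_ave are illustrative assumptions; ave() is shown as a base R alternative that was not necessarily part of the original answer:

```r
# Hypothetical sample data; only the column names come from the question.
df <- data.frame(dir = c(1, 1, 0, 0, 0, -1, 1, 1),
                 val = c(2, 3, 1, 4, 5, 7, 6, 8))

# Group index: one id per run of identical dir values.
runs <- rle(df$dir)
df$group <- rep(seq_along(runs$lengths), times = runs$lengths)

# tapply() returns one cumsum vector per group. Because the groups are
# contiguous and already in row order, unlist() lines them up with the rows.
df$cumval <- unlist(tapply(df$val, df$group, cumsum), use.names = FALSE)

# ave() does the same split/apply/recombine in one call and keeps row order
# even when the groups are not contiguous.
cumval_ave <- ave(df$val, df$group, FUN = cumsum)
```

The unlist(tapply(...)) form relies on the groups being contiguous, which holds for run-based ids; ave() avoids that assumption.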