As a follow-up to this oldie but goodie: efficient row-wise operations on a data.table
I have some data that (unfortunately) look like:
library('data.table')
set.seed(1234)
m <- 5
n <- 7
rb <- function() runif(m,1000,2000) * rbinom(m,1,0.5)
series_col_nms <- paste0('YearNo',1:n)
rev <- data.table(cust_id = paste0('CustNo',1:m),
other_stuff = sample(letters,m, replace=TRUE))
for(col in series_col_nms){
set(rev, j=col, value=rb())
}
setkey(rev, cust_id)
One row per customer with various columns including yearly revenue for year 1, 2, ...
I would like to get the year indices for the first and last years with any revenue from each customer.
I can produce the desired result, but with a somewhat hacky join:
years_active <- rev[, which(.SD>0), .SDcols = series_col_nms,
keyby=cust_id][, .(min_year_active = min(V1),
max_year_active = max(V1)), keyby=cust_id]
years_active[rev]
These attempts to get the minimum index fail:
rev[, apply(.SD, 1, function(x) min(which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns data type error
rev[, do.call(pmin, lapply(.SD, function(x) which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns empty
What is the data.table
way to do this?
Reshaping I would store the data in long format:
You lose customer 5 with the
revenue > 0
condition, but I doubt that matters.Then collect aggregate statistics as needed:
Parsing out the numbers from the strings you've been using is simple, of course.
max.col I think this is a bad kludge but...
You'd have to go back and fill in the special case of all zeros (customer 5) separately.
When you want to operate on many columns by row, the common practice it to
melt
your data set first and then operate on a single column.In your case, a relatively straightforward solution could be something like
A cleaner version but with warnings could be