I have a sample dataset of user ids and months in which a transaction was made. My goal is to calculate, month over month, how many of the original users made transactions. In other words, how many users that were new in January also made transactions in February, March, and April. How many users that were new in February made transactions in March and April, and so on.
> data
date user_id
1 Jan 2017 1
2 Jan 2017 2
3 Jan 2017 3
4 Jan 2017 4
5 Jan 2017 5
6 Feb 2017 1
7 Feb 2017 3
8 Feb 2017 5
9 Feb 2017 7
10 Feb 2017 9
11 Mar 2017 2
12 Mar 2017 4
13 Mar 2017 6
14 Mar 2017 8
15 Mar 2017 10
16 Apr 2017 1
17 Apr 2017 3
18 Apr 2017 6
19 Apr 2017 9
20 Apr 2017 12
The output of this dataset would look something like this:
> output
Jan Feb Mar Apr
Jan 5 3 2 2
Feb NA 2 0 1
Mar NA NA 3 1
Apr NA NA NA 1
So far the only way I can think of doing this is to split the dataset and then calculate the unique ids for each month that are not present in the previous months, but this method is verbose and is not suited for a large dataset with many months.
subsets <-split(data, data$date, drop=TRUE)
for (i in 1:length(subsets)) {
assign(paste0("M", i), as.data.frame(subsets[[i]]))
}
M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)
M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))
Is there a way in R to come up with the above output with a shorter method using dplyr
or even base R? The real data set has many years and months.
The format of the data is as follows:
> sapply(data, class)
date user_id
"yearmon" "integer"
And the sample data:
> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
This is possible within the Tidyverse functions, too:
Here's an example: