tracking a cohort over time in R

I have a sample dataset of user ids and months in which a transaction was made. My goal is to calculate, month over month, how many of the original users made transactions. In other words, how many users that were new in January also made transactions in February, March, and April. How many users that were new in February made transactions in March and April, and so on.

> data
       date user_id
1  Jan 2017       1
2  Jan 2017       2
3  Jan 2017       3
4  Jan 2017       4
5  Jan 2017       5
6  Feb 2017       1
7  Feb 2017       3
8  Feb 2017       5
9  Feb 2017       7
10 Feb 2017       9
11 Mar 2017       2
12 Mar 2017       4
13 Mar 2017       6
14 Mar 2017       8
15 Mar 2017      10
16 Apr 2017       1
17 Apr 2017       3
18 Apr 2017       6
19 Apr 2017       9
20 Apr 2017      12

The output of this dataset would look something like this:

> output
    Jan Feb Mar Apr
Jan   5   3   2   2
Feb  NA   2   0   1
Mar  NA  NA   3   1
Apr  NA  NA  NA   1

So far the only way I can think of doing this is to split the dataset and then calculate the unique ids for each month that are not present in the previous months, but this method is verbose and is not suited for a large dataset with many months.

subsets <-split(data, data$date, drop=TRUE)

for (i in 1:length(subsets)) {
  assign(paste0("M", i), as.data.frame(subsets[[i]]))
}

M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)


M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))

Is there a way in R to come up with the above output with a shorter method using dplyr or even base R? The real data set has many years and months.

The format of the data is as follows:

> sapply(data, class)
     date   user_id 
"yearmon" "integer"

And the sample data:

> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")

标签： r dplyr time-series zoo

2条回答

够拽才男人

2楼-- · 2020-07-22 17:02

This is possible within the Tidyverse functions, too:

library(tidyverse)
library(lubridate)

transactions <- tibble(
  month=ymd(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01", "2017-03-01")),
  user_id=c(1, 2, 1, 3, 3)
)
#  Jan  1
#  Jan  2
#  Feb  1
#  Feb  3
#  Mar  1

# mark the cohort of the users
users <- transactions %>%
  arrange(month, user_id) %>%
  group_by(user_id) %>%
  top_n(-1, month) %>%
  # date of the first transaction
  rename(cohort = month)
users

transactions %>%
  group_by(month, user_id) %>%
  distinct() %>%
  left_join(users, by = 'user_id') %>%
  xtabs(~ cohort + month, data = .)
#            month
# cohort     2017-01-01 2017-02-01 2017-03-01
# 2017-01-01          2          1          0
# 2017-02-01          0          1          1

0人赞添加讨论(0) 举报

唯我独甜

3楼-- · 2020-07-22 17:17

Here's an example:

library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
data <- data[c(1,1:nrow(data)),]
setDT(data)
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date))
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1

m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon") 
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1

0人赞添加讨论(0) 举报

tracking a cohort over time in R

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间