I am new to data science and I am working on a model whose data looks like the sample data shown below. However, the original data contains many more id_num values and Events. My objective is to predict the next 3 events of each id_num based on its previous Events.
Please help me solve this, or suggest a method I could use to solve it, in R.
The simplest "prediction" is to assume that the sequence of letters will repeat for each id_num. I hope this is in line with what the OP understands by "prediction".
The code
library(data.table)
DT[, .(Events = append(Events, head(rep(Events, 3L), 3L))), by = id_num]
creates
id_num Events
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 1 E
6: 1 A
7: 1 B
8: 1 C
9: 2 B
10: 2 E
11: 2 B
12: 2 E
13: 2 B
14: 3 E
15: 3 A
16: 3 E
17: 3 A
18: 3 E
19: 3 A
20: 3 E
21: 4 C
22: 4 C
23: 4 C
24: 4 C
25: 5 F
26: 5 G
27: 5 F
28: 5 G
29: 5 F
id_num Events
data.table is used here because of its easy-to-use grouping and because I'm acquainted with it.
Explanation
For each id_num, the existing sequence of letters is replicated 3 times using rep() to ensure there are enough values to supply the next 3. Only the first 3 values of the replicated sequence are then taken using head(). These 3 values are appended to the existing sequence for each id_num.
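To make the three steps concrete, here is a minimal base R sketch for a single group, using the two events of id_num 2 from the sample data. The variable names (events, repeated, next3, extended) are only illustrative and do not appear in the code above.

```r
# Events of one group (id_num 2 in the sample data)
events <- c("B", "E")

# Step 1: replicate the sequence 3 times so at least 3 values are available
repeated <- rep(events, 3L)       # "B" "E" "B" "E" "B" "E"

# Step 2: keep only the first 3 values as the "prediction"
next3 <- head(repeated, 3L)       # "B" "E" "B"

# Step 3: append the predicted values to the existing sequence
extended <- append(events, next3)
extended                          # "B" "E" "B" "E" "B"
```

The data.table call does exactly this once per id_num through by = id_num.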
Some tuning
There are two possible optimisations:
- If the sequence of values is much longer than the number of values to predict, n_pred, simply repeating the long sequence n_pred times is a waste.
- The call to append() can be avoided if the existing sequence is repeated one more time.
So, the optimised code looks like:
n_pred <- 3L
DT[, .(Events = head(rep(Events, 1L + ceiling(n_pred / .N)), .N + n_pred)), by = id_num]
Note that .N is a special symbol in data.table syntax containing the number of rows in a group. head() now returns the original sequence plus the predicted values.
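The repetition count can be checked on a single vector without data.table. The sketch below wraps the same expression in a hypothetical helper, extend_cycle(), where length(events) plays the role of .N. As a cross-check it compares the result with base R's rep_len(), which recycles a vector to a given length and might be a simpler alternative here.

```r
# Hypothetical helper (not part of data.table): extend a vector cyclically
extend_cycle <- function(events, n_pred) {
  n <- length(events)                 # stands in for .N
  # one copy for the original sequence plus enough copies to cover n_pred
  n_rep <- 1L + ceiling(n_pred / n)
  head(rep(events, n_rep), n + n_pred)
}

events <- LETTERS[1:5]

# n = 5, n_pred = 3: n_rep = 1 + ceiling(3 / 5) = 2, so 10 values are
# generated and head() keeps the first 5 + 3 = 8 of them
extend_cycle(events, 3L)              # "A" "B" "C" "D" "E" "A" "B" "C"

# Cross-check against rep_len(), which recycles to the target length directly
identical(extend_cycle(events, 3L), rep_len(events, length(events) + 3L))
```

The helper assumes a non-empty vector, which always holds for a group produced by by = id_num.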
Data
DT <- data.table(
id_num = c(rep(1L, 5L), 2L, 2L, rep(3L, 4L), 4L, 5L, 5L),
Events = c(LETTERS[1:5], "B", "E", rep(c("E", "A"), 2L), "C", "F", "G")
)
DT
id_num Events
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 1 E
6: 2 B
7: 2 E
8: 3 E
9: 3 A
10: 3 E
11: 3 A
12: 4 C
13: 5 F
14: 5 G