I have to adjust a code which works perfectly with a different data.frame but with similar conditions.
Here an example of my data.frame:
df <- read.table(text = 'ID Day Count
33012 9526 4
35004 9526 4
37006 9526 4
37008 9526 4
21009 1913 3
24005 1913 3
25009 1913 3
22317 2286 2
37612 2286 2
25009 14329 1
48007 9527 0
88662 9528 0
1845 9528 0
8872 2287 0
49002 1914 0
1664 1915 0', header = TRUE)
I need to add a new column (new_col
) to my data.frame which contains values from 1 to 4. These new_col
values have to include, each one, day (x) day (x +1) and day (x +2), where x = 9526, 1913, 2286, 14329 (column Day
).
My output should be the following:
ID Day Count new_col
33012 9526 4 1
35004 9526 4 1
37006 9526 4 1
37008 9526 4 1
21009 1913 3 2
24005 1913 3 2
25009 1913 3 2
22317 2286 2 3
37612 2286 2 3
25009 14329 1 4
48007 9527 0 1
88662 9528 0 1
1845 9528 0 1
8872 2287 0 3
49002 1914 0 2
1664 1915 0 2
The data.frame ordered by new_col
will be then:
ID Day Count new_col
33012 9526 4 1
35004 9526 4 1
37006 9526 4 1
37008 9526 4 1
48007 9527 0 1
88662 9528 0 1
1845 9528 0 1
21009 1913 3 2
24005 1913 3 2
25009 1913 3 2
49002 1914 0 2
1664 1915 0 2
22317 2286 2 3
37612 2286 2 3
8872 2287 0 3
25009 14329 1 4
My real data.frame is more complex than the example (i.e. more columns and more values in the Count
column).
The code that @mrbrick suggested me in my previous question (Add column to dataframe depending on specific row values) is the following:
x <- c(1913, 2286, 9526, 14329)
df$new_col <- cut(df$Day, c(-Inf, x, Inf))
df$new_col <- as.numeric(factor(df$new_col, levels=unique(df$new_col)))
But it works only with day x, day x -1 and day x -2.
Any suggestion will be really helpful.
Assuming that the Day
values in the different sequential groups are such that dropping the last two digits of Day
identifies each group convert what is left to a factor with sequence numbers as labels. No packages are used.
g <- df$Day %/% 100
u <- unique(g)
transform(df, new_col = factor(g, levels = u, labels = seq_along(u)))
giving:
ID Day Count new_col
1 33012 9526 4 1
2 35004 9526 4 1
3 37006 9526 4 1
4 37008 9526 4 1
5 21009 1913 3 2
6 24005 1913 3 2
7 25009 1913 3 2
8 22317 2286 2 3
9 37612 2286 2 3
10 25009 14329 1 4
11 48007 9527 0 1
12 88662 9528 0 1
13 1845 9528 0 1
14 8872 2287 0 3
15 49002 1914 0 2
16 1664 1915 0 2
Another possibility is to replace the g <- ...
line with one of the following:
(a) known number of groups use kmeans
with the the appropriate number of clusters:
g <- kmeans(df$Day, 4)$cluster
(b) manually set or manually set centers and use that to initiate kmeans
:
centers <- c(1913, 2286, 9526, 14329) + 1
g <- kmeans(df$day, centers)$cluster
(c) check x-1 and x-2 or derive centers
like this. If for a day x
there is no x-1
or x-2
then x
must be the first in the sequence so we pick out such values and add 1 to get the centers. Unlike (a) which requires that we know the number of clusters and (b) which requires that we know the actual sequences this one does not require that these be known.
centers <- with(df, unique(Day[ ! ((Day-1) %in% Day) & ! ((Day-2) %in% Day) ]) + 1)
g <- kmeans(df$Day, centers)$cluster
(d) simplication of last point or if we are guarantted that if x
is the first in the sequence then x, x+1 and x+2 all appear then we can be sure that x
is the first in the sequence if there is a no x-1
so we can simplify (c) to:
# assumes x, x+1, x+2 all appear for each sequence
centers <- with(df, unique(Day[ ! (Day-1) %in% Day ]) + 1)
g <- kmeans(df$Day, centers)$cluster
The kmeans
solutions should work if the groups are sufficiently separated and based on the data shown in the question it seems that they are.
Using base R, you could create a data.frame with ID columns, the day you want (x
, x+1
, x+2
) and with the new_col
you want, then merge this data.frame with your original one.
That works if you know in advance all the x
Day you have.
df <- read.table(text = 'ID Day Count
33012 9526 4
35004 9526 4
37006 9526 4
37008 9526 4
21009 1913 3
24005 1913 3
25009 1913 3
22317 2286 2
37612 2286 2
25009 14329 1
48007 9527 0
88662 9528 0
1845 9528 0
8872 2287 0
49002 1914 0
1664 1915 0', header = TRUE)
# identify the day you want (x variable in your example)
x <- c(9526, 1913, 2286, 14329)
# create new_col for each x as you wish, and repeat for x + i, then rbind the results data.frame
new_col_df <- do.call(rbind,
lapply(seq(0, 2, by = 1),
function(add) data.frame(x = x + add, new_col = seq_along(x))
)
)
# merge with the original df
output_df <-merge(df, new_col_df, by.x = "Day", by.y = "x")
# ordered output is
output_df[order(output_df$new_col),]
#> Day ID Count new_col
#> 9 9526 33012 4 1
#> 10 9526 35004 4 1
#> 11 9526 37006 4 1
#> 12 9526 37008 4 1
#> 13 9527 48007 0 1
#> 14 9528 88662 0 1
#> 15 9528 1845 0 1
#> 1 1913 21009 3 2
#> 2 1913 24005 3 2
#> 3 1913 25009 3 2
#> 4 1914 49002 0 2
#> 5 1915 1664 0 2
#> 6 2286 22317 2 3
#> 7 2286 37612 2 3
#> 8 2287 8872 0 3
#> 16 14329 25009 1 4