Add column to dataframe depending on specific row

2019-07-25 00:38发布

问题:

I have to adjust a code which works perfectly with a different data.frame but with similar conditions.

Here an example of my data.frame:

df <- read.table(text = 'ID    Day Count
    33012   9526    4
    35004   9526    4
    37006   9526    4
    37008   9526    4
    21009   1913    3
    24005   1913    3
    25009   1913    3
    22317   2286    2
    37612   2286    2
    25009   14329   1
    48007   9527    0
    88662   9528    0
    1845    9528    0
    8872    2287    0
    49002   1914    0
    1664    1915    0', header = TRUE)

I need to add a new column (new_col) to my data.frame which contains values from 1 to 4. These new_col values have to include, each one, day (x) day (x +1) and day (x +2), where x = 9526, 1913, 2286, 14329 (column Day).

My output should be the following:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
22317   2286    2     3
37612   2286    2     3
25009   14329   1     4
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
8872    2287    0     3
49002   1914    0     2
1664    1915    0     2

The data.frame ordered by new_col will be then:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
49002   1914    0     2
1664    1915    0     2
22317   2286    2     3
37612   2286    2     3
8872    2287    0     3
25009   14329   1     4

My real data.frame is more complex than the example (i.e. more columns and more values in the Count column).

The code that @mrbrick suggested me in my previous question (Add column to dataframe depending on specific row values) is the following:

x <- c(1913, 2286, 9526, 14329) 
df$new_col <- cut(df$Day, c(-Inf, x, Inf))
df$new_col <- as.numeric(factor(df$new_col, levels=unique(df$new_col)))

But it works only with day x, day x -1 and day x -2.

Any suggestion will be really helpful.

回答1:

Assuming that the Day values in the different sequential groups are such that dropping the last two digits of Day identifies each group convert what is left to a factor with sequence numbers as labels. No packages are used.

 g <- df$Day %/% 100
 u <- unique(g)
 transform(df, new_col = factor(g, levels = u, labels = seq_along(u)))

giving:

      ID   Day Count new_col
1  33012  9526     4       1
2  35004  9526     4       1
3  37006  9526     4       1
4  37008  9526     4       1
5  21009  1913     3       2
6  24005  1913     3       2
7  25009  1913     3       2
8  22317  2286     2       3
9  37612  2286     2       3
10 25009 14329     1       4
11 48007  9527     0       1
12 88662  9528     0       1
13  1845  9528     0       1
14  8872  2287     0       3
15 49002  1914     0       2
16  1664  1915     0       2

Another possibility is to replace the g <- ... line with one of the following:

(a) known number of groups use kmeans with the the appropriate number of clusters:

g <- kmeans(df$Day, 4)$cluster

(b) manually set or manually set centers and use that to initiate kmeans:

centers <-  c(1913, 2286, 9526, 14329) + 1
g <- kmeans(df$day, centers)$cluster

(c) check x-1 and x-2 or derive centers like this. If for a day x there is no x-1 or x-2 then x must be the first in the sequence so we pick out such values and add 1 to get the centers. Unlike (a) which requires that we know the number of clusters and (b) which requires that we know the actual sequences this one does not require that these be known.

centers <- with(df, unique(Day[ ! ((Day-1) %in% Day) & ! ((Day-2) %in% Day) ]) + 1)
g <- kmeans(df$Day, centers)$cluster

(d) simplication of last point or if we are guarantted that if x is the first in the sequence then x, x+1 and x+2 all appear then we can be sure that x is the first in the sequence if there is a no x-1 so we can simplify (c) to:

# assumes x, x+1, x+2 all appear for each sequence
centers <- with(df, unique(Day[ ! (Day-1) %in% Day ]) + 1)
g <- kmeans(df$Day, centers)$cluster

The kmeans solutions should work if the groups are sufficiently separated and based on the data shown in the question it seems that they are.



回答2:

Using base R, you could create a data.frame with ID columns, the day you want (x, x+1, x+2) and with the new_col you want, then merge this data.frame with your original one.

That works if you know in advance all the x Day you have.

df <- read.table(text = 'ID    Day Count
    33012   9526    4
                 35004   9526    4
                 37006   9526    4
                 37008   9526    4
                 21009   1913    3
                 24005   1913    3
                 25009   1913    3
                 22317   2286    2
                 37612   2286    2
                 25009   14329   1
                 48007   9527    0
                 88662   9528    0
                 1845    9528    0
                 8872    2287    0
                 49002   1914    0
                 1664    1915    0', header = TRUE)
# identify the day you want (x variable in your example)
x <- c(9526, 1913, 2286, 14329)
# create new_col for each x as you wish, and repeat for x + i, then rbind the results data.frame
new_col_df <- do.call(rbind, 
                      lapply(seq(0, 2, by = 1), 
                             function(add) data.frame(x = x + add, new_col = seq_along(x))
                             )
                      )
# merge with the original df
output_df <-merge(df, new_col_df, by.x = "Day", by.y = "x")
# ordered output is
output_df[order(output_df$new_col),]
#>      Day    ID Count new_col
#> 9   9526 33012     4       1
#> 10  9526 35004     4       1
#> 11  9526 37006     4       1
#> 12  9526 37008     4       1
#> 13  9527 48007     0       1
#> 14  9528 88662     0       1
#> 15  9528  1845     0       1
#> 1   1913 21009     3       2
#> 2   1913 24005     3       2
#> 3   1913 25009     3       2
#> 4   1914 49002     0       2
#> 5   1915  1664     0       2
#> 6   2286 22317     2       3
#> 7   2286 37612     2       3
#> 8   2287  8872     0       3
#> 16 14329 25009     1       4


标签: r dataframe add