R - Replace values starting in selected column by

2019-08-03 09:06发布

问题:

I want to replace with zero the monthly values that are after a specific month by row. I have tried adapting Replace NA values in dataframe starting in varying columns without success. Given data:

df <- structure(list(Mth1 = c(1L, 3L, 4L, 1L, 2L), 
                      Mth2 = c(2L, 3L, 2L, 2L, 2L),
                      Mth3 = c(1L, 2L, 1L, 2L, 3L), 
                      Mth4 = c(3L, 1L, 3L, 4L, 2L),
                      ZeroMth = c(1L, 3L, 2L, 4L, 3L)),
                 .Names = c("Mth1", "Mth2", "Mth3", "Mth4", "ZeroMth"), class = "data.frame", 
                 row.names = c("1", "2", "3", "4", "5"))


> df
  Mth1 Mth2 Mth3 Mth4 ZeroMth
1    1    2    1    3       1
2    3    3    2    1       3
3    4    2    1    3       2
4    1    2    2    4       4
5    2    2    3    2       3

I would like to use the values in the ZeroMth column to specify the month where the replacements start. The desired output is:

> df1
  Mth1 Mth2 Mth3 Mth4
1    0    0    0    0
2    3    3    0    0 
3    4    0    0    0
4    1    2    2    0
5    2    2    0    0

回答1:

Use apply on each row (MARGIN = 1) and replace the values after the index specified in the last column to be zero

t(apply(X = df, MARGIN = 1, function(x)
    replace(x = x, list = x[NCOL(df)]:(NCOL(df)-1), values = 0)))
#  Mth1 Mth2 Mth3 Mth4 ZeroMth
#1    0    0    0    0       1
#2    3    3    0    0       3
#3    4    0    0    0       2
#4    1    2    2    0       4
#5    2    2    0    0       3


回答2:

You could also use lapply like this

setNames(data.frame(lapply(head(seq_along(df), -1), function(i) df[, i] * (i < df$ZeroMth))),
         head(names(df), -1))

which returns

  Mth1 Mth2 Mth3 Mth4
1    0    0    0    0
2    3    3    0    0
3    4    0    0    0
4    1    2    2    0
5    2    2    0    0

Here, you run through the locations of the month vectors and check if the element in the month is less than the designated zero month. If yes, the value is returned, otherwise it is 0. setNames is used to restore the variable names.


Some benchmarks

After testing, changing lapply to sapply results in more than a 2X speedup. The major slowdown is due to the conversion to data.frame.

This led me to check a bit further. Here are microbenchmark results.

microbenchmark(
db.mat=t(apply(X = df, MARGIN = 1, function(x)
         replace(x = x, list = x[NCOL(df)]:(NCOL(df)-1), values = 0))),
db.df=data.frame(t(apply(X = df, MARGIN = 1, function(x)
         replace(x = x, list = x[NCOL(df)]:(NCOL(df)-1), values = 0)))),
lmo.list=setNames(lapply(head(seq_along(df), -1),
                    function(i) df[, i] * (i < df$ZeroMth)),
                    head(names(df), -1)),
lmo.dfl=setNames(data.frame(lapply(head(seq_along(df), -1),
                         function(i) df[, i] * (i < df$ZeroMth))),
                 head(names(df), -1)),
lmo.dfs=setNames(data.frame(sapply(head(seq_along(df), -1),
                           function(i) df[, i] * (i < df$ZeroMth))),
                 head(names(df), -1)),
lmo.listAlt=setNames(lapply(head(seq_along(df), -1),
                    function(i) {temp <- df[, i]; temp[i < df$ZeroMth] <- 0; temp}),
                    head(names(df), -1)),
lmo.dflAlt=setNames(data.frame(lapply(head(seq_along(df), -1),
                         function(i) {temp <- df[, i]; temp[i < df$ZeroMth] <- 0; temp})),
                 head(names(df), -1)),
lmo.dfsAlt=setNames(data.frame(sapply(head(seq_along(df), -1),
                           function(i) {temp <- df[, i]; temp[i < df$ZeroMth] <- 0; temp})),
                 head(names(df), -1)))

Unit: microseconds
        expr     min       lq     mean   median      uq      max neval  cld
      df.mat 135.994 155.2380 161.2480 159.6570 166.785  196.436   100  b  
       db.df 225.231 236.9190 248.3295 246.0430 256.164  340.411   100   c 
    lmo.list  84.960  99.5005 105.8299 104.9175 110.905  156.806   100 a   
     lmo.dfl 439.057 459.1565 480.3425 476.5475 492.656  647.751   100    d
     lmo.dfs 173.057 187.3120 217.2876 195.8650 202.850 2257.151   100   c 
 lmo.listAlt  91.803 108.0535 114.6253 113.1860 118.602  185.602   100 ab  
  lmo.dflAlt 458.158 481.2520 521.6052 498.2155 516.462 2584.163   100    d
  lmo.dfsAlt 181.610 198.4310 221.5613 204.2755 212.686 1611.395   100   c 

Wow, lapply with data.frame is super slow.



回答3:

We can also make this compact by

(col(df[-5]) <df$ZeroMth[row(df[-5])])*df[-5]
#    Mth1 Mth2 Mth3 Mth4
#1    0    0    0    0
#2    3    3    0    0
#3    4    0    0    0
#4    1    2    2    0
#5    2    2    0    0


标签: r replace subset