dplyr: vectorisation of substr

Posted 2020-04-16 03:02

Question:

Referring to the question "substr in dplyr %>% mutate" and to @akrun's answer: why do the two created columns give the same answer?

library(dplyr)
# tibble() replaces the deprecated data_frame(); the length-1 t is recycled to 5 rows
df <- tibble(t = '1234567890ABCDEFG', a = 1:5, b = 6:10)
df %>% mutate(u = substr(t, a, a + b), v = substring(t, a, a + b))

I can't grasp the difference with the situation in the original question. Thank you!

Answer 1:

The difference lies in how the two functions vectorize over their arguments.

substr("1234567890ABCDEFG", df$a, df$a+df$b)
#[1] "1234567"
substring("1234567890ABCDEFG", df$a, df$a+df$b)
#[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"

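substr returns a vector of the same length as x, so with a single string it returns a single value and uses only the first elements of start and stop, while substring returns a vector as long as its longest argument, here the number of rows in the dataset 'df'. Because substr produces only a single value, that value gets recycled in the mutate. However, if x itself has multiple values, i.e.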

substr(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
#[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"
substring(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
#[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"

Then the output is the same. In the OP's example, substr gives the above output because its x (the column t) has the same length as start and stop. We can replicate the first output with

 df %>%
     mutate(u = substr("1234567890ABCDEFG", a, a+b),
            v = substring("1234567890ABCDEFG", a, a+b)) 
#                 t     a     b       u           v
#              (chr) (int) (int)   (chr)       (chr)
#1 1234567890ABCDEFG     1     6 1234567     1234567
#2 1234567890ABCDEFG     2     7 1234567    23456789
#3 1234567890ABCDEFG     3     8 1234567   34567890A
#4 1234567890ABCDEFG     4     9 1234567  4567890ABC
#5 1234567890ABCDEFG     5    10 1234567 567890ABCDE
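
The length rule described above can be checked directly in base R, without dplyr. The following is only a minimal sketch; the variable names (x, start, stop_, res_substr, res_substring, res_rep) are made up for illustration and are not from the original answer.

```r
# Base R only; no packages needed.
x     <- "1234567890ABCDEFG"
start <- 1:5
stop_ <- start + 6:10   # same values as df$a + df$b above

# substr(): the result has the same length as x, so a length-1 x
# yields one value and only the first start/stop pair is used.
res_substr <- substr(x, start, stop_)
length(res_substr)

# substring(): the result is as long as the longest argument.
res_substring <- substring(x, start, stop_)
length(res_substring)

# Repeating x to the length of start/stop makes substr() work element-wise.
res_rep <- substr(rep(x, length(start)), start, stop_)
identical(res_rep, res_substring)
```

Inside mutate, a column such as t already has one element per row, which is why substr and substring agree in the question's example.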


Tags: r dplyr