Referring to the question "substr in dplyr %>% mutate" and to @akrun's answer there, why do the two created columns give the same result in this case?
df <- data_frame(t = '1234567890ABCDEFG', a = 1:5, b = 6:10)
df %>% mutate(u = substr(t, a, a + b), v = substring(t, a, a + b))
I can't see how this differs from the situation in the original question.
Thank you!
The difference is in the vectorization:
substr("1234567890ABCDEFG", df$a, df$a+df$b)
#[1] "1234567"
substring("1234567890ABCDEFG", df$a, df$a+df$b)
#[1] "1234567" "23456789" "34567890A" "4567890ABC" "567890ABCDE"
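Here substr returns only a single value, while substring returns a vector of length equal to the number of rows in the dataset 'df'. This is because the result of substr always has the same length as x, whereas the result of substring is as long as the longest of its arguments. As substr gives only a single value, that value gets recycled across all rows in the mutate. However, if we pass multiple values as x, i.e.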
substr(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
#[1] "1234567" "23456789" "34567890A" "4567890ABC" "567890ABCDE"
substring(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
#[1] "1234567" "23456789" "34567890A" "4567890ABC" "567890ABCDE"
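then the output is the same. The OP's example gives the above output because the column t, passed as x to substr, has the same length as start and stop. We can replicate the first output (a single recycled value from substr) inside mutate by passing the string literal instead of the column: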
df %>%
  mutate(u = substr("1234567890ABCDEFG", a, a+b),
         v = substring("1234567890ABCDEFG", a, a+b))
#                  t     a     b       u           v
#              (chr) (int) (int)   (chr)       (chr)
#1 1234567890ABCDEFG     1     6 1234567     1234567
#2 1234567890ABCDEFG     2     7 1234567    23456789
#3 1234567890ABCDEFG     3     8 1234567   34567890A
#4 1234567890ABCDEFG     4     9 1234567  4567890ABC
#5 1234567890ABCDEFG     5    10 1234567 567890ABCDE
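The length rule behind all of this can be checked directly in base R, without dplyr. The following is only a minimal sketch: the variable names x, a, b, u and v are chosen here for illustration and simply mirror the values used above.

```r
# Same values as in the question, kept as plain vectors
x <- "1234567890ABCDEFG"
a <- 1:5
b <- 6:10

# substr(): the result has the length of x, so only the first
# start/stop pair is used when x is a single string
u <- substr(x, a, a + b)
length(u)
# [1] 1

# substring(): the result has the length of the longest argument
v <- substring(x, a, a + b)
length(v)
# [1] 5

# Repeating x to match start/stop makes substr agree with substring
identical(substr(rep(x, length(a)), a, a + b), v)
# [1] TRUE
```

Inside mutate, a column such as t already has one element per row, which corresponds to the rep() case here, so both functions return the same values.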