Let's say I start with a dataset like this (it's from Gallup). I want to pull the year and date out of the dataset and into a new column. So I try to split the date string...
index date R D
1 2018 Jan 2-7 35 50
2 2017 Dec 4-11 41 45
3 2017 Nov 2-8 39 46
4 2017 Oct 5-11 39 46
5 2017 Sep 6-10 45 47
6 2017 Aug 2-6 43 46
.. using mutate
dataset <- data %>%
mutate(Y = strsplit(date, split = " ")[[1]][1]) %>%
mutate(M = strsplit(date, split = " ")[[1]][2])
But strsplit, rather than operate on the date row, seems to operate on a list of all column values.
So I end up with the [[1]] subset accessor just grabbing the first row value, rather than a the list entry relevant to each row.
index date R D Y M
1 2018 Jan 2-7 35 3 2018 Jan
2 2017 Dec 4-11 41 3 2018 Jan
3 2017 Nov 2-8 39 3 2018 Jan
4 2017 Oct 5-11 39 3 2018 Jan
5 2017 Sep 6-10 45 3 2018 Jan
6 2017 Aug 2-6 43 3 2018 Jan
How can I split the string so an extrapolate the value from the list for each row? Using index as a subset accessor doesn't work.
I would recommend using the package stringr
, which is part of the tidyverse, and thus works seamlessly with dplyr.
data %>% mutate(Y = str_extract(date, "^\\d{4}"),
M = str_extract(date, "[A-Za-z]{3}"))
# index date R D Y M
# 1 1 2018 Jan 2-7 35 50 2018 Jan
# 2 2 2017 Dec 4-11 41 45 2017 Dec
# 3 3 2017 Nov 2-8 39 46 2017 Nov
# 4 4 2017 Oct 5-11 39 46 2017 Oct
# 5 5 2017 Sep 6-10 45 47 2017 Sep
# 6 6 2017 Aug 2-6 43 46 2017 Aug
str_extract
allows you to extract substrings based on a pattern -- here, we use two different regular expressions. The first matches 4 consecutive digits (\\d{4}
) at the start of the string (^
). The second expression simply takes 3 consecutive letters ([A-Za-z]
), which is safe given the structure of your dates.
If you'd still like to use strsplit
with mutate
, however, you can add a call to rowwise
:
data %>% rowwise() %>% mutate(Y = strsplit(date, split = " ")[[1]][1],
M = strsplit(date, split = " ")[[1]][2])
We can use the extract
function from the tidyr with capturing groups.
library(tidyr)
dat2 <- dat %>%
extract(date, into = c("Y", "M"), regex = "(\\d{4}) ([A-Za-z]{3})", remove = FALSE)
dat2
# index date Y M R D
# 1 1 2018 Jan 2-7 2018 Jan 35 50
# 2 2 2017 Dec 4-11 2017 Dec 41 45
# 3 3 2017 Nov 2-8 2017 Nov 39 46
# 4 4 2017 Oct 5-11 2017 Oct 39 46
# 5 5 2017 Sep 6-10 2017 Sep 45 47
# 6 6 2017 Aug 2-6 2017 Aug 43 46
DATA
dat <- read.table(text = "index date R D
1 '2018 Jan 2-7' 35 50
2 '2017 Dec 4-11' 41 45
3 '2017 Nov 2-8' 39 46
4 '2017 Oct 5-11' 39 46
5 '2017 Sep 6-10' 45 47
6 '2017 Aug 2-6' 43 46",
header = TRUE, stringsAsFactors = FALSE)