I have a dataframe with a column of strings and want to extract substrings of those into a new column.
Here is some sample code and data showing I want to take the string after the final underscore character in the id
column in order to create a new_id
column.
The id
column entry always has 2 underscore characters and it's always the final substring I would like.
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )
require(dplyr)
df = df %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])
I was expecting strsplit to act on each row in turn.
However, the new_id
column only contains ABC
in each row, whereas I would like ABC
in row 1 and NHYK
in row 2. Do you know why this fails and how to achieve what I want?
You could use stringr::str_extract
:
library(stringr)
df %>%
dplyr::mutate(new_id = str_extract(id, "[^_]+$"))
#> id x new_id
#> 1 abcd_123_ABC 1 ABC
#> 2 abc_5234_NHYK 2 NHYK
The regex says, match one or more (+
) of the characters that aren't _
(the negating [^ ]
), followed by end of string ($
).
An alternative without regex and keeping in the tidyverse
style is to use tidyr::separate()
. Note, this does remove the input column by default (remove=FALSE
to prevent it).
## using your example data
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )
## separate knowing you will have three components
df %>% separate(id, c("first", "second", "new_id"), sep = "_") %>% select(-first, -second)
## returns
new_id x
1 ABC 1
2 NHYK 2
Use dplyr::rowwise
:
df %>% dplyr::rowwise() %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])
Further alternatives are discussed here:
http://www.expressivecode.org/2014/12/17/mutating-using-functions-in-dplyr/
Here's one way to use strsplit
in a general way to do what you're looking for.
library(dplyr)
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )
temp <- seq(from=3, by=3, length.out = length(df))
dfn <- df %>% dplyr::mutate(new_id = unlist(strsplit(id, split="_"))[temp])
> dfn
id x new_id
1 abcd_123_ABC 1 ABC
2 abc_5234_NHYK 2 NHYK