r dplyr ends_with multiple string matches

Posted 2020-06-23 04:02

Question:

Can I use dplyr::select() with ends_with() to select column names that match any of several conditions? Given my column names, I want to use ends_with() rather than contains() or matches(), because the strings I want to select on occur at the end of the column names I want, but may also appear in the middle of others. For instance,

df <- data.frame(a10 = 1:4,
                 a11 = 5:8,
                 a20 = 1:4,
                 a12 = 5:8)

I want to select the columns that end with 1 or 2, so that only a11 and a12 remain. Is select(ends_with()) the best way to do this?

Thanks!

Answer 1:

You can also do this using regular expressions. I know you did not want to use matches() initially, but it works quite well if you anchor each ending with the "end of string" symbol $ and separate the alternatives with |.

library(dplyr)

df <- data.frame(a10 = 1:4,
                 a11 = 5:8,
                 a20 = 1:4,
                 a12 = 5:8)

df %>% select(matches('1$|2$'))
  a11 a12
1   5   5
2   6   6
3   7   7
4   8   8
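
To see why the $ anchor matters, the patterns can be checked against the column names directly with base R's grepl(). This is only an illustrative sketch: the variable names (cols, unanchored, anchored, char_class) are made up here, and the character class '[12]$' is an equivalent spelling that is not part of the original answer.

```r
# Column names from the question
cols <- c("a10", "a11", "a20", "a12")

# Without the anchor, "1" and "2" match anywhere in the name
unanchored <- grepl("1|2", cols)

# With the anchor, only names that end in 1 or 2 match
anchored <- grepl("1$|2$", cols)

# Assumption: a character class is an equivalent, shorter spelling
char_class <- grepl("[12]$", cols)

cols[unanchored]
#> [1] "a10" "a11" "a20" "a12"
cols[anchored]
#> [1] "a11" "a12"
identical(anchored, char_class)
#> [1] TRUE
```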

If you have a longer list of endings, build the pattern with paste0(): append '$' to each ending and join them with collapse = '|'.

dff <- data.frame(a11 = 1:3,
                  a12 = 2:4,
                  a13 = 3:5,
                  a16 = 5:7,
                  my_cat = LETTERS[1:3],
                  my_dog = LETTERS[5:7],
                  my_snake = LETTERS[9:11])

my_cols <- paste0(c(1,2,6,'dog','cat'), 
                  '$', 
                  collapse = '|')

dff %>% select(matches(my_cols))

  a11 a12 a16 my_cat my_dog
1   1   2   5      A      E
2   2   3   6      B      F
3   3   4   7      C      G
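
It can help to print the pattern that paste0() builds before passing it to matches(). The sketch below does that and checks the pattern against the column names with base R's grepl(); the names endings, col_names and hits are illustrative and not from the original answer.

```r
# Endings to match; the numbers are coerced to character by c()
endings <- c(1, 2, 6, "dog", "cat")

# Append the end-of-string anchor to each ending, then join with "|"
my_cols <- paste0(endings, "$", collapse = "|")
my_cols
#> [1] "1$|2$|6$|dog$|cat$"

# Check which of the example column names the pattern matches
col_names <- c("a11", "a12", "a13", "a16", "my_cat", "my_dog", "my_snake")
hits <- col_names[grepl(my_cols, col_names)]
hits
#> [1] "a11"    "a12"    "a16"    "my_cat" "my_dog"
```

If an ending contains regex metacharacters such as . or (, it would need to be escaped before being pasted into the pattern.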


Answer 2:

I don't know if ends_with() is the best way to do this, but you could also do this in base R with a logical index.

# Extract the last character of the column names, and test if it is "1" or "2"
lgl_index <- substr(x     = names(df), 
                    start = nchar(names(df)), 
                    stop  = nchar(names(df))) %in% c("1", "2")

With this index, you can subset the data frame as follows:

df[, lgl_index]
  a11 a12
1   5   5
2   6   6
3   7   7
4   8   8

or with dplyr::select()

select(df, which(lgl_index))
  a11 a12
1   5   5
2   6   6
3   7   7
4   8   8

keeping only columns that end with either 1 or 2.
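
If the endings are longer than one character, the substr() test above no longer applies directly. One possible base R variant, sketched here under the assumption of R 3.3.0 or later (where endsWith() is available), builds the logical index from a vector of suffixes. The extra my_dog column and the names suffixes, keep and result are added only for illustration.

```r
df <- data.frame(a10 = 1:4,
                 a11 = 5:8,
                 a20 = 1:4,
                 a12 = 5:8,
                 my_dog = letters[1:4])  # hypothetical extra column

suffixes <- c("1", "2", "dog")

# One logical vector per suffix, combined element-wise with OR
keep <- Reduce(`|`, lapply(suffixes, function(s) endsWith(names(df), s)))

# drop = FALSE keeps a data frame even if only one column matches
result <- df[, keep, drop = FALSE]
names(result)
#> [1] "a11"    "a12"    "my_dog"
```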



Answer 3:

From dplyr version 1.0.0, you can combine multiple selections using Boolean logic such as ! (negate), & (and) and | (or).

### Install development version on GitHub first until CRAN version is available
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(a10 = 1:4,
                 a11 = 5:8,
                 a20 = 1:4,
                 a12 = 5:8)

df %>% 
  select(ends_with("1") | ends_with("2"))
#>   a11 a12
#> 1   5   5
#> 2   6   6
#> 3   7   7
#> 4   8   8

Alternatively, because the desired names share the prefix "a" followed by a number, use num_range() to select them:

df %>% 
  select(num_range(prefix = "a", range = 11:12))
#>   a11 a12
#> 1   5   5
#> 2   6   6
#> 3   7   7
#> 4   8   8
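
The other Boolean operators mentioned at the start of this answer work the same way. As a hedged sketch, assuming dplyr 1.0.0 or later is installed: ends_with() also accepts a character vector and takes the union of the matches, which covers the original question without any Boolean operator. The result names (by_vector, by_negation, by_and) are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(a10 = 1:4,
                 a11 = 5:8,
                 a20 = 1:4,
                 a12 = 5:8)

# A character vector of endings: the union of the matches is selected
by_vector <- df %>% select(ends_with(c("1", "2")))

# Negation: every column that does not end with "0"
by_negation <- df %>% select(!ends_with("0"))

# Intersection: starts with "a" and ends with "1"
by_and <- df %>% select(starts_with("a") & ends_with("1"))

names(by_vector)
#> [1] "a11" "a12"
names(by_negation)
#> [1] "a11" "a12"
names(by_and)
#> [1] "a11"
```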

Created on 2020-02-17 by the reprex package (v0.3.0)