grepl across multiple, specified columns

Posted 2019-04-10 03:39

Question:

I want to create a new column in my data frame that is either TRUE or FALSE depending on whether a term occurs in either of two specified columns. This is some example data:

AB <- c('CHINAS PARTY CONGRESS','JAPAN-US RELATIONS','JAPAN TRIES TO')
TI <- c('AMERICAN FOREIGN POLICY', 'CHINESE ATTEMPTS TO', 'BRITAIN HAS TEA')
AU <- c('AUTHOR 1', 'AUTHOR 2','AUTHOR 3')
M  <- data.frame(AB,TI,AU)

I can do it for one column or the other, but I cannot figure out how to do it for both. In other words, I don't know how to combine these two lines so that they do not overwrite each other.

M$China <- mapply(grepl, "CHINA|CHINESE|SINO", x=M$AB)
M$China <- mapply(grepl, "CHINA|CHINESE|SINO", x=M$TI)

It is important that I specify the columns; I cannot choose the whole data.frame. I have looked for other similar questions, but none seemed to apply to my case and I haven't been able to adapt any existing examples. This is what would make sense to me:

M$China <- mapply(grepl, "CHINA|CHINESE|SINO", x=(M$AB|M$TI))

Answer 1:

Using:

M$China <- !!rowSums(sapply(M[1:2], grepl, pattern = "CHINA|CHINESE|SINO"))

gives:

> M
                     AB                      TI       AU China
1 CHINAS PARTY CONGRESS AMERICAN FOREIGN POLICY AUTHOR 1  TRUE
2    JAPAN-US RELATIONS     CHINESE ATTEMPTS TO AUTHOR 2  TRUE
3        JAPAN TRIES TO         BRITAIN HAS TEA AUTHOR 3 FALSE

What this does:

  • sapply(M[1:2], grepl, pattern = "CHINA|CHINESE|SINO") loops over the first two columns (AB and TI) and checks, for each value, whether any alternative in the pattern ("CHINA|CHINESE|SINO") is present.
  • The sapply-call returns a matrix of TRUE/FALSE values:

            AB    TI
    [1,]  TRUE FALSE
    [2,] FALSE  TRUE
    [3,] FALSE FALSE
    
  • With rowSums you check how many TRUE-values each row has.

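  • By adding !! in front of rowSums, you convert all values from the rowSums call that are higher than zero to TRUE and all zeros to FALSE: the first ! coerces the counts to logical and negates them, and the second ! negates them back.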
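
The steps above can be traced one at a time. This is a minimal sketch that reuses the example data from the question; the intermediate names `pat`, `hits`, and `n_hits` are introduced here for illustration only and are not part of the original answer.

```r
# Example data from the question
AB <- c('CHINAS PARTY CONGRESS', 'JAPAN-US RELATIONS', 'JAPAN TRIES TO')
TI <- c('AMERICAN FOREIGN POLICY', 'CHINESE ATTEMPTS TO', 'BRITAIN HAS TEA')
AU <- c('AUTHOR 1', 'AUTHOR 2', 'AUTHOR 3')
M  <- data.frame(AB, TI, AU)

pat <- "CHINA|CHINESE|SINO"  # illustrative name for the pattern

# Step 1: one logical column per searched column (a 3 x 2 logical matrix)
hits <- sapply(M[c("AB", "TI")], grepl, pattern = pat)

# Step 2: number of matching columns in each row
n_hits <- rowSums(hits)

# Step 3: any count above zero becomes TRUE; `n_hits > 0` is equivalent to `!!n_hits`
M$China <- n_hits > 0
```

Selecting the columns by name (`M[c("AB", "TI")]`) rather than by position keeps the code correct if the column order changes.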


Answer 2:

If we need to collapse the result to a single vector, use Map to loop through the specified columns and apply the pattern to each, which gives a list of logical vectors, then Reduce that list to a single logical vector using |

M$China <- Reduce(`|`, Map(grepl, "CHINA|CHINESE|SINO", M[c("AB", "TI")]))
M
#                     AB                      TI       AU China
#1 CHINAS PARTY CONGRESS AMERICAN FOREIGN POLICY AUTHOR 1  TRUE
#2    JAPAN-US RELATIONS     CHINESE ATTEMPTS TO AUTHOR 2  TRUE
#3        JAPAN TRIES TO         BRITAIN HAS TEA AUTHOR 3 FALSE
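
To see what `Reduce` is doing here, it may help to compare it with the explicit two-column form. The sketch below restricts the search to AB and TI, as the question requires; `pat`, `cols`, `hits`, and `explicit` are illustrative names, not part of the original answer.

```r
# Example data from the question
AB <- c('CHINAS PARTY CONGRESS', 'JAPAN-US RELATIONS', 'JAPAN TRIES TO')
TI <- c('AMERICAN FOREIGN POLICY', 'CHINESE ATTEMPTS TO', 'BRITAIN HAS TEA')
AU <- c('AUTHOR 1', 'AUTHOR 2', 'AUTHOR 3')
M  <- data.frame(AB, TI, AU)

pat  <- "CHINA|CHINESE|SINO"
cols <- c("AB", "TI")  # the columns to search

# Map returns a list with one logical vector per selected column
hits <- Map(grepl, pat, M[cols])

# Reduce folds the list element-wise with `|`
M$China <- Reduce(`|`, hits)

# For two columns this is the same as writing the OR out by hand
explicit <- grepl(pat, M$AB) | grepl(pat, M$TI)
```

The `Reduce` form scales to any number of columns by changing `cols`, whereas the hand-written form needs one more `grepl` call per column.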

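Or, using the same approach with the tidyverse (funs() is deprecated in current dplyr, so a formula is used instead):

library(tidyverse)
M %>%
   select(AB, TI) %>%                                       # search only the specified columns
   mutate_all(~ str_detect(., "CHINA|CHINESE|SINO")) %>%    # one logical column per input column
   reduce(`|`) %>%                                          # OR the logical columns together
   mutate(M, China = .)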


Tags: r grepl