Counting (for each row) how many times an OR condi

2019-08-05 08:31发布

问题:

My question is similar to this one, except a bit different. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition involves several columns with an OR condition, and my real data has many columns, so ideally, I'd like to reference the columns using a regular expression.

I have the following data:

colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
   colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1     No   Yes    No    No    No   Yes   Yes    No Yes Yes
2     No   Yes   Yes   Yes   Yes    No   Yes    No  No  No
3     No    No    No   Yes    No    No    No    No Yes  No
4    Yes    No   Yes   Yes   Yes   Yes   Yes   Yes  No Yes
5    Yes    No   Yes    No    No    No    No   Yes  No Yes
6    Yes    No    No    No   Yes   Yes    No    No  No  No
7    Yes    No    No    No   Yes   Yes   Yes    No Yes  No
8    Yes    No   Yes    No   Yes   Yes    No   Yes Yes  No
9     No   Yes    No    No    No   Yes   Yes    No  No  No
10   Yes   Yes    No    No   Yes    No   Yes    No Yes  No
11    No   Yes    No    No   Yes    No   Yes   Yes  No  No
12    No   Yes   Yes   Yes    No    No   Yes    No  No  No
13    No    No   Yes   Yes    No   Yes   Yes   Yes Yes  No
14   Yes   Yes    No    No    No    No   Yes    No  No Yes
15   Yes    No   Yes   Yes    No   Yes    No   Yes  No  No
16    No   Yes   Yes    No    No    No   Yes    No  No  No
17   Yes    No    No    No    No   Yes   Yes   Yes  No Yes
18   Yes    No   Yes   Yes    No    No    No    No  No Yes
19    No    No    No    No    No   Yes    No    No  No Yes
20    No   Yes    No    No   Yes   Yes   Yes    No  No  No

I would like to create a new column Nb that records, for each line: the number of times at least one of colA2, colA3,colA4 is =="Yes" plus the number of times at least one of colB2, colB3,colB4 is =="Yes".

If there was not this "OR" condition implied when look at a group of columns [colA2, colA3,colA4], and I was adding the number of columns satisfying the condition, I could have used something like:

df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")

I would like to use regex if possible to reference the columns, as in my real data letters and numbers go further than B and 5 respectively.

Thank you!

回答1:

You could adapt your rowSums approach to just the groups of columns in each of your OR conditions, then add > 0 to make it "at least one." Thus, "at least one of the A values is Yes" would look like:

rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0

Then you can combine them using +:

(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
 (rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)

Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is the reason operations like "using columns with the value 'A'" are very awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:

library(dplyr)
library(tidyr)
df$index <- 1:nrow(df)
newdf <- df %>% gather(key, value, colA1:colB4) %>%
    separate(key, c("col", "letter", "number"), c(-3, -2)) %>%
    mutate(number = as.numeric(number))

This rearranges your data as (note that I gave each of your rows its own "index" variable):

  meh muh index col letter number value
1 Yes  No     1 col      A      1   Yes
2 Yes  No     2 col      A      1   Yes
3  No  No     3 col      A      1   Yes
4 Yes  No     4 col      A      1    No
5 Yes Yes     5 col      A      1    No
6 Yes Yes     6 col      A      1   Yes

You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1).

Here's how you would perform the kind of OR operation you're describing:

hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
              summarize(hasyes = any(value == "Yes"))

For each of your original row+letter combinations, you now have a logical value for whether Yes appears:

  index letter hasyes
1     1      A   TRUE
2     1      B   TRUE
3     2      A   TRUE
4     2      B   TRUE
5     3      A  FALSE
6     3      B   TRUE

One more summarizing operation gets this into the form you want:

result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))

What's important about this solution is that it will work for any number of letters you have (that is, if it goes from A-Z instead of just A and B) equally easily.