How do I create a new column based on multiple con

2020-02-01 02:44发布

问题:

I'm trying add a new column to a data frame based on several conditions from other columns. I have the following data:

> commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
> kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
> distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
> 
> df = data.frame(commute, kids, distance)
> df
   commute kids distance
1     walk  Yes        1
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7
6     walk  Yes        2
7     bike   No         
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7
11    walk   No         
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7

If the following three conditions are met:

commute = walk OR bike OR subway OR ferry
AND
kids = Yes
AND
distance is less than 10

Then I'd like a new column called get.flyer to equal "Yes". The final data frame should look like this:

   commute kids distance get.flyer
1     walk  Yes        1       Yes
2     bike  Yes       12       Yes
3   subway   No        5          
4    drive   No       25          
5    ferry  Yes        7       Yes
6     walk  Yes        2       Yes
7     bike   No                   
8   subway   No        8          
9    drive  Yes       19          
10   ferry  Yes        7       Yes
11    walk   No                   
12    bike   No        4          
13  subway  Yes       16       Yes
14   drive   No       12          
15   ferry  Yes        7       Yes

回答1:

We can use %in% for comparing multiple elements in a column, & to check if both conditions are TRUE.

library(dplyr)
df %>%
     mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") & 
           as.character(kids) == "Yes" & 
           as.numeric(as.character(distance)) < 10)+1] )

It is better to create the data.frame with stringsAsFactors=FALSE as by default it is TRUE. If we check the str(df), we can find that all the columns are factor class. Also, if there are missing values, instead of "", NA can be used to avoid converting the class of a numeric column to something else.

If we rewrite the creation of 'df'

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)

the above code can be simplified

df1 %>%
    mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
        kids == "Yes" &
        distance < 10)+1] )

For better understanding, some people prefer ifelse

df1 %>% 
   mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
                kids == "Yes" &
                distance < 10, 
                          "Yes", ""))

This can be also done easily with base R methods

df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
              kids == "Yes" & 
              distance < 10, 
                       "Yes", ""))


回答2:

The solution is already pointed out by @akrun. I'd like to present it in a more 'wrapped up' way.

You can use the ifelse statement to create a column based on one (or more) conditions. But first you have to change the 'encoding' of missing values in the distance column. You used "" to indicate a missing value, this however converts the entire column to string and inhibits numerical comparison (distance < 10 is not possible). The R way of indicating a missing value is NA, your column definition of distance should be:

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)

The ifelse statement then looks like this:

df$get.flyer <- ifelse(
    ( 
        (df$commute %in% c("walk", "bike", "subway", "ferry")) &
        (df$kids == "Yes")                                     &
        (df$distance < 10)
    ),
    1,  # if condition is met, put 1
    0   # else put 0
)

Optional: Consider encoding your other columns in a different way as well:

  • you could use TRUE and FALSE instead of "Yes" and "No" for the kids variable
  • you could use a factor for commute


回答3:

Example, check if first_column_name is contained in second_column_name and write result to new_column

df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))

Details:

df$new_column <- # create a new column with name new_column on df
apply(df, 1 # `1` means for each row, `apply(df` means apply the following function on df
function(x) # Function definition to apply on each row, `x` means input row for each row.
grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE)) # Body of function to apply, basically run grepl to find if first_column_name is in second_column_name, fixed = TRUE means don't use regular expression just the plain text from first_column_name.