add column to data frame, testing categorical vari

Posted 2020-07-22 06:45

Question:

I have referred to:

  • How to add a factor column to dataframe based on a conditional statement from another column?,
  • How to add column into a dataframe based on condition in R programming and
  • R: Add column with condition-check on three columns?

All of these examples test either numeric vectors or NA values in other columns before adding a new variable. Here's a short reproducible example:

x <- c("dec 12", "jan 13", "feb 13", "march 13", "apr 13", "may 13",
       "june 13", "july 13", "aug 13", "sep 13", "oct 13", "nov 13")
y <- c(234, 678, 534, 122, 179, 987, 872, 730, 295, 450, 590, 312)
df<-data.frame(x,y)

I want to add a new column holding "winter" where df$x is dec, jan or feb, "spring" where it is march, apr or may, and likewise "summer" and "autumn" for the remaining months.

I tried

df$season <- ifelse(df[1:3, ], "winter", ifelse(df[4:6, ], "spring", 
                    ifelse(df[7:9, ], "summer", "autumn")))

which I know is a very inefficient way of doing things but I'm a newbie and a kludger. It returned the error:

Error in ifelse(df[1:3, ], "winter", ifelse(df[4:6, ], "spring",
ifelse(df[7:9,  : (list) object cannot be coerced to type 'logical'

If the same data frame had thousands of rows and I wanted to loop through it and create a new variable for season based on month of the year, how could I do this? I referred to "Looping through a data frame to add a column depending variables in other columns", but that example loops and uses a mathematical operator to create the new variable. I also tried external resources: a thread on the R mailing list and a thread on the TalkStats forum. However, both are again based on numeric variables and conditions.

Answer 1:

If you have a really large data frame, then data.table will be very helpful for you. The following works:

library(data.table)
x <- c("dec 12", "jan 13", "feb 13", "march 13", "apr 13", "may 13",
   "june 13", "july 13", "aug 13", "sep 13", "oct 13", "nov 13")
y <- c(234, 678, 534, 122, 179, 987, 872, 730, 295, 450, 590, 312)
df <-data.frame(x,y)
DT <- data.table(df)
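# Reduce each label to a lower-case three-letter month key ("march 13" -> "mar")
DT[, month := substr(tolower(x), 1, 3)]
# Map each month key to a season; any unmatched key becomes NA
DT[, season := ifelse(month %in% c("dec", "jan", "feb"), "winter",
               ifelse(month %in% c("mar", "apr", "may"), "spring",
               ifelse(month %in% c("jun", "jul", "aug"), "summer", 
               ifelse(month %in% c("sep", "oct", "nov"), "autumn", NA))))]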
DT
           x   y month season
 1:   dec 12 234   dec winter
 2:   jan 13 678   jan winter
 3:   feb 13 534   feb winter
 4: march 13 122   mar spring
 5:   apr 13 179   apr spring
 6:   may 13 987   may spring
 7:  june 13 872   jun summer
 8:  july 13 730   jul summer
 9:   aug 13 295   aug summer
10:   sep 13 450   sep autumn
11:   oct 13 590   oct autumn
12:   nov 13 312   nov autumn
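
For comparison, the same mapping can be sketched in base R without data.table, using a named character vector as a lookup table instead of nested ifelse calls. This is only one possible approach, not part of the original answer; the names season_lookup and month_key are made up for illustration.

x <- c("dec 12", "jan 13", "feb 13", "march 13", "apr 13", "may 13",
       "june 13", "july 13", "aug 13", "sep 13", "oct 13", "nov 13")
y <- c(234, 678, 534, 122, 179, 987, 872, 730, 295, 450, 590, 312)
df <- data.frame(x, y, stringsAsFactors = FALSE)

# Hypothetical lookup vector: names are three-letter month keys, values are seasons
season_lookup <- c(dec = "winter", jan = "winter", feb = "winter",
                   mar = "spring", apr = "spring", may = "spring",
                   jun = "summer", jul = "summer", aug = "summer",
                   sep = "autumn", oct = "autumn", nov = "autumn")

# Lower-case the label and keep its first three letters ("march 13" -> "mar")
month_key <- substr(tolower(df$x), 1, 3)

# Index the lookup vector by name; keys with no match yield NA
df$season <- unname(season_lookup[month_key])

Because the indexing is vectorised, no explicit loop is needed even for thousands of rows, and adding or changing a month only means editing the lookup vector.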