Add extra level to factors in dataframe

Posted 2019-01-11 04:06

Question:

I have a data frame with numeric and ordered factor columns. There are a lot of NA values, which have no level assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level. Here is how I started, but I don't know how to finish it in an elegant way:

addNoAnswer = function(df) {
   factorOrNot = sapply(df, is.factor)
   levelsList = lapply(df[, factorOrNot], levels)
   levelsList = lapply(levelsList, function(x) c(x, "No Answer"))
   ...

Is there a way to directly apply new levels to factor columns, for example, something like this:

df[, factorOrNot] = lapply(df[, factorOrNot], factor, levelsList)

Of course, this doesn't work correctly.

I want the order of the levels preserved and the "No Answer" level added in last place.

Answer 1:

You could define a function that adds the level to a factor, but returns anything else unchanged:

addNoAnswer <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "No Answer")))
  return(x)
}

Then you just lapply this function over your columns:

df <- as.data.frame(lapply(df, addNoAnswer))

That should return what you want.
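Adding the level does not by itself recode the existing NA values. Below is a minimal sketch that does both, assuming the goal is to turn every NA in a factor column into "No Answer". The helper name addNoAnswerLevel and the sample data are hypothetical. Assigning into df[] keeps the data frame structure and leaves non-factor columns untouched:

```r
# Hypothetical helper: append "No Answer" as the last level and recode NAs.
addNoAnswerLevel <- function(x, label = "No Answer") {
  if (!is.factor(x)) return(x)            # leave numeric/character columns alone
  # factor() keeps ordered-ness by default (ordered = is.ordered(x));
  # union() avoids a duplicated level if the function is run twice
  x <- factor(x, levels = union(levels(x), label))
  x[is.na(x)] <- label
  x
}

# Illustrative data, not taken from the question
df <- data.frame(
  score  = c(1.5, NA, 3),
  rating = factor(c("low", NA, "high"), levels = c("low", "high"), ordered = TRUE)
)

df[] <- lapply(df, addNoAnswerLevel)
levels(df$rating)
is.ordered(df$rating)
```

The numeric column keeps its NA, and the ordered factor keeps its original level order with the new level in last place.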



Answer 2:

The levels function has a replacement form, levels(x) <- value, so it's very easy to add levels:

f1 <- factor(c("a", "a", NA, NA, "b", NA, "a", "c", "a", "c", "b"))
str(f1)
 Factor w/ 3 levels "a","b","c": 1 1 NA NA 2 NA 1 3 1 3 ...
levels(f1) <- c(levels(f1),"No Answer")
f1[is.na(f1)] <- "No Answer"
str(f1)
 Factor w/ 4 levels "a","b","c","No Answer": 1 1 4 4 2 4 1 3 1 3 ...

You can then loop it around all variables in a data.frame:

f1 <- factor(c("a", "a", NA, NA, "b", NA, "a", "c", "a", "c", "b"))
f2 <- factor(c("c", NA, "b", NA, "b", NA, "c" ,"a", "d", "a", "b"))
f3 <- factor(c(NA, "b", NA, "b", NA, NA, "c", NA, "d" , "e", "a"))
df1 <- data.frame(f1,n1=1:11,f2,f3)

str(df1)
  'data.frame':   11 obs. of  4 variables:
  $ f1: Factor w/ 3 levels "a","b","c": 1 1 NA NA 2 NA 1 3 1 3 ...
  $ n1: int  1 2 3 4 5 6 7 8 9 10 ...
  $ f2: Factor w/ 4 levels "a","b","c","d": 3 NA 2 NA 2 NA 3 1 4 1 ...
  $ f3: Factor w/ 5 levels "a","b","c","d",..: NA 2 NA 2 NA NA 3 NA 4 5 ...    

for(i in 1:ncol(df1)) if(is.factor(df1[,i])) levels(df1[,i]) <- c(levels(df1[,i]),"No Answer")
df1[is.na(df1)] <- "No Answer"

str(df1)
 'data.frame':   11 obs. of  4 variables:
  $ f1: Factor w/ 4 levels "a","b","c","No Answer": 1 1 4 4 2 4 1 3 1 3 ...
  $ n1: int  1 2 3 4 5 6 7 8 9 10 ...
  $ f2: Factor w/ 5 levels "a","b","c","d",..: 3 5 2 5 2 5 3 1 4 1 ...
  $ f3: Factor w/ 6 levels "a","b","c","d",..: 6 2 6 2 6 6 3 6 4 5 ...
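One caveat: df1[is.na(df1)] <- "No Answer" targets NAs in every column, so a numeric column that contains missing values would be coerced to character. The example above is unaffected because n1 has no NAs. A sketch that limits the replacement to factor columns (the data frame d is made up for illustration):

```r
# Illustrative data with an NA in both a factor and a numeric column
d <- data.frame(
  f = factor(c("a", NA, "b")),
  n = c(1, NA, 3)
)

# Logical index of factor columns
isFac <- vapply(d, is.factor, logical(1))

# Extend levels and recode NAs in factor columns only
d[isFac] <- lapply(d[isFac], function(x) {
  levels(x) <- c(levels(x), "No Answer")
  x[is.na(x)] <- "No Answer"
  x
})

str(d)
```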


Answer 3:

Since this question was last answered, this has become possible using fct_explicit_na() from the forcats package. Here is the example given in the documentation:

f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))
table(f1)

# f1
# a b c 
# 4 2 2 

f2 <- forcats::fct_explicit_na(f1)
table(f2)

# f2
#     a         b         c (Missing) 
#     4         2         2         3 

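The default label is (Missing), but this can be changed via the na_level argument. In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level(), which takes the label through its level argument.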
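If forcats is not available, a similar result can be sketched in base R with addNA(), which turns NA into a level of its own, followed by renaming that level. This approximates the behaviour and is not the forcats implementation:

```r
f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))

f2 <- addNA(f1)                                # NA becomes the last level
levels(f2)[is.na(levels(f2))] <- "No Answer"   # give that level a name

table(f2)
```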



Answer 4:

Expanding on ilir's answer and its comment, you can check that a column is a factor and that it does not already contain the new level, then add the level. This makes the function re-runnable:

addLevel <- function(x, newlevel=NULL) {
  # Only touch factors, and only when a level was supplied;
  # without the NULL check, if() would fail on a zero-length condition
  if(is.factor(x) && !is.null(newlevel)) {
    if (is.na(match(newlevel, levels(x))))
      return(factor(x, levels=c(levels(x), newlevel)))
  }
  return(x)
}

You can then apply it like so:

dataFrame$column <- addLevel(dataFrame$column, "newLevel")
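To illustrate the re-runnable property, here is a sketch that applies the function to every column of a data frame twice. The data frame contents are hypothetical, and addLevel() is restated with an explicit NULL guard so that the snippet runs on its own:

```r
# Guarded restatement of addLevel() so this snippet is self-contained
addLevel <- function(x, newlevel = NULL) {
  if (is.factor(x) && !is.null(newlevel) && !(newlevel %in% levels(x))) {
    x <- factor(x, levels = c(levels(x), newlevel))
  }
  x
}

# Illustrative data frame (hypothetical)
dataFrame <- data.frame(
  column = factor(c("a", NA, "b")),
  value  = c(10, 20, 30)
)

# Apply to every column; a second run leaves the levels unchanged
dataFrame[] <- lapply(dataFrame, addLevel, newlevel = "newLevel")
firstRun <- levels(dataFrame$column)
dataFrame[] <- lapply(dataFrame, addLevel, newlevel = "newLevel")
identical(firstRun, levels(dataFrame$column))
```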


Answer 5:

You need to convert the column to character, then add the new level based on the condition, and finally convert the column back to a factor.

Steps:

1. First convert the factor column to character:

        df$column2 <- as.character(df$column2)

2. Add the new level:

        df$column2[df$column1 == "XYZ"] <- "new_level"

3. Convert to factor again:

        df$column2 <- as.factor(df$column2)
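Note that as.factor() on a character vector rebuilds the levels in sorted order, so a custom level order is lost by the round trip. A sketch of the same three steps that keeps the original order and puts the new level last (the sample data is invented for illustration):

```r
# Illustrative data: column2 has a deliberate, non-alphabetical level order
df <- data.frame(
  column1 = c("ABC", "XYZ", "ABC"),
  column2 = factor(c("low", "high", "low"), levels = c("low", "high"))
)

oldLevels <- levels(df$column2)            # remember the original order
vals <- as.character(df$column2)           # step 1: factor -> character
vals[df$column1 == "XYZ"] <- "new_level"   # step 2: set the new value
# step 3: rebuild the factor with explicit levels instead of as.factor()
df$column2 <- factor(vals, levels = c(oldLevels, "new_level"))

levels(df$column2)
```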


Answer 6:

I have a very simple answer that may not directly address your specific scenario, but is a simple way to do this in general:

levels(df$column) <- c(levels(df$column), newFactorLevel)
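Here newFactorLevel stands for a character string holding the new level. A short sketch of why the level has to exist before a value can be assigned, using a made-up factor:

```r
f <- factor(c("a", "b", "a"))

# Without the level, the assignment produces NA (R also emits a warning,
# suppressed here to keep the output clean)
g <- f
suppressWarnings(g[1] <- "z")
is.na(g[1])

# After extending the levels, the same assignment succeeds
levels(f) <- c(levels(f), "z")
f[1] <- "z"
f
```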