Change level of multiple factor variables

Posted 2019-09-16 04:35

Question:

Hi everyone,

I want to preface this by saying that I already looked at this link to try to solve my problem:

Applying the same factor levels to multiple variables in an R data frame

The difference is that in that problem, the OP wanted to change the levels of factors that all had the same levels. In my instance, I'm looking to change just the first level, which is set to ' ', to something like 'Unknown' and leave the rest of the levels alone. I know I could do this in a "non-R" way with something like this:

for (i in 64:88) {
  # df[[i]] addresses the column directly; levels<- cannot assign into eval(parse(...))
  levels(df[[i]])[levels(df[[i]]) == ' '] <- 'Unknown'
}

But that's an inefficient way to do it. Trying to use the method proposed in the question linked above gave me this code:

df[64:88] <- lapply(df[64:88], factor, levels=c('Unknown', ??))

I don't know what to put in place of the question marks. I tried using just "levels[-1]" but it's obvious why that didn't work. I also tried "levels(df[64:88])[-1]" but again no good. So I tried to revamp the code with the following:

df[64:88] <- lapply(df[64:88], function(x) levels(x)[levels(x) == ' '] <- 'Unknown')

but I get NULL whenever I call levels(df$transaction_type1) (where transaction_type1 is the column name of df[64]).

What am I missing here?

Thanks in advance for your help!

Per a couple of requests, here is an example of my data:

df$transaction_type1[1:100]
  [1]                                                                                                                                                
 [13] HOME RENEW                                                                                                                                     
 [25]                                                                                                                                                
 [37]                                                                                                                                                
 [49]                                                                                                                                                
 [61] AUTO MANAGE                                                                                     AUTO RENEW                                     
 [73]             AUTO MANAGE                                                                                     AUTO RENEW                         
 [85]                                                                                                                                                
 [97]                                                
Levels:   AUTO CLAIM AUTO MANAGE AUTO PURCHASE AUTO RENEW HOME CLAIM HOME RENEW

As you can see, there are a lot of values equal to ' ', and all 25 variables look just like this, but with different levels. My data consists of 222 variables and 24,850 rows, so I don't know what the standard is on SO for giving example data. Also, this snippet of code might help as well:

> levels(df$transaction_type1)
#[1] " "             "AUTO CLAIM"    "AUTO MANAGE"   "AUTO PURCHASE" "AUTO RENEW"    "HOME CLAIM"    "HOME RENEW"

> levels(df$transaction_type1)[levels(df$transaction_type1) == ' '] <- 'Unknown'
> levels(df$transaction_type1)
#[1] "Unknown"       "AUTO CLAIM"    "AUTO MANAGE"   "AUTO PURCHASE" "AUTO RENEW"    "HOME CLAIM"    "HOME RENEW"   

If more information is needed, please let me know so I can provide it and also learn the SO standards of asking for help. Thanks!

Answer 1:

Something like this?

# it seems like your original data has a structure like this
df <- data.frame(x = factor(c("a", "", "b"), levels = c("", "a", "b")),
                 y = factor(c("c", "", "d"), levels = c("", "c", "d")))

lapply(df, levels)
# $x
# [1] ""  "a" "b"
# 
# $y
# [1] ""  "c" "d"    

# change the "" level to "unknown", and return the updated vector
df[] <- lapply(df, function(x) {
  levels(x)[levels(x) == ""] <- "unknown"
  x
})

lapply(df, levels)
# $x
# [1] "unknown" "a"       "b"      
# 
# $y
# [1] "unknown" "c"       "d"
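
To tie this back to the question: the attempt there most likely returned NULL levels because the anonymous function ended on the assignment, so its return value was the string 'Unknown' rather than the factor. Below is a minimal sketch of the same fix restricted to a column range and matching the single-space level ' ' from the question. The toy data frame and the `cols` index are made up for illustration; with the real data, `cols` would presumably be 64:88.

```r
# Hypothetical stand-in for the asker's data: two factor columns that contain
# a single-space level, plus one column that should be left untouched.
df <- data.frame(
  id = 1:3,
  transaction_type1 = factor(c(" ", "AUTO RENEW", "HOME RENEW")),
  transaction_type2 = factor(c("AUTO CLAIM", " ", " "))
)

cols <- 2:3  # assumed stand-in for 64:88 in the question

# Without a trailing `x`, the function returns the value of the assignment,
# i.e. the string "Unknown", so every column would become that string.
broken <- lapply(df[cols], function(x) levels(x)[levels(x) == " "] <- "Unknown")

# Returning `x` hands back the modified factor itself.
df[cols] <- lapply(df[cols], function(x) {
  levels(x)[levels(x) == " "] <- "Unknown"
  x
})

levels(df$transaction_type1)
```

Here `broken$transaction_type1` is a plain character string, which is why `levels()` on it gives NULL, while the columns assigned back into `df[cols]` are still factors with only the ' ' level renamed.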