I use factors somewhat infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am collapsing categories with few observations into "other" and am looking for a quick way to do that. I have perhaps 20 levels of a variable, but am interested in collapsing a bunch of them into one.
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111', '621112', '621210', '621310', '621320', '621330', '621340',
                                    '621391', '621399', '621410', '621420', '621491', '621492', '621493',
                                    '621498', '621511', '621512', '621610', '621910', '621991', '621999'),
                                  500, replace = TRUE))
Here are my levels of interest, and their labels in separate vectors.
#levels and labels
top8 <- c('621111', '621210', '621399', '621610', '621330',
          '621310', '621511', '621420', '621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')
I could use the factor() call and enumerate them all, assigning "other" each time a category has few observations.
Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are correctly coded and everything else is recoded as other?
I think the easiest way is to relabel all the naics values not in top8 to a special value (this assumes data$naics is a character vector, not already a factor).
data$naics[!(data$naics %in% top8)] = -99
Then you can use the exclude argument when turning it into a factor; the excluded value becomes NA rather than a level:
factor(data$naics, exclude=-99)
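If the goal is an explicit "other" level carrying the descriptive labels, rather than NA, a minimal base R sketch might look like the following. It uses shortened stand-ins (codes and descs) for top8 and top8_desc, a small made-up data frame, and a hypothetical new column name naics_f; none of these names come from the question.

```r
set.seed(1)

# Shortened, illustrative stand-ins for top8 / top8_desc
codes <- c('621111', '621210', '621399')
descs <- c('Offices of physicians',
           'Offices of dentists',
           'Offices of all other miscellaneous health practitioners')

# Small example data; '621999' and '621991' play the role of rare codes
data <- data.frame(naics = sample(c(codes, '621999', '621991'), 50, replace = TRUE),
                   stringsAsFactors = FALSE)

# Work on a character copy so the approach also applies if naics is already a factor
naics_chr <- as.character(data$naics)
naics_chr[!(naics_chr %in% codes)] <- 'other'

# levels fixes the order of the codes, labels attaches the descriptions
data$naics_f <- factor(naics_chr,
                       levels = c(codes, 'other'),
                       labels = c(descs, 'other'))
table(data$naics_f)
```

Listing 'other' last in levels keeps it at the end of tables and plot legends.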
You can use forcats::fct_other():
library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')
Or use fct_other() as part of a dplyr::mutate():
library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other'))
data %>% head(10)
employees naics
1 420 other
2 264 other
3 189 other
4 157 621610
5 376 621610
6 236 other
7 658 621320
8 959 621320
9 216 other
10 156 other
Note that if the argument other_level is not set, the other levels default to 'Other' (uppercase 'O').
Conversely, if you had only a few levels you wanted converted to 'other', you could use the argument drop instead:
data %>%
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>%
  head(10)
employees naics keep_fct drop_fct
1 474 621491 other 621491
2 805 621111 621111 other
3 434 621910 other 621910
4 845 621111 621111 other
5 243 621340 other 621340
6 466 621493 other 621493
7 369 621111 621111 other
8 57 621493 other 621493
9 144 621491 other 621491
10 786 621910 other 621910
dplyr also has recode_factor(), where you can set the .default argument to 'other', but with a larger number of levels to recode, as in this example, it could be tedious:
data %>%
  mutate(naics = recode_factor(naics,
                               `621111` = '621111', `621210` = '621210', `621399` = '621399',
                               `621610` = '621610', `621330` = '621330', `621310` = '621310',
                               `621511` = '621511', `621420` = '621420', `621320` = '621320',
                               .default = 'other'))
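One way to avoid typing every pair by hand is to build a named vector and splice it into recode_factor() with !!!, which the dplyr documentation for recode() describes. The sketch below is illustrative only: level_key, codes, descs and x are made-up names with a shortened list of codes, and it assumes a dplyr version in which recode_factor() is still available (it is marked as superseded in recent releases).

```r
library(dplyr)

# Shortened, illustrative stand-ins for top8 / top8_desc
codes <- c('621111', '621210')
descs <- c('Offices of physicians', 'Offices of dentists')

# Named vector: names are the old values, elements are the new labels
level_key <- setNames(descs, codes)

x <- c('621111', '621999', '621210', '621991')

# !!! splices the named vector in as individual old = new arguments;
# anything not named in level_key falls through to .default
res <- recode_factor(x, !!!level_key, .default = 'other')
as.character(res)
```

With the question's vectors, the key would be setNames(top8_desc, top8).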
A late entry
Here is a wrapper for plyr::mapvalues which allows a remaining argument (your other):
library(plyr)
Mapvalues <- function(x, from, to, warn_missing = TRUE, remaining = NULL) {
  if (!is.null(remaining)) {
    # every value not listed in `from` is mapped to `remaining`
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)
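For readers who would rather not load plyr, roughly the same mapping can be sketched in base R with a named lookup vector. This is an alternative technique, not what the wrapper above does internally, and codes, descs, lookup and mapped are illustrative names with a shortened list of codes.

```r
# Shortened, illustrative stand-ins for top8 / top8_desc
codes <- c('621111', '621210')
descs <- c('Offices of physicians', 'Offices of dentists')

x <- c('621111', '621999', '621210', NA)

# Named vector used as a lookup table: names are codes, values are labels
lookup <- setNames(descs, codes)

# Indexing by name returns NA for codes that are not in the table
mapped <- unname(lookup[as.character(x)])

# Only unmatched, non-missing inputs become 'other'; genuine NAs stay NA
mapped[is.na(mapped) & !is.na(x)] <- 'other'
mapped
```

Wrapping the result in factor() with explicit levels would give the same kind of factor as the other answers.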
I have written a function to do this that may be useful to others.
I first check, in relative terms, whether a level accounts for less than a share mp of the rows (mp = 0.01 means 1%). After that I limit the maximum number of levels to ml.
ds is the data set at hand, of type data.frame. I do this for all columns listed in cat_var_names, which are the factor columns.
cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])
recodeLevels <- function(ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # collapse less frequent factor levels into "other"
  n <- nrow(ds)
  # keep levels that account for more than a share mp of the cases
  for (i in var_list) {
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[!levels(ds[[i]]) %in% keep] <- "other"
  }
  # keep at most the ml most frequent levels
  for (i in var_list) {
    keep <- names(head(sort(table(ds[[i]]), decreasing = TRUE), ml))
    levels(ds[[i]])[!levels(ds[[i]]) %in% keep] <- "other"
  }
  return(ds)
}
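The function relies on the fact that assigning the same name to several entries of levels() merges those levels. A small standalone sketch of that step, on a made-up factor f with an arbitrary count threshold of 2:

```r
# Toy factor: 'c' and 'd' each occur only once
f <- factor(c('a', 'a', 'a', 'b', 'b', 'c', 'd'))

counts <- table(f)
rare <- names(counts)[counts < 2]   # levels seen fewer than 2 times

# Renaming several levels to the same label merges them into one level
levels(f)[levels(f) %in% rare] <- 'other'

levels(f)
table(f)
```

Here 'c' and 'd' end up in a single 'other' level with two observations, while 'a' and 'b' are left unchanged.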