I use factors somewhat infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am collapsing categories with few observations into "other" and am looking for a quick way to do that. I have perhaps 20 levels of a variable, but am interested in collapsing a bunch of them into one.
# 100 sampled codes are recycled by data.frame() to fill the 500 rows
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340','621391','621399','621410','621420','621491','621492','621493','621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace = TRUE))
Here are my levels of interest, and their labels in separate vectors.
#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
'621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
'Offices of dentists',
'Offices of all other miscellaneous health practitioners',
'Home health care services',
'Offices of Mental Health Practitioners',
'Offices of chiropractors',
'Medical Laboratories',
'Outpatient Mental Health and Substance Abuse Centers',
'Offices of optometrists')
I could use the factor() call and enumerate them all, classifying a category as "other" each time it had few observations.
Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are correctly coded and everything else is recoded as other?
I think the easiest way is to relabel all the naics not in the top 8 to a special value.
Then you can use the "exclude" option when turning it into a factor.
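A minimal sketch of that idea in base R, since the original snippet did not survive. The toy data and the shortened top8 / top8_desc vectors are stand-ins I made up for illustration, and the sentinel string "other" is my choice; the answer does not say which special value it had in mind.

```r
# Shortened stand-ins for the question's objects (only three codes kept)
top8      <- c("621111", "621210", "621399")
top8_desc <- c("Offices of physicians",
               "Offices of dentists",
               "Offices of all other miscellaneous health practitioners")
data <- data.frame(naics = c("621111", "621210", "621999", "621399", "621512"),
                   stringsAsFactors = FALSE)

# Step 1: relabel every code outside top8 to a sentinel value
codes <- as.character(data$naics)
codes[!(codes %in% top8)] <- "other"

# Step 2a: keep the sentinel as its own level, with descriptive labels
data$naics_f <- factor(codes,
                       levels = c(top8, "other"),
                       labels = c(top8_desc, "other"))

# Step 2b: or pass the sentinel to exclude, which turns those rows into NA
data$naics_na <- factor(codes, exclude = "other")
```

Note that exclude drops the sentinel from the levels, so those rows become NA rather than a visible "other" category. If you want "other" to show up in tables and plots, step 2a is probably the one you want.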
A late entry: a wrapper for plyr::mapvalues which allows a remaining argument (your other).
I have written a function to do this that may be useful to others. I first check, in relative terms, whether a level occurs in less than mp percent of the observations. After that, I limit the maximum number of levels to ml.
ds is the data set at hand, of type data.frame; I do this for all columns that appear in cat_var_names as factors.
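The function itself was not preserved, so here is a rough sketch of what the description could look like in base R. The name collapse_rare_levels, the default values, and the decision to treat mp as a proportion (0.05 meaning 5%) and to count "other" towards the ml limit are all my assumptions, not the original author's code.

```r
# Hypothetical helper reconstructed from the description above.
# ds: data.frame; cat_var_names: character vector of column names;
# mp: minimum share of observations a level needs (assumed to be a proportion);
# ml: maximum number of levels per column, counting the "other" level.
collapse_rare_levels <- function(ds, cat_var_names, mp = 0.05, ml = 10,
                                 other = "other") {
  for (v in cat_var_names) {
    x <- as.character(ds[[v]])
    # Relative frequency of each level, most frequent first
    freq <- sort(prop.table(table(x)), decreasing = TRUE)
    # First pass: keep levels that reach the mp share
    keep <- names(freq)[as.vector(freq) >= mp]
    # Second pass: if anything is collapsed, leave one slot for `other`
    if (length(keep) < length(freq) || length(keep) > ml) {
      keep <- head(keep, ml - 1)
    }
    x[!is.na(x) & !(x %in% keep)] <- other
    has_other <- any(x == other, na.rm = TRUE)
    ds[[v]] <- factor(x, levels = c(keep, if (has_other) other))
  }
  ds
}

# Toy usage: shares are a = 50%, b = 30%, c = 15%, d = 4%, e = 1%
ds <- data.frame(g = c(rep("a", 50), rep("b", 30), rep("c", 15),
                       rep("d", 4), "e"),
                 h = 1:100)
out <- collapse_rare_levels(ds, cat_var_names = "g", mp = 0.05, ml = 3)
table(out$g)
```

With ml = 3 only the two most frequent levels survive next to "other"; raising ml to 10 would keep a, b and c and collapse only d and e.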
You can use forcats::fct_other(), either on its own or as part of a dplyr::mutate() call.
Note that if the argument other_level is not set, the other levels default to 'Other' (uppercase 'O').
Conversely, if you had only a few levels you wanted converted to 'other', you could use the argument drop instead.
dplyr also has recode_factor(), where you can set the .default argument to other, but with a larger number of levels to recode, as in this example, that could be tedious.
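The snippets from this answer were lost, so the following is only a sketch of the calls it describes. The toy data and the shortened top8 vector are placeholders of mine; fct_other() (with keep, drop and other_level), mutate() and recode_factor() (with .default) are the actual forcats and dplyr functions.

```r
library(forcats)
library(dplyr)

# Shortened stand-ins for the question's objects (only three codes kept)
top8 <- c("621111", "621210", "621399")
data <- data.frame(naics = c("621111", "621210", "621999", "621399", "621512"),
                   stringsAsFactors = FALSE)

# Keep the codes in top8 and lump everything else into "other"
data$naics_keep <- fct_other(factor(data$naics), keep = top8,
                             other_level = "other")

# The same thing inside a dplyr pipeline
data <- data %>%
  mutate(naics_mut = fct_other(factor(naics), keep = top8,
                               other_level = "other"))

# The reverse: name the few codes to collapse; other_level defaults to "Other"
data$naics_drop <- fct_other(factor(data$naics),
                             drop = c("621999", "621512"))

# recode_factor() needs every kept code spelled out, hence the tedium
data$naics_rec <- recode_factor(data$naics,
                                `621111` = "Offices of physicians",
                                `621210` = "Offices of dentists",
                                .default = "other")
```

In each case the lumped level is placed last in the level order, which is usually what you want for tables and plots.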