I have a data frame containing a factor. When I create a subset of this data frame using subset()
or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels -- even when they do not exist in the new data frame.
This creates headaches when doing faceted plotting or using functions that rely on factor levels.
What is the most succinct way to remove the unused levels from a factor in my new data frame?
Here's my example:
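df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)  # needed on R >= 4.0, where strings are no longer converted to factors by default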
levels(df$letters)
## [1] "a" "b" "c" "d" "e"
subdf <- subset(df, numbers <= 3)
subdf
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3
## but the levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"
Looking at the droplevels() method code in the R source, you can see it wraps the factor() function. That means you can basically recreate the column with factor().
Below is the data.table way to drop levels from all the factor columns.
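The snippet for this answer is not preserved here, so the following is only a sketch of how it might look, assuming the data.table package is installed and using the df from the question. The names dt and factor_cols are illustrative, not taken from the original answer.

```r
library(data.table)

df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
dt <- as.data.table(df)[numbers <= 3]

# Names of all factor columns (illustrative helper variable)
factor_cols <- names(dt)[vapply(dt, is.factor, logical(1))]

# Rebuild each factor column by reference; factor() keeps only the levels in use
dt[, (factor_cols) := lapply(.SD, factor), .SDcols = factor_cols]

levels(dt$letters)
## [1] "a" "b" "c"
```

Because := updates by reference, no copy of the table is made, which may matter for large data.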
For the sake of completeness, there is now also fct_drop() in the forcats package (http://forcats.tidyverse.org/reference/fct_drop.html). It differs from droplevels() in the way it deals with NA.
All you should have to do is to apply factor() to your variable again after subsetting:
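A minimal sketch of re-applying factor(), assuming the df and subdf from the question (with stringsAsFactors = TRUE so that letters is actually a factor):

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# Re-applying factor() rebuilds the levels from the values actually present
subdf$letters <- factor(subdf$letters)

levels(subdf$letters)
## [1] "a" "b" "c"
```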
EDIT
From the factor page example:
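The quoted example is not preserved here. The sketch below follows the spirit of the examples on the ?factor help page, and may not match the quoted code exactly:

```r
# Ten single-letter values, but all 26 letters declared as levels
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)
nlevels(ff)
## [1] 26

# Calling factor() on an existing factor drops the levels that do not occur
levels(factor(ff))
## [1] "a" "c" "i" "s" "t"
```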
For dropping levels from all factor columns in a dataframe, you can use:
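One way this could be written in base R, as a sketch that assumes the df from the question. Assigning into subdf[] keeps the data frame structure while each factor column is rebuilt:

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# Rebuild every factor column; leave non-factor columns untouched
subdf[] <- lapply(subdf, function(x) if (is.factor(x)) factor(x) else x)

levels(subdf$letters)
## [1] "a" "b" "c"
```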
This is obnoxious. This is how I usually do it, to avoid loading other packages:
which gets you:
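The code for this answer is missing here. A sketch of what the approach appears to be, assuming the df from the question: assign a new levels vector in which the unused positions are set to NA.

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# Old levels are a, b, c, d, e; mapping positions 4 and 5 to NA removes them
levels(subdf$letters) <- c("a", "b", "c", NA, NA)

levels(subdf$letters)
## [1] "a" "b" "c"
```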
Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:
won't work.
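To illustrate the pitfall, here is a sketch with a reordered levels vector. The exact vector used in the original answer is not preserved here, so this one is an assumption. Each new level replaces the old level at the same position, so the data values themselves get relabelled:

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
bad <- subset(df, numbers <= 3)$letters

# Positions map onto the old levels a, b, c, d, e:
# a -> NA, b -> "a", c -> "c", d -> NA, e -> "b"
levels(bad) <- c(NA, "a", "c", NA, "b")

# The values that were a, b, c are now NA, a, c
as.character(bad)
```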
This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.
Here is a way of doing that:
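The snippet for this answer is not preserved here, so it is unclear which approach was meant. One base R possibility, shown only as a sketch, is droplevels(), which accepts a whole data frame and drops the unused levels of every factor column:

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)

# droplevels() on a data frame processes each factor column
subdf <- droplevels(subset(df, numbers <= 3))

levels(subdf$letters)
## [1] "a" "b" "c"
```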
Another way of doing the same, but with dplyr:
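The original snippet is not preserved here. A sketch of what a dplyr pipeline for this might look like, assuming dplyr is installed and using the df from the question:

```r
library(dplyr)

df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)

# filter() keeps the rows, then base R's droplevels() removes the unused levels
subdf <- df %>%
  filter(numbers <= 3) %>%
  droplevels()

levels(subdf$letters)
## [1] "a" "b" "c"
```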
Edit:
Also works! Thanks to agenis.