I have a data frame containing a factor. When I create a subset of this data frame using subset()
or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels -- even when they do not exist in the new data frame.
This creates headaches when doing faceted plotting or using functions that rely on factor levels.
What is the most succinct way to remove the unused levels from a factor in my new data frame?
Here's my example:
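## stringsAsFactors = TRUE is needed in R >= 4.0.0, where data.frame()
## no longer converts strings to factors by default
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)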
levels(df$letters)
## [1] "a" "b" "c" "d" "e"
subdf <- subset(df, numbers <= 3)
## letters numbers
## 1 a 1
## 2 b 2
## 3 c 3
## but the levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"
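To make the headache concrete, here is a small sketch that reuses the example above. table() is only one function that relies on factor levels, chosen here for illustration; the stringsAsFactors = TRUE argument is an assumption needed on R >= 4.0.0 so that the column is a factor at all.

```r
# Rebuild the example so this snippet runs on its own
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# table() counts every level, including the ones no longer present
counts <- table(subdf$letters)
print(counts)

# The unused levels "d" and "e" show up with a count of zero
names(counts)[counts == 0]
```

The same empty categories are what end up as empty panels or axis entries in a faceted plot.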
Here's another way, which I believe is equivalent to the factor(..) approach:

When I am working with a data.frame, I now use options(stringsAsFactors = FALSE) at the beginning of the script, so characters remain characters. Since then, I have not had any problems with factors :)

It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package, where your example becomes:

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)).

Since R version 2.12, there's a droplevels() function.

If you don't want this behaviour, don't use factors; use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv:

The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots.)
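A minimal sketch of the character-vector approach, assuming the data frame from the question. The names df_chr, sub_chr and by_number are made up for this illustration, and stringsAsFactors = FALSE is passed directly to data.frame() rather than set through options() (it is already the default from R 4.0.0 on). The last step shows reorder() building a non-alphabetical ordering for plotting.

```r
# Keep the letters as a plain character column
df_chr <- data.frame(letters = letters[1:5],
                     numbers = 1:5,
                     stringsAsFactors = FALSE)
sub_chr <- subset(df_chr, numbers <= 3)

# No levels to carry around: only the values actually present remain
is.character(sub_chr$letters)
sort(unique(sub_chr$letters))

# Converting to a factor at plot time gives alphabetical levels by default...
levels(factor(sub_chr$letters))

# ...while reorder() orders the levels by another variable
# (here by descending numbers, so the levels come out as c, b, a)
by_number <- reorder(sub_chr$letters, -sub_chr$numbers)
levels(by_number)
```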
Very interesting thread. I especially liked the idea of simply calling factor() on the subset again. I had a similar problem before, and I just converted to character and then back to factor.
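For comparison, here is a sketch of three of the approaches mentioned in this thread applied to the example from the question. The variable names dropped, refactored and round_trip are invented for this illustration, and stringsAsFactors = TRUE is an assumption needed on R >= 4.0.0 to reproduce the factor column.

```r
df <- data.frame(letters = letters[1:5],
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# 1. droplevels() (R >= 2.12) drops unused levels from every factor column
dropped <- droplevels(subdf)
levels(dropped$letters)

# 2. Calling factor() on the column again rebuilds the levels from the data
refactored <- factor(subdf$letters)
levels(refactored)

# 3. Round trip through character, one column at a time
round_trip <- as.factor(as.character(subdf$letters))
levels(round_trip)
```

All three leave only the levels "a", "b" and "c". droplevels() is the only one of them that handles a whole data frame in one call.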