Drop factor levels in a subsetted data frame

Posted 2018-12-31 00:56

I have a data frame containing a factor. When I create a subset of this data frame using subset() or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels -- even when they do not exist in the new data frame.

This creates headaches when doing faceted plotting or using functions that rely on factor levels.

What is the most succinct way to remove levels from a factor in my new data frame?

Here's my example:

df <- data.frame(letters=letters[1:5],
                 numbers=1:5,
                 stringsAsFactors=TRUE)  # needed in R >= 4.0, where the default is FALSE

levels(df$letters)
## [1] "a" "b" "c" "d" "e"

subdf <- subset(df, numbers <= 3)
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3    

## but the levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"

13 Answers
春风洒进眼中
#2 · 2018-12-31 01:34

Here's another way, which I believe is equivalent to the factor(..) approach:

> df <- data.frame(let=letters[1:5], num=1:5, stringsAsFactors=TRUE)
> subdf <- df[df$num <= 3, ]

> subdf$let <- subdf$let[ , drop=TRUE]

> levels(subdf$let)
[1] "a" "b" "c"
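For comparison, the factor(..) approach mentioned above amounts to re-creating the factor from the subset. A minimal sketch, reusing the let/num columns from this answer and assuming stringsAsFactors = TRUE is set explicitly (required in R 4.0 and later); the names via_factor and via_drop are only illustrative:

```r
# Assumption: factors are requested explicitly, since R >= 4.0 no longer
# converts strings to factors by default.
df <- data.frame(let = letters[1:5], num = 1:5, stringsAsFactors = TRUE)
subdf <- df[df$num <= 3, ]

# Re-creating the factor keeps only the levels that occur in the data.
via_factor <- factor(subdf$let)

# Subsetting the factor with drop = TRUE discards unused levels as well.
via_drop <- subdf$let[, drop = TRUE]

levels(via_factor)
levels(via_drop)
```

Both calls should leave only the levels that still appear in subdf.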
旧人旧事旧时光
#3 · 2018-12-31 01:34

When I work with a data.frame, I now set options(stringsAsFactors = FALSE) at the beginning of the script, so characters remain characters. Since then, I have not had any problems with factors :)

梦醉为红颜
#4 · 2018-12-31 01:35

It is a known issue, and one possible remedy is drop.levels() in the gdata package, with which your example becomes:

> drop.levels(subdf)
  letters numbers
1       a       1
2       b       2
3       c       3
> levels(drop.levels(subdf)$letters)
[1] "a" "b" "c"

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):

> levels(subdf$letters)
[1] "a" "b" "c" "d" "e"
> subdf$letters <- as.factor(as.character(subdf$letters))
> levels(subdf$letters)
[1] "a" "b" "c"
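If several columns are factors, the same per-column conversion can be applied across the whole data frame. A rough sketch using only base R, assuming the df/subdf example from the question and an explicit stringsAsFactors = TRUE:

```r
# Assumption: the question's example data, with factors requested explicitly.
df <- data.frame(letters = letters[1:5], numbers = 1:5, stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# Rebuild every factor column; leave all other columns untouched.
# Assigning into subdf[] keeps the result a data frame.
subdf[] <- lapply(subdf, function(col) {
  if (is.factor(col)) as.factor(as.character(col)) else col
})

levels(subdf$letters)
```

Note that going through as.character() re-sorts the levels, so any custom level order is lost; calling factor(col) instead should preserve the order of the remaining levels.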
初与友歌
#5 · 2018-12-31 01:38

Since R version 2.12, there's a droplevels() function.

levels(droplevels(subdf$letters))
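droplevels() also has a method for data frames, so the whole subset can be cleaned in one call. A short sketch with the question's example, assuming stringsAsFactors = TRUE is passed explicitly so that letters is a factor:

```r
# Assumption: the question's example data, with factors requested explicitly.
df <- data.frame(letters = letters[1:5], numbers = 1:5, stringsAsFactors = TRUE)

# Subset and drop unused levels from every factor column in one step.
subdf <- droplevels(subset(df, numbers <= 3))

levels(subdf$letters)
```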
孤独寂梦人
#6 · 2018-12-31 01:41

If you don't want this behaviour, don't use factors; use character vectors instead. I think this makes more sense than patching things up afterwards. Set the following before loading your data with read.table or read.csv:

options(stringsAsFactors = FALSE)

The disadvantage is that you're restricted to alphabetical ordering (reorder() is your friend for plots).
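As an illustration of the character-vector route, here is a small sketch that keeps the column as character while subsetting and only builds a factor when a specific order is needed. It passes stringsAsFactors = FALSE per call rather than through options(), and the ordering in plot_order is a made-up example:

```r
# Keep strings as character vectors (the default in R >= 4.0).
df <- data.frame(letters = letters[1:5], numbers = 1:5, stringsAsFactors = FALSE)
subdf <- subset(df, numbers <= 3)

# A character column carries no levels, so nothing is left over to drop.
is.character(subdf$letters)

# Hypothetical ordering, chosen only for illustration: build the factor
# at the point of use with exactly the levels that are wanted.
plot_order <- c("c", "a", "b")
subdf$letters <- factor(subdf$letters, levels = plot_order)

levels(subdf$letters)
```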

不流泪的眼
#7 · 2018-12-31 01:48

Very interesting thread. I especially liked the idea of simply calling factor() on the subselection again. I had a similar problem before, and I just converted to character and then back to factor.

   df <- data.frame(letters=letters[1:5], numbers=1:5, stringsAsFactors=TRUE)
   levels(df$letters)
   ## [1] "a" "b" "c" "d" "e"
   subdf <- df[df$numbers <= 3, ]   # the comma is required to select rows
   subdf$letters <- factor(as.character(subdf$letters))