
Cleaning up factor levels (collapsing multiple levels/labels)

Posted 2019-05-08 20:12

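What is the most effective (i.e. efficient/appropriate) way to clean up a factor that contains multiple levels that need to be collapsed? That is, how do you combine two or more factor levels into one?

Here is an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":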

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option is of course to clean the strings beforehand using sub and friends.

Another method is to allow duplicate labels, then drop them:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 
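
For reference, from R 3.4.0 onward factor() accepts duplicated labels (though not duplicated levels) and merges them directly, so the call above needs no wrapping there. A minimal sketch, assuming such an R version:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Assumes R >= 3.4.0: duplicated labels are merged into a single level.
# "H" is not listed in `levels`, so it becomes NA.
x.f <- factor(x,
              levels = c("Y", "Yes", "No", "N"),
              labels = c("Yes", "Yes", "No", "No"))

levels(x.f)
```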

However, is there a more effective way?


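Although I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what would happen. Needless to say, none of the following got me any closer to my goal.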

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Answer 1:

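Use the levels function and pass it a named list, where the names are the desired names of the levels and the elements are the current names that should be renamed.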

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
levels(x) <- list(Yes=c("Y", "Yes"), No=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>
## Levels: Yes No

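This is mentioned in the levels documentation; see also the examples there.

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673 ; the levels<- sorcery is explained here: https://stackoverflow.com/a/10491881/210673 .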

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
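
One consequence of the named-list form is that the order of the list names sets the order of the resulting levels, and any level not mentioned in the list becomes NA. A small sketch of that behaviour (the variable name x2 is only for illustration):

```r
x2 <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Listing No first makes it the first level; "H" is not listed, so it turns into NA
levels(x2) <- list(No = c("N", "No"), Yes = c("Y", "Yes"))

levels(x2)
```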


Answer 2:

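Since this question is titled Cleaning up factor levels (collapsing multiple levels/labels), the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

It provides several convenience functions for cleaning up factor levels: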

x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)

Collapse factor levels into manually defined groups

fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Change factor levels by hand

fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Automatically relabel factor levels, collapsing as needed

fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

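Note that fct_relabel() works with factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is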

[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

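Here, the levels are ordered as they appear in x, which differs from the default (?factor: the levels of a factor are by default sorted).

To match the expected output, use fct_inorder() before collapsing the levels: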

fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")

Both now return the expected output, with the levels in the same order.



Answer 3:

Perhaps a named vector used as a key might be of use:

> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes

This looks very similar to your last attempt... but this one works :-)
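
If the level order from the question (Yes before No) matters, the same lookup can be combined with an explicit levels argument. A minimal sketch; the names key and res are only for illustration:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Named vector used as a lookup table; "H" has no entry, so it maps to NA
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")

# An explicit `levels` fixes the level order instead of the sorted default
res <- factor(unname(key[x]), levels = c("Yes", "No"))
res
```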



Answer 4:

Another way is to make a table containing the mapping:

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

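I like this way because it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.


Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":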

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes


Answer 5:

I don't know your real use case, but would strtrim be of any use here...

factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: Yes No


Answer 6:

Similar to @Aaron's approach, but slightly simpler would be:

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)  
# [1] "H"   "N"   "No"  "Y"   "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes


Answer 7:

I add this answer to demonstrate the accepted answer working on a specific factor in a dataframe, since this was not initially obvious to me (though it probably should have been).

levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
#    0    1    Z 
# 7012 2507    8 
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
#    0    1 
# 7020 2507
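
The df above is not shown in full, so here is a small reproducible sketch of the same idea on made-up data (the five values below are purely hypothetical):

```r
# Hypothetical toy data standing in for the real data frame
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "1")))

# Fold the stray "Z" level into "0", leaving "1" as it is
levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")

table(df$var1)
```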


Answer 8:

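You can use the function below to combine/collapse multiple factor levels:

combofactor <- function(pattern_vector,
                        replacement_vector,
                        data) {
  levels <- levels(data)
  # Overwrite each level that matches a pattern with its replacement
  for (i in seq_along(pattern_vector))
    levels[which(pattern_vector[i] == levels)] <-
      replacement_vector[i]
  # Assigning duplicated level names merges those levels
  levels(data) <- levels
  data
}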

Example:

Initialize x

x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))

Check the structure

str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...

Use the function:

x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)

Check the structure again:

str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
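
The loop can also be written with match(), which looks up all levels at once. This is only a sketch of an alternative, not the function above; collapse_levels and its argument names are made up for illustration:

```r
# Hypothetical helper: map each level found in `from` to the matching entry of `to`
collapse_levels <- function(f, from, to) {
  lv <- levels(f)
  idx <- match(lv, from)            # position of each level in `from`, NA if absent
  hit <- !is.na(idx)
  lv[hit] <- to[idx[hit]]           # replace only the levels that were found
  levels(f) <- lv                   # duplicated names are merged into one level
  f
}

x_demo <- factor(c("Y", "N", "y", "yes", "Yes", "No"))
x_out <- collapse_levels(x_demo,
                         from = c("Y", "N", "y", "yes"),
                         to   = c("Yes", "No", "Yes", "Yes"))
levels(x_out)
```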


Answer 9:

First, note that in this particular case we can use partial matching:

x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

In the more general case, I would go with dplyr::recode

library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

Slightly different if the starting point is a factor:

x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No


Source: Cleaning up factor levels (collapsing multiple levels/labels)
Tags: r factors r-faq