R: factor levels, recode rest to 'other'

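I use factors fairly infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am collapsing categories with few observations into "other" and am looking for a quick way to do that. I have a variable with perhaps 20 levels, but I am interested in collapsing a bunch of them into one.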

data <- data.frame(employees = sample.int(1000,500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340','621391','621399','621410','621420','621491','621492','621493','621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace=T))

Here are the levels I am interested in, with their labels in a separate vector.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

I could use a factor() call and enumerate all the levels, mapping each category that has few observations to "other".

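Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are coded correctly and everything else is recoded as other?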

Answer 1:

I think the easiest way is to relabel all the NAICS codes that are not in the top 8 with a special value.

data$naics[!(data$naics %in% top8)] = -99

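Then, when you turn it into a factor, use the "exclude" option so that the special value is not kept as a level (those rows become NA):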

factor(data$naics, exclude=-99)
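If the goal is an explicit "other" level with the descriptive labels, rather than NA, a base R variant along these lines may work. This is only a sketch on a small made-up vector; the names top_codes, top_desc, and naics_f are illustrative and not part of the question's data.

```r
# Illustrative data: two "top" codes with labels, plus some other codes
top_codes <- c('621111', '621210')
top_desc  <- c('Offices of physicians', 'Offices of dentists')
naics     <- c('621111', '621112', '621210', '621999', '621111')

# as.character() guards against naics already being a factor,
# in which case ifelse() would return integer codes instead of labels
naics_chr <- as.character(naics)
recoded   <- ifelse(naics_chr %in% top_codes, naics_chr, 'other')

# Declare the levels explicitly so that 'other' comes last,
# and attach the descriptions as labels in the same order
naics_f <- factor(recoded,
                  levels = c(top_codes, 'other'),
                  labels = c(top_desc, 'other'))
table(naics_f)
```

Listing the levels explicitly keeps the level order stable even when one of the top codes happens not to occur in the data.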

Answer 2:

You can use forcats::fct_other():

library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')

Or use fct_other() as part of dplyr::mutate():

library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 

data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other

Note that if the argument other_level is not set, the collapsed level defaults to "Other" (with an uppercase "O").
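A quick way to see the difference is to compare the levels with and without other_level. The sketch below uses a small made-up factor (codes is an illustrative name) and assumes the forcats package is installed.

```r
library(forcats)

# Illustrative factor with four codes, two of which we want to keep
codes <- factor(c('621111', '621112', '621210', '621999'))

# Without other_level, the collapsed level is named "Other"
default_other <- fct_other(codes, keep = c('621111', '621210'))
levels(default_other)

# With other_level, the collapsed level gets the name we supply
custom_other <- fct_other(codes, keep = c('621111', '621210'),
                          other_level = 'other')
levels(custom_other)
```

In both cases the collapsed level is placed after the kept levels.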

Conversely, if there are only a few levels that you want to convert to "other", you can use the argument drop instead:

data %>%  
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
  head(10)

   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910

dplyr also has recode_factor(), where you can set the .default argument to "other", but with a larger number of levels to recode, as in this example, it can be tedious:

data %>% 
  mutate(naics = recode_factor(naics,
    `621111` = '621111', `621210` = '621210', `621399` = '621399',
    `621610` = '621610', `621330` = '621330', `621310` = '621310',
    `621511` = '621511', `621420` = '621420', `621320` = '621320',
    .default = 'other'))

Answer 3:

A late entry.

Here is a wrapper for plyr::mapvalues that allows a remaining argument (your other):

library(plyr)

Mapvalues <- function(x, from, to, warn_missing= TRUE, remaining = NULL){
  if(!is.null(remaining)){
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)
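For readers who would rather not load plyr, roughly the same idea can be sketched in base R with match(). The function name map_with_rest is hypothetical, and this version always returns a character vector, which differs from how mapvalues treats factors.

```r
# Hypothetical base R helper: map `from` values to `to`, and optionally
# replace everything else (including NA) with `remaining`
map_with_rest <- function(x, from, to, remaining = NULL) {
  x <- as.character(x)
  idx <- match(x, from)          # position of each value in `from`, NA if absent
  out <- as.character(to)[idx]   # look up the replacement
  unmatched <- is.na(idx)
  # Unmatched values are either left alone or set to `remaining`
  out[unmatched] <- if (is.null(remaining)) x[unmatched] else remaining
  out
}

codes <- c('621111', '621112', '621210')
map_with_rest(codes, c('621111', '621210'),
              c('Offices of physicians', 'Offices of dentists'),
              remaining = 'other')
```

Wrapping the result in factor() afterwards gives a factor whose levels can be ordered as needed.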

Answer 4:

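I have written a function to do this, which may be useful to others. It first applies a relative check: a level is collapsed if it occurs in no more than a share mp of the rows (mp = 0.01 means 1%). After that, it limits the number of levels that are kept to at most ml.

ds is the data set at hand, of type data.frame. I do this for all the factor columns, whose names are collected in cat_var_names.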

cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # collapse infrequent factor levels into "other"
  n <- nrow(ds)
  # pass 1: keep levels that account for more than a share mp of the rows
  for (i in var_list){
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }

  # pass 2: keep at most the ml most frequent levels
  for (i in var_list){
    keep <- head(names(sort(table(ds[[i]]), decreasing=TRUE)), ml)
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }
  return(ds)
}
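The core step in both loops is assigning the same name to several levels at once, which merges them. A minimal sketch of that step on a single made-up factor (x, mp, and the counts are illustrative only):

```r
# Illustrative factor: two common levels and two rare ones
x <- factor(c(rep('a', 50), rep('b', 45), rep('c', 4), 'd'))
mp <- 0.05   # keep levels covering more than 5% of the observations

counts <- table(x)
keep <- names(counts)[counts > mp * length(x)]

# Giving several levels the same name merges them into one level
levels(x)[!levels(x) %in% keep] <- 'other'
table(x)
```

Here 'c' and 'd' fall below the threshold, so their five observations end up in a single 'other' level.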

Source: R: factor levels, recode rest to 'other'