How to finish code to replace NA with median in R

Posted 2019-08-26 07:24

I am very new to R, so please please be gentle.

I am working on the Kaggle Titanic competition, to get me into R and working things out.

I am working my way through engineering a feature and I am a bit stuck with the logic of what to do next.

So, here goes. My goal is to take the Age data and replace each NA with the median age for that person's title. For example, if the person is a Master, I want to get the median age of all the Masters and replace the NA with that median. Same for Mr. and so on.

I have managed to create myself a data.frame containing title and age as follows:

library(tibble)
data.combined <-
  tibble(
    data.combined.new.title = c(
      "Mr.",
      "Mrs.",
      "Miss",
      "Mrs.",
      "Mr.",
      "Mr.",
      "Mr.",
      "Master",
      "Mrs."
    ),
    data.combined.Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
  )


As you can see in this list, there is a Mr. with an NA next to his age. I want to replace that NA with the median age of all the other Mr. entries in the list.

So far I have the following code, which gets me to the point where I can replace the NAs with the median of the whole data set.

# Creates my data.frame (the sample tibble above already holds the title and age columns)
agedata <- as.data.frame(data.combined)

# replace NA with the median of the whole data set
agedata$data.combined.Age[is.na(agedata$data.combined.Age)] <- median(agedata$data.combined.Age, na.rm = TRUE)

What I just don't get is how would I add to this code to replace the NA by the median of the groups of title, Mr, Master, Mrs, Miss?

Any pointers are gratefully received.

I'm not too interested in whether this is going to help with my prediction for Kaggle at this point, more with how the code should look.

Many Thanks in Advance.

4 Answers
Explosion°爆炸
#2 · 2019-08-26 08:06

Or maybe this tidyverse one-liner (assuming the columns are named title and age):

library(dplyr)
agedata %>% group_by(title) %>% mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
淡お忘
#3 · 2019-08-26 08:20
zz <- "group traits
BSPy01-10     NA
BSPy01-10    7.3
BSPy01-10    7.3
BSPy01-11    5.3
BSPy01-11    5.4
BSPy01-11    5.6
BSPy01-11     NA
BSPy01-11     NA
BSPy01-11    4.8
BSPy01-12    8.1
BSPy01-12    6.0
BSPy01-12    6.0
BSPy01-13    6.1"
Data <- read.table(text=zz, header = TRUE)

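library(plyr)

# replace the missing values of x with fun() applied to the non-missing ones
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}
# impute within each group
ddply(Data, ~ group, transform, traits = impute(traits, median))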
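ddply() comes from the plyr package, which is no longer actively developed. In case it is not installed, here is a rough sketch of the same impute() helper applied with base R only, using the sample data from the question. The data frame name df and the column names title and age are assumptions made for illustration.

```r
# Helper from the answer above: fill the NAs in x with fun() of the non-missing values
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}

# Sample data from the question (names df, title and age are assumed here)
df <- data.frame(
  title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# split() the ages by title, impute within each group,
# then unsplit() puts the values back in the original row order
by_title <- split(df$age, df$title)
imputed  <- lapply(by_title, function(x) impute(x, median))
df$age   <- unsplit(imputed, df$title)

df$age[6]  # the Mr. that was NA now holds the median of 22, 35 and 54
```

If a group has no observed ages at all, median() of an empty vector returns NA, so those rows would stay missing.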
在下西门庆
#4 · 2019-08-26 08:21

This is probably not the most elegant way to do it but it works:

title <- c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs")
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27)
df = data.frame(title, age)

# get the medians by groups
medians = aggregate(df$age, list(df$title), median, na.rm = TRUE)
# look up each missing age's title in the medians table (match() also handles several NAs across different titles)
df$age[is.na(df$age)] <- medians$x[match(df$title[is.na(df$age)], medians$Group.1)]
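A more compact base R variant of the same idea is possible with ave(), which computes a statistic per group and returns it aligned with the original rows, so no separate lookup table is needed. This is only a sketch on the same sample data; the helper names group_median and missing are made up for illustration.

```r
# Same sample data as above
df <- data.frame(
  title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# ave() applies FUN within each title group and returns one value per row,
# in the original row order
group_median <- ave(df$age, df$title, FUN = function(x) median(x, na.rm = TRUE))

# overwrite only the rows where age is missing
missing <- is.na(df$age)
df$age[missing] <- group_median[missing]
```

Because group_median has one entry per row, this works unchanged when several titles have missing ages.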
疯言疯语
#5 · 2019-08-26 08:31

library(data.table)

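dt <- data.table(title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
                 age = c(22, 38, 26, 35, 35, NA, 54, 2, 27))

# add the per-title median as a helper column, fill the missing ages, then drop the helper
dt[, median_age := median(age, na.rm = TRUE), by = "title"]
dt[is.na(age), age := median_age]
dt[, median_age := NULL]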