How to finish code to replace NA with median in R

I am very new to R, so please please be gentle.

I am working on the Kaggle Titanic competition, to get me into R and working things out.

I am working my way through engineering a feature and I am a bit stuck with the logic of what to do next.

So, here goes. My goal is to take the Age data and replace each NA with the median age for that person's title. For example, if the person is a Master, I want to take the median age of all the Masters and replace the NA with that median. Same for Mr. and so on.

I have managed to create myself a data.frame containing title and age as follows:

library(tibble)
data.combined <-
  tibble(
    data.combined.new.title = c(
      "Mr.",
      "Mrs.",
      "Miss",
      "Mrs.",
      "Mr.",
      "Mr.",
      "Mr.",
      "Master",
      "Mrs."
    ),
    data.combined.Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
  )

As you can see in this list, there is a Mr. with an NA next to his age. I want to replace that NA with the median age of all the other Mr. entries in the list.

So far I have the following code, up to the point where I can replace the NAs with the median of the whole data set.

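#Creates my data.frame
#(on the full Titanic data: agedata <- data.frame(data.combined$new.title, data.combined$Age);
# the sample tibble above already carries the resulting column names, so it is converted directly)
agedata <- as.data.frame(data.combined)

#replace NA with the median of the whole data set
agedata$data.combined.Age[is.na(agedata$data.combined.Age)] <- median(agedata$data.combined.Age, na.rm = TRUE)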

What I just don't get is how to extend this code so that each NA is replaced by the median of its own title group (Mr., Master, Mrs., Miss).

Any pointers are gratefully received.

I'm not too interested in whether this is going to help with my prediction for Kaggle at this point, more with how the code should look.

Many Thanks in Advance.

Tags: r replace na median kaggle

4 Answers

Explosion°爆炸

#2 · 2019-08-26 08:06

Or maybe this tidyverse one-liner (assuming agedata has columns named title and age):

library(dplyr)
agedata %>% group_by(title) %>% mutate(age=ifelse(is.na(age), median(age, na.rm=TRUE), age))
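For reference, here is a sketch of the same idea applied to the column names from the question, with the result assigned back. The sample values are copied from the question; the final ungroup() is an addition of this sketch, not part of the one-liner above.

```r
library(dplyr)

# Sample data with the column names used in the question
agedata <- data.frame(
  data.combined.new.title = c("Mr.", "Mrs.", "Miss", "Mrs.", "Mr.",
                              "Mr.", "Mr.", "Master", "Mrs."),
  data.combined.Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# Within each title group, replace NA ages with that group's median
agedata <- agedata %>%
  group_by(data.combined.new.title) %>%
  mutate(data.combined.Age = ifelse(is.na(data.combined.Age),
                                    median(data.combined.Age, na.rm = TRUE),
                                    data.combined.Age)) %>%
  ungroup()  # drop the grouping so later steps see a plain table
```

The pipeline returns a new table rather than modifying agedata in place, so without the assignment the imputed values are only printed.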


淡お忘

#3 · 2019-08-26 08:20

zz <- "group traits
BSPy01-10     NA
BSPy01-10    7.3
BSPy01-10    7.3
BSPy01-11    5.3
BSPy01-11    5.4
BSPy01-11    5.6
BSPy01-11     NA
BSPy01-11     NA
BSPy01-11    4.8
BSPy01-12    8.1
BSPy01-12    6.0
BSPy01-12    6.0
BSPy01-13    6.1"
Data <- read.table(text=zz, header = TRUE)

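# Replace the missing values of x with fun() applied to the non-missing values
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}

# ddply() comes from the plyr package
library(plyr)
ddply(Data, ~ group, transform, traits = impute(traits, median))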
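To see what the impute helper does on its own, here is a small sketch that runs it on a single group without plyr. The vector x is just the BSPy01-11 rows from the sample data above, and the name filled is made up for this illustration.

```r
# Same helper as above: fill missing values with fun() of the observed ones
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}

# The trait values of group BSPy01-11 from the sample data
x <- c(5.3, 5.4, 5.6, NA, NA, 4.8)

# Both NAs become the median of the four observed values (5.35)
filled <- impute(x, median)
print(filled)
```

ddply then simply applies this to each group in turn, so every NA is filled from its own group rather than from the whole column.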


在下西门庆

#4 · 2019-08-26 08:21

This is probably not the most elegant way to do it, but it works:

title <- c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs")
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27)
df = data.frame(title, age)

# get the medians by groups
medians = aggregate(df$age, list(df$title), median, na.rm = TRUE)
# look up each missing age's group median by title (match() also handles several NAs across different titles)
missing <- is.na(df$age)
df$age[missing] <- medians$x[match(df$title[missing], medians$Group.1)]


疯言疯语

#5 · 2019-08-26 08:31

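library(data.table)

dt <- data.table(title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
                 age = c(22, 38, 26, 35, 35, NA, 54, 2, 27))

# helper column: median age per title, ignoring NAs
dt[, median_age := median(age, na.rm = TRUE), by = "title"]
# fill the missing ages from the helper column
dt[is.na(age), age := median_age]
# drop the helper column again
dt[, median_age := NULL]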
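If you would rather stay in base R and close to the code in the question, a possible sketch uses ave(), which applies a function within each group and returns a vector aligned with the original rows. The column names title and age and the helper names group_median and missing are only illustrative.

```r
# Sample data with the same values as in the question
agedata <- data.frame(
  title = c("Mr.", "Mrs.", "Miss", "Mrs.", "Mr.", "Mr.", "Mr.", "Master", "Mrs."),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# Median age of each row's title group, computed while ignoring NAs
group_median <- ave(agedata$age, agedata$title,
                    FUN = function(x) median(x, na.rm = TRUE))

# Same replacement pattern as the whole-data version in the question,
# but each NA now receives the median of its own title group
missing <- is.na(agedata$age)
agedata$age[missing] <- group_median[missing]
```

With the sample data, the one missing Mr. age is filled with 35, the median of 22, 35 and 54.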
