Example data:
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))
head(df)
  years months      value
1  2005      1 -0.6264538
2  2005      2  0.1836433
3  2005      3 -0.8356286
4  2005      4  1.5952808
5  2005      5  0.3295078
6  2005      6 -0.8204684
How can I replace the NA entries in df$value with the median for the same month? Each replaced "value" must be the median of all previous values for that month. That is, if the current month is May, "value" must contain the median of all previous values for the month of May.
You want to use the is.na function to test for missing values. Indexing df$value with that test on the left-hand side of an assignment says: for all the values where df$value is NA, replace them with the right-hand side. You need the na.rm=TRUE piece, or else the median function will return NA.
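The code for this step did not survive in the thread, so here is a minimal base R sketch of the idea described above. The toy vector and its values are made up for illustration; it stands in for df$value.

```r
# Toy vector standing in for df$value (values are invented for illustration)
value <- c(2.5, NA, 1.0, 4.0, NA)

# Without na.rm = TRUE, median() returns NA as soon as any value is missing
median(value)

# With na.rm = TRUE, the missing values are dropped before the median is taken
overall <- median(value, na.rm = TRUE)

# Logical indexing on the left-hand side touches only the missing entries
value[is.na(value)] <- overall
value
```

This fills every NA with one overall median; it does not yet distinguish between months.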
To do this month by month, there are many choices, but I think plyr has the simplest syntax.
You can also use data.table. This is an especially good choice if your data is large.
There are many other ways, but those are two!
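The original snippets for these two approaches are missing from the thread. As a hedged sketch of what a per-month fill could look like with each package, assuming plyr and data.table are installed, and using a hypothetical helper named fill_median that is not from the original answer:

```r
library(plyr)
library(data.table)

set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# Hypothetical helper: replace the NAs in x with the median of the observed values
fill_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# plyr: split by month, transform each piece, recombine.
# Note that ddply returns the rows grouped by month, not in the original order.
df_plyr <- ddply(df, .(months), transform, value = fill_median(value))

# data.table: update the column by reference within each month group
dt <- as.data.table(df)
dt[, value := fill_median(value), by = months]
```

Both variants use the median over all observed years for a month. In this example data only the last year is missing, so that coincides with "all previous values".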
Here's the most robust solution I can think of. It ensures the years are ordered correctly and will correctly compute the median for all previous months in cases where you have multiple years with missing values.
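The answer's own code is not preserved here. The following is only a sketch of one way to get that behaviour in base R, under two stated assumptions: rows are sorted chronologically first, and imputed values are not fed back into later medians. The helper name fill_from_previous is hypothetical.

```r
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# Make sure rows are in chronological order before looking "backwards"
df_sorted <- df[order(df$years, df$months), ]

# Hypothetical helper: fill each NA with the median of the earlier observed values
fill_from_previous <- function(x) {
  observed <- x  # assumption: imputed values are not reused for later medians
  for (i in which(is.na(x))) {
    x[i] <- median(observed[seq_len(i - 1)], na.rm = TRUE)
  }
  x
}

# Apply the helper to each month's series, then put the pieces back in row order
by_month <- split(df_sorted$value, df_sorted$months)
df_sorted$value <- unsplit(lapply(by_month, fill_from_previous), df_sorted$months)
```

If a month has no earlier observed value at all, the entry stays NA, because there is nothing to take a median of.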
Or with ave
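The ave call itself is missing from the thread. A minimal sketch of a per-month fill with ave, which applies a function to each group and returns the results in the original row order, might look like this (the variable name filled is an assumption for illustration):

```r
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# For each month, replace the NAs with that month's median of the observed values
filled <- ave(df$value, df$months,
              FUN = function(x) replace(x, is.na(x), median(x, na.rm = TRUE)))

tail(filled, 12)
```

FUN has to be passed by name, because any unnamed arguments after the first are treated as grouping variables.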
Since there are so many answers let's see which is fastest.
I would have bet that data.table was the fastest.
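The benchmark code and its results are not preserved in the thread. As a rough sketch of how such a comparison could be timed in base R, the example below uses a larger made-up data set and takes the fastest of three runs. The functions fill_ave and fill_loop and the helper time_best_of are hypothetical names, and only base R approaches are compared here.

```r
set.seed(1)
n_years <- 20000
big <- data.frame(months = rep(1:12, times = n_years),
                  value = rnorm(12 * n_years))
big$value[sample(nrow(big), 1000)] <- NA  # knock out some values at random

# Candidate 1: grouped fill with ave
fill_ave <- function(d) {
  ave(d$value, d$months,
      FUN = function(x) replace(x, is.na(x), median(x, na.rm = TRUE)))
}

# Candidate 2: explicit loop over the months
fill_loop <- function(d) {
  v <- d$value
  for (m in unique(d$months)) {
    idx <- d$months == m
    v[idx & is.na(v)] <- median(v[idx], na.rm = TRUE)
  }
  v
}

# Elapsed time of the fastest of `runs` runs
time_best_of <- function(f, runs = 3) {
  min(replicate(runs, system.time(f(big))[["elapsed"]]))
}

timings <- c(ave = time_best_of(fill_ave), loop = time_best_of(fill_loop))
timings
```

The numbers depend on the machine and the data size, so no expected output is shown.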
[Matthew Dowle] The task being timed here takes at most 0.02 seconds (2.075/100). data.table considers that insignificant. Try setting replications to 1 and increasing the data size instead. Timing the fastest of 3 runs is also a common rule of thumb. More verbose discussion in these links:
This is a way using plyr. It is not very pretty, but I think it does what you want:
Sticking with base R, you can also try the following:
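The base R code from this answer is not preserved. One possible base R approach, offered only as a sketch, is to compute a lookup table of per-month medians with tapply and index into it by month. The names month_median and missing_rows are assumptions for illustration.

```r
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# Lookup table: one median per month, named "1" to "12"
month_median <- tapply(df$value, df$months, median, na.rm = TRUE)

# Fill each missing row with the median belonging to its month
missing_rows <- is.na(df$value)
df$value[missing_rows] <- month_median[as.character(df$months[missing_rows])]

tail(df, 12)
```

Indexing by as.character(months) looks the medians up by name, so it keeps working even if some month is absent from the data.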
There is another way to do this with dplyr.
If you want to replace the NAs in all columns with each column's median, do:
If you want to replace them in a subset of columns (such as "value" in OP's example), do:
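The dplyr snippets are missing from the thread. The sketch below shows both cases using mutate with across, which assumes dplyr 1.0.0 or later; the original answer may well have used older verbs. The helper fill_median is a hypothetical name.

```r
library(dplyr)

set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# Hypothetical helper: replace the NAs in x with the median of the observed values
fill_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)

# All columns: each column's NAs are filled with that column's own median
all_cols <- df %>% mutate(across(everything(), fill_median))

# Subset of columns: only "value" is touched
only_value <- df %>% mutate(across(c(value), fill_median))
```

Both calls use the overall column median. For a per-month fill, adding group_by(months) before the mutate step would restrict each median to its month.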