I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:
houseID year price
1 1995 NA
1 1996 100
1 1997 NA
1 1998 120
1 1999 NA
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 NA
3 1995 NA
3 1996 44
3 1997 NA
3 1998 NA
3 1999 NA
I would like to make a data frame like this:
houseID year price
1 1995 NA
1 1996 100
1 1997 100
1 1998 120
1 1999 120
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 30
3 1995 NA
3 1996 44
3 1997 44
3 1998 44
3 1999 44
Here are some data in the right format:
# Number of houses
N = 15
# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))
Is there a dplyr-way to do that?
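You can do a rolling self-join, supported by data.table: drop the rows with NA prices, then join the full table back onto that subset on houseID and year with roll = TRUE, so each year picks up the most recent known price.
These all use na.locf from the zoo package:
dplyr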
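A minimal sketch of the dplyr version, assuming the example data from the question (rebuilt here as df so the snippet runs on its own). na.rm = FALSE keeps the leading NAs instead of dropping them:

```r
library(dplyr)
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Carry the last observed price forward within each house
filled <- df %>%
  group_by(houseID) %>%
  mutate(price = na.locf(price, na.rm = FALSE)) %>%
  ungroup()

print(filled)
```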
giving the filled data frame shown in the question.
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
by
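A rough sketch of the by approach, plus the piped variant mentioned above, assuming the question's example data rebuilt as df. by() splits the data frame per house, na.locf fills each piece, and rbind stitches them back together. Note that do() is superseded in current dplyr but, as far as I know, still available:

```r
library(dplyr)
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Plain by(): fill each house's rows, then bind the pieces together.
# na.rm = FALSE keeps rows whose price is still NA after filling.
filled_by <- do.call(rbind, by(df, df$houseID, na.locf, na.rm = FALSE))

# The same idea wrapped in a dplyr pipeline
filled_do <- df %>%
  do(do.call(rbind, by(., .$houseID, na.locf, na.rm = FALSE)))
```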
ave
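A sketch of the ave version, again assuming the question's example data rebuilt as df. ave() applies the function to price within each houseID and returns a vector in the original row order:

```r
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Fill price per house; na.rm = FALSE keeps the vector length unchanged
filled <- transform(
  df,
  price = ave(price, houseID, FUN = function(x) na.locf(x, na.rm = FALSE))
)
```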
data.table
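A sketch of the data.table version with na.locf, assuming the question's example data rebuilt as df. This is the grouped-fill variant, not the rolling join; the price column is updated by reference within each house:

```r
library(data.table)
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

dt <- as.data.table(df)

# Overwrite price in place, one house at a time
dt[, price := na.locf(price, na.rm = FALSE), by = houseID]

print(dt)
```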
zoo
This solution uses zoo alone. It returns a wide rather than long result:
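A sketch of the zoo-only route, assuming the question's example data rebuilt as df. read.zoo uses column 2 (year) as the time index and splits on column 1 (houseID), so each house becomes its own column:

```r
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Wide zoo object: rows indexed by year, one column per house
z <- read.zoo(df, index.column = 2, split = 1, FUN = identity)

# Fill each column forward, keeping leading NAs
z_filled <- na.locf(z, na.rm = FALSE)

print(z_filled)
```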
giving a wide result with one row per year and one column per houseID.
This solution could be combined with dplyr like this:
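A sketch of the piped form, assuming the question's example data rebuilt as df. dplyr only contributes the pipe operator here; the work is still done by zoo:

```r
library(dplyr)
library(zoo)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Same zoo steps as a pipeline: reshape to wide, then fill forward
z_filled <- df %>%
  read.zoo(index.column = 2, split = 1, FUN = identity) %>%
  na.locf(na.rm = FALSE)
```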
input
Here is the input used for the examples above:
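The input can be rebuilt from the table in the question. This is a reconstruction, and the name Lines is just a convenience:

```r
# Example table from the question, read into a data frame
Lines <- "houseID year price
1 1995 NA
1 1996 100
1 1997 NA
1 1998 120
1 1999 NA
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 NA
3 1995 NA
3 1996 44
3 1997 NA
3 1998 NA
3 1999 NA"

df <- read.table(text = Lines, header = TRUE)
str(df)
```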
REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr.
tidyr::fill now makes this stupidly easy: group by houseID and call fill(price).
Pure dplyr solution (no zoo).
The interesting part of the example solution is at the end of df2.
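One way a pure dplyr version could look, as a sketch and not necessarily the original code. It assumes the question's example data rebuilt as df, and the helper column block is a made-up name. cumsum(!is.na(price)) increases by one at every known price, so each known price and the NAs that follow it share a block number:

```r
library(dplyr)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

df2 <- df %>%
  arrange(houseID, year) %>%
  group_by(houseID) %>%
  # block 0 = years before the first known price; then 1, 2, ... per known price
  mutate(block = cumsum(!is.na(price))) %>%
  group_by(houseID, block) %>%
  # the first row of each block holds the known price (or NA for block 0)
  mutate(price = first(price)) %>%
  ungroup() %>%
  select(-block)
```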
Here is a dplyr and imputeTS combination.
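A sketch of that combination, assuming the question's example data rebuilt as df and imputeTS 3.0 or later, where the function is spelled na_locf (older releases used na.locf). na_remaining = "keep" leaves the leading NAs alone:

```r
library(dplyr)
library(imputeTS)

# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Last observation carried forward per house, leading NAs kept as NA
filled <- df %>%
  group_by(houseID) %>%
  mutate(price = na_locf(price, option = "locf", na_remaining = "keep")) %>%
  ungroup()
```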
You could also replace na.locf with one of the more advanced missing data replacement (imputation) functions from imputeTS, for example na.interpolation or na.kalman (named na_interpolation and na_kalman in imputeTS 3.0 and later). To do this, just replace na.locf with the name of the function you like.
Without dplyr:
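One possible way to do it in base R with no packages at all, as a sketch assuming the question's example data rebuilt as df and rows already sorted by year within each house. The helper locf is a made-up name:

```r
# Example data from the question, rebuilt inline
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Hypothetical helper: last observation carried forward, leading NAs kept
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many known values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is the NA used before any known value
}

# Apply the helper within each house
df$price <- ave(df$price, df$houseID, FUN = locf)
```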