I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:
> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
var1 var2 var3
1 a right.a correct.a
2 a <NA> correct.a
3 b right.b correct.b
4 b <NA> correct.b
5 c right.c correct.c
6 c <NA> correct.c
7 d right.d correct.d
8 d <NA> correct.d
"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:
var1 var2 var3
1 a right.a correct.a
2 b right.b correct.b
3 c right.c correct.c
4 d right.d correct.d
However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:
> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
var1 var2 var3
1 e <NA> correct.e
2 e <NA> correct.e
3 e right.e <NA>
4 f <NA> correct.f
5 f <NA> <NA>
6 g <NA> correct.g
7 g right.g wrong.g
8 g <NA> <NA>
And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.
What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?
Thanks!
Here is a method using
data.table
and
The idea in
var2[!is.na(var2)][1]
for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.If you have more than two variables, you might switch to
lapply
. For example, the following.In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.
The data from the comment,
Then, with this data,
Here, all non-missing var2 observations by var1 are merged onto the original data set.
If
var2
andvar3
have only one unique value for each level ofvar1
, then: