Aggregating based on “near” row values

2019-07-14 18:45发布

I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g   wrong.g
8    g    <NA>      <NA>

And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.

What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?

Thanks!

2条回答
ゆ 、 Hurt°
2楼-- · 2019-07-14 18:59

Here is a method using data.table

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

and

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

The idea in var2[!is.na(var2)][1] for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.

If you have more than two variables, you might switch to lapply. For example, the following.

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.

The data from the comment,

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Then, with this data,

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

Here, all non-missing var2 observations by var1 are merged onto the original data set.

查看更多
三岁会撩人
3楼-- · 2019-07-14 19:16

If var2 and var3 have only one unique value for each level of var1, then:

library(dplyr)

df = rbind(df1,df2)

df %>% group_by(var1) %>%
  summarise_all(funs(.[!is.na(.)][1]))
   var1    var2      var3
1     a right.a correct.a
2     b right.b correct.b
3     c right.c correct.c
4     d right.d correct.d
5     e right.e correct.e
6     f    <NA> correct.f
7     g right.g correct.g
查看更多
登录 后发表回答