How to filter data in R?

Posted 2019-02-18 09:36

Question:

I have huge data sets, each with millions of rows and some peculiar attributes. I need to filter the data while retaining its other properties.

My data looks like the following:

      ID   Prop1   Prop2   TotalProp
56891940     G02     G02           2
56892558     A61     G02           4
56892558     A61     A61           4
56892558     G02     A61           4
56892558     A61     A61           4
56892552     B61     B61           3
56892552     B61     B61           3
56892552     B61     A61           3
56892559     B61     G61           3
56892559     B61     B61           3
56892559     B61     B61           3
(and so on, for more than a million rows)

What I want: remove all rows for an ID when "Prop1" and "Prop2" are the same in every row of that ID (for example, ID 56891940). IDs such as 56892558, 56892552 and 56892559 should be kept in full, because although some of their rows have identical properties, at least one row has different ones.

My final output should look like:

      ID   Prop1   Prop2   TotalProp
56892558     A61     G02           4
56892558     A61     A61           4
56892558     G02     A61           4
56892558     A61     A61           4
56892552     B61     B61           3
56892552     B61     B61           3
56892552     B61     A61           3    
56892559     B61     G61           3
56892559     B61     B61           3
56892559     B61     B61           3

Answer 1:

You may try

library(data.table)
setDT(df1)[, .SD[any(Prop1!=Prop2)], ID]
#          ID Prop1 Prop2 TotalProp
# 1: 56892558   A61   G02         4
# 2: 56892558   A61   A61         4
# 3: 56892558   G02   A61         4
# 4: 56892558   A61   A61         4
# 5: 56892552   B61   B61         3
# 6: 56892552   B61   B61         3
# 7: 56892552   B61   A61         3
# 8: 56892559   B61   G61         3
# 9: 56892559   B61   B61         3
#10: 56892559   B61   B61         3

Or as @Frank suggested

setDT(df1)[, if(any(Prop1!=Prop2)) .SD, ID]

Similar option using dplyr

library(dplyr)
df1 %>%
    group_by(ID) %>%
    filter(any(Prop1!=Prop2))

Or using ave from base R

df1[with(df1, ave(Prop1!=Prop2, ID, FUN=any)),]
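For anyone who wants to check this without installing packages, here is a minimal base R sketch. It rebuilds the sample rows from the question as df1 and runs the ave approach next to a second variant that first collects the IDs with at least one mismatch and then subsets with %in%. The names keep_rows, ids_to_keep, res_ave and res_in are made up for illustration, and the data frame is created with stringsAsFactors = FALSE on the assumption that Prop1 and Prop2 are character columns (comparing factors with different level sets using != would fail).

```r
# Rebuild the sample data from the question (df1 is the name used above).
# stringsAsFactors = FALSE keeps Prop1/Prop2 as character vectors so that
# != compares the values directly (assumption: the real data is character).
df1 <- data.frame(
  ID = c(56891940, rep(56892558, 4), rep(56892552, 3), rep(56892559, 3)),
  Prop1 = c("G02", "A61", "A61", "G02", "A61",
            "B61", "B61", "B61", "B61", "B61", "B61"),
  Prop2 = c("G02", "G02", "A61", "A61", "A61",
            "B61", "B61", "A61", "G61", "B61", "B61"),
  TotalProp = c(2, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Approach from the answer: flag every row whose ID has at least one
# row where Prop1 and Prop2 differ.
keep_rows <- with(df1, ave(Prop1 != Prop2, ID, FUN = any))
res_ave <- df1[keep_rows, ]

# Alternative sketch (hypothetical variable names): collect the IDs that
# have a mismatch, then keep all rows belonging to those IDs.
ids_to_keep <- unique(df1$ID[df1$Prop1 != df1$Prop2])
res_in <- df1[df1$ID %in% ids_to_keep, ]

print(res_in)
```

Both variants should drop only the single row for ID 56891940 on this sample. The %in% version avoids a per-group function call, which may help on very large inputs, but I have not benchmarked it.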