I have a dataset that looks something like this:
df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L,
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L,
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA,
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA,
-9L))
I want to remove duplicate rows based on Claim Num values, but to remove duplicates based on the following criteria: df$Company == 'NA' | df$Amount == 0
In other words, remove records 1, 3, and 5.
I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount = 0 | df$Company == 'NA')]),]
The code runs without errors, but doesn't actually remove duplicate rows based on the required criteria. I think that's because I'm telling it to remove any duplicate Claim Nums which match to those criteria, but not to remove any duplicate Claim.Num
but treat certain Amounts & Companies preferentially for removal. Please note that, I can't simple filter out the dataset based on specified values, as there are other records that may have 0 or NA values, that require inclusion (e.g. records 8 & 9 shouldn't be excluded because their Claim.Nums are not duplicated).
Slightly different approach
This adds a column
x
to indicate which rows haveClaim.Num
present more than once, then filters the result based on your criteria. The use of-ncol(r)
just removes the columnx
at the end.If you order your data frame first, then you can make sure
duplicated
keeps the ones you want:Another way based on
subset
and logical indices: