I am flummoxed. I am trying to isolate certain rows of df according to values in two columns. As always i try this in practice data first. My code works fine.
data1<-df2[df2$fruit=="kiwi" | df2$fruit=="orange" | df2$fruit=="apple" & (df2$dates>= "2010-04-01" & df2$dates< "2010-10-01"), ]
when I try the same code on my real data, it doesn't work. It collects the "fruits" I need, but ignores my date range request.
data1<-lti_first[lti_first$hai_atc=="C10AA01" | lti_first$hai_atc=="C10AA03" | lti_first$hai_atc=="C10AA04" | lti_first$hai_atc=="C10AA05" | lti_first$hai_atc=="C10AA07" | lti_first$hai_atc=="C10AB02" |lti_first$hai_atc=="C10AA04" |lti_first$hai_atc=="C10AB08" | lti_first$hai_atc=="C10AX09" & (lti_first$date_of_claim >= "2010-04-01" & lti_first$date_of_claim<"2010-10-01"), ]
the structure of the variables in my practice data and real data are the exact same. Fruits/hai_atc are factors in both dfs, dates are as.Dates in both dfs.
in an effort to get around this I've tried subsetting my data instead, but that won't work for me either (but does work on practice data)
x<-subset(lti_first, hai_atc=="V07AY03" | hai_atc=="A11JC94" & (date_of_claim>="2010-04-01" & date_of_claim<"2010-10-01"))
What am I doing wrong? To me, my code looks identical!
sample df
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
dput(df2)
structure(list(names = structure(c(3L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 1L, 2L, 3L, 2L, 1L, 1L), .Label = c("john", "mary", "tom"
), class = "factor"), dates = structure(c(14641, 14730, 14669,
14791, 14791, 14761, 14853, 14791, 14914, 14853, 14822, 14914,
14944, 14975), class = "Date"), age = c(60, 55, 60, 57, 55, 60,
57, 55, 57, 55, 60, 55, 57, 57), sex = structure(c(2L, 1L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("f",
"m"), class = "factor"), fruit = structure(c(1L, 4L, 2L, 3L,
1L, 1L, 1L, 4L, 2L, 1L, 3L, 1L, 4L, 1L), .Label = c("apple",
"banana", "kiwi", "orange"), class = "factor")), .Names = c("names",
"dates", "age", "sex", "fruit"), row.names = c(NA, -14L), class = "data.frame")
**real data too big to put in dput, here's an str instead
str(sample_lti_first)
'data.frame': 20 obs. of 5 variables:
$ hai_dispense_number: Factor w/ 53485 levels "Patient HAI0000017",..: 22260 22260 2527 24311 24311 24311 24311 13674 13674 13674 ...
$ sex : Factor w/ 4 levels "F","M","U","X": 2 2 2 1 1 1 1 1 1 1 ...
$ hai_age : int 18 18 27 40 40 40 40 28 28 28 ...
$ date_of_claim : Date, format: "2009-10-09" "2009-10-09" "2009-10-18" ...
$ hai_atc : Factor w/ 1038 levels "","A01AA01","A01AB03",..: 144 76 859 80 1009 1009 859 81 1008 859 ...
Does this work?
Note the use of
%in%
andas.Date
.I think it's important to expand on @Aaron's comment. The problem you ran into was caused by a lack of parentheses around all the OR statements that using
%in%
avoided, and not that OR statements don't work in the extract function[
. Your toy example actually didn't work exactly how you wanted - there was anorange
fruit with the date2010-12-01
. That other problems didn't crop up was just chance.The way to read the boolean logic in this code
is:
That is what you got - only the apples were truncated to the appropriate date range. There were actually no kiwis outside the date range in the toy dataset.
Now add a pair of parentheses:
This code says:
That all being said,
%in%
is definitely the way to go.