I am flummoxed. I am trying to isolate certain rows of df according to values in two columns. As always i try this in practice data first. My code works fine.
data1<-df2[df2$fruit=="kiwi" | df2$fruit=="orange" | df2$fruit=="apple" & (df2$dates>= "2010-04-01" & df2$dates< "2010-10-01"), ]
when I try the same code on my real data, it doesn't work. It collects the "fruits" I need, but ignores my date range request.
data1<-lti_first[lti_first$hai_atc=="C10AA01" | lti_first$hai_atc=="C10AA03" | lti_first$hai_atc=="C10AA04" | lti_first$hai_atc=="C10AA05" | lti_first$hai_atc=="C10AA07" | lti_first$hai_atc=="C10AB02" |lti_first$hai_atc=="C10AA04" |lti_first$hai_atc=="C10AB08" | lti_first$hai_atc=="C10AX09" & (lti_first$date_of_claim >= "2010-04-01" & lti_first$date_of_claim<"2010-10-01"), ]
the structure of the variables in my practice data and real data are the exact same. Fruits/hai_atc are factors in both dfs, dates are as.Dates in both dfs.
in an effort to get around this I've tried subsetting my data instead, but that won't work for me either (but does work on practice data)
x<-subset(lti_first, hai_atc=="V07AY03" | hai_atc=="A11JC94" & (date_of_claim>="2010-04-01" & date_of_claim<"2010-10-01"))
What am I doing wrong? To me, my code looks identical!
sample df
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
dput(df2)
structure(list(names = structure(c(3L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 1L, 2L, 3L, 2L, 1L, 1L), .Label = c("john", "mary", "tom"
), class = "factor"), dates = structure(c(14641, 14730, 14669,
14791, 14791, 14761, 14853, 14791, 14914, 14853, 14822, 14914,
14944, 14975), class = "Date"), age = c(60, 55, 60, 57, 55, 60,
57, 55, 57, 55, 60, 55, 57, 57), sex = structure(c(2L, 1L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("f",
"m"), class = "factor"), fruit = structure(c(1L, 4L, 2L, 3L,
1L, 1L, 1L, 4L, 2L, 1L, 3L, 1L, 4L, 1L), .Label = c("apple",
"banana", "kiwi", "orange"), class = "factor")), .Names = c("names",
"dates", "age", "sex", "fruit"), row.names = c(NA, -14L), class = "data.frame")
**real data too big to put in dput, here's an str instead
str(sample_lti_first)
'data.frame': 20 obs. of 5 variables:
$ hai_dispense_number: Factor w/ 53485 levels "Patient HAI0000017",..: 22260 22260 2527 24311 24311 24311 24311 13674 13674 13674 ...
$ sex : Factor w/ 4 levels "F","M","U","X": 2 2 2 1 1 1 1 1 1 1 ...
$ hai_age : int 18 18 27 40 40 40 40 28 28 28 ...
$ date_of_claim : Date, format: "2009-10-09" "2009-10-09" "2009-10-18" ...
$ hai_atc : Factor w/ 1038 levels "","A01AA01","A01AB03",..: 144 76 859 80 1009 1009 859 81 1008 859 ...