I'm performing survival analysis in R using the 'survival' package and coxph. My goal is to compare survival between individuals with different chronic diseases. My data are structured like this:
id, time, event, disease, age.at.dx
1, 342, 0, A, 8247
2, 2684, 1, B, 3879
3, 7634, 1, A, 3847
where 'time' is the number of days from diagnosis to event, 'event' is 1 if the subject died and 0 if censored, 'disease' is a factor with 8 levels, and 'age.at.dx' is the age in days when the subject was first diagnosed. I am new to survival analysis. I ran cox.zph on a model like this:
combi.age <- coxph(Surv(time, event) ~ disease + age.at.dx, data = combi)
Two of the disease levels violate the proportional hazards (PH) assumption, with p-values < 0.05. Plotting the Schoenfeld residuals over time shows that for one disease the hazard falls steadily over time; for the second, the line is predominantly parallel, but with a small upswing at the extreme left of the graph.
My question is how to deal with these disease levels. I'm aware from my reading that I should attempt to add a time interaction to the disease whose hazard drops steadily, but I'm unsure how to do this, given that most examples of coxph I've come across compare only two groups, whereas I am comparing 8. Also, can I safely ignore the assumption violation of the disease level with the high hazard at early time points?
I wonder whether this is an inappropriate way to structure my data, because it does not preclude a single individual appearing multiple times in the data - is this a problem?
Thanks for any help, please let me know if more information is needed to answer these questions.
I'd say you have a fairly good understanding of the data already and should present what you found. This sounds like a descriptive study rather than one where you will be presenting to the FDA with a request to honor your p-values. Since your audience will (or should) be expecting that the time-course of risk for different diseases will be heterogeneous, I'd think you can just describe these results and talk about the biological/medical reasons why the first "non-conformist" disease becomes less important with time and the other non-conforming condition might become more potent over time. You have already done a more thorough analysis than most descriptive articles in the medical literature exhibit. I rarely see a description of the nature of non-proportionality.
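If it helps with the description, here is a minimal sketch of how the per-level picture could be pulled out. It runs on simulated data, not yours: the data frame below is a made-up stand-in for combi with only three disease levels, and the indicator column isB is a hypothetical helper. It assumes a survival version of 3.0 or later, where cox.zph has a terms argument. The last part shows one possible way to give a single disease level a coefficient that changes with log(time) via tt(), in case you want to model the decline rather than only describe it; the choice of log(t) is an assumption, not something your data dictate.

```r
library(survival)

set.seed(42)
n <- 600
# Simulated stand-in for the `combi` data (3 disease levels, not 8)
combi <- data.frame(
  id        = seq_len(n),
  time      = ceiling(rexp(n, rate = 1 / 1500)),  # days from diagnosis
  event     = rbinom(n, 1, 0.7),                  # 1 = died, 0 = censored
  disease   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age.at.dx = round(runif(n, 365, 20000))         # age in days
)

combi.age <- coxph(Surv(time, event) ~ disease + age.at.dx, data = combi)

# terms = FALSE gives one test per coefficient, i.e. one per
# non-reference disease level, instead of one test for the whole factor
zph <- cox.zph(combi.age, terms = FALSE)
print(zph)

# Smoothed scaled Schoenfeld residuals for the first coefficient (diseaseB);
# the dashed line marks the time-constant estimate from the model
if (interactive()) {
  plot(zph, var = 1)
  abline(h = coef(combi.age)[1], lty = 2)
}

# Optional: let only level B have a coefficient that varies with log(time).
# `isB` is a hypothetical 0/1 indicator created just for this purpose.
combi$isB <- as.numeric(combi$disease == "B")
combi.tt <- coxph(Surv(time, event) ~ disease + age.at.dx + tt(isB),
                  data = combi,
                  tt = function(x, t, ...) x * log(t))
print(coef(combi.tt))
```

In the tt() fit, the diseaseB coefficient is the log hazard ratio at t = 1 day and the tt(isB) coefficient is the change in that log hazard ratio per unit of log(time). The other disease levels keep ordinary time-constant coefficients, so having 8 groups rather than 2 does not change the approach.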
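The last question, about data that "does not preclude a single individual appearing multiple times in the data", may require some more thorough discussion. The first approach would be to account for repeated patients with the cluster() function, by adding a cluster(id) term to the model formula. Note that cluster() does not stratify the baseline hazard (that is what strata() does); it leaves the coefficients unchanged and replaces the usual standard errors with robust ones that allow for correlation among rows from the same patient.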
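A minimal sketch of the cluster() approach, again on simulated data: the data frame is a hypothetical stand-in for combi in which ids are drawn with replacement so that some patients occur on several rows. Nothing here comes from your data.

```r
library(survival)

set.seed(7)
n <- 500
# Hypothetical data in which some ids appear on more than one row
combi <- data.frame(
  id        = sample(seq_len(300), n, replace = TRUE),
  time      = ceiling(rexp(n, rate = 1 / 1500)),  # days from diagnosis
  event     = rbinom(n, 1, 0.7),                  # 1 = died, 0 = censored
  disease   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age.at.dx = round(runif(n, 365, 20000))         # age in days
)

# cluster(id) requests a robust (sandwich) variance grouped by patient
fit <- coxph(Surv(time, event) ~ disease + age.at.dx + cluster(id),
             data = combi)

# Coefficients with naive and robust standard errors side by side
print(summary(fit)$coefficients[, c("coef", "se(coef)", "robust se")])
```

If the naive and robust standard errors are close, the repeated appearances are probably not distorting the inference much. Whether it is sensible for the same death to count as an event under several diseases is a separate design question that the robust variance does not settle.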