Dealing with non-proportional hazards in R

R-bloggers 2016-03-29

Summary:

[caption id="attachment_1837" align="aligncenter" width="640"] As things change over time so should our statistical models. The image is CC by Prad Prathivi[/caption]

Since I'm frequently working with large datasets and survival data I often find that the proportional hazards assumption for the Cox regressions doesn't hold. In my most recent study on cardiovascular deaths after total hip arthroplasty the coefficient was close to zero when looking at the period between 5 and 21 years after surgery. Grambsch and Thernau's test for non-proportionality hinted though of a problem and as I explored it there was a clear correlation between mortality and hip arthroplasty surgery. The effect increased over time, just as we had originally thought, see below figure. In this post I'll try to show how I handle with non-proportional hazards in R.

[toc]

Why should we care?

As we scale up to larger datasets we are increasingly looking at smaller effects. This causes the models to become more susceptible to residual bias and problems in model assumptions.

Missing the exposure

In the example above I would have missed the fact that a hip implant may affect mortality in the long run. While the effect isn't strong, it is one of the most frequent procedures and we have reasons to believe that subgroups may have an even higher risk of dying than the average patient with an hip implant.

Insufficient adjustment for confounding

If we assume that a variable is enough important to adjust for we also must accept that we should model it in optimal way so that all the confounding can't affect our exposure variable.

What is the Cox model all about?

The key to understanding the Cox regression is grasping the concept of hazard. All regression models are basically trying to express (should be familiar from high-school):

[latex]y = A + B * X[/latex]

The core idea is that we can choose what we want to have as the y. Cox stated that if we assume that the proportion between hazards remains the same we can use the logarithm of the hazards function (h(t)) as the y:

[latex]h(t) = frac{f(t)}{S(t)}[/latex]

Here the f(t) is the risk of dying at a certain moment in time while having survived that far, the S(t). In practice if we look at a certain moment we can estimate how many have made to that time point and then look at how many died at that particular point in time. An important resulting effect from this is that we can include a patient that appears after the study start and include that patient from that time point and onward. E.g. if Peter is operated with a hip arthroplasty in England, arrives after 1 year to Sweden, he has been alive for 1 year when we choose to include him. Note that any patient that would have died prior to that would never have come to our attention and we can therefore not include Peter in the S(t) when looking at the hazard prior to 1 year.

The start time is the key

The core idea of dealing with proportional hazards and time varying coefficients in a Cox model is to split the time and use an interaction term. We can do this similar to including Peter in the example above. We choose a suitable time interval and split all observations accordingly. If a patient is alive at the end of the interval he/she is simply marked as censored even if he/she eventually suffers an event. The advantage is that now we have a dataset where we have several starting times that we can use as interaction terms.

Important note: Always use the starting time and never the end time as the end time is invariably related to the event. Without time-splitting this is obvious as the interaction becomes highly significant and suddenly deviates sharply from the original estimate. If we have a more fine-grained time-split this becomes less of a problem but I still believe that it makes more sense in using the starting time as this should be completely free of all bias.

Splitting time with the `Greg::timesplitter` (R-code starts here)

My Greg-package (now on CRAN) contains a timeSplitter function that is a wrapper for the Epi package's Lexis functionality. To illustrate how this works we start with 4 patients:

test_data <- data.frame(  id = 1:4,  time = c(4, 3.5, 1, 5),  event = c("censored", "dead", "alive", "dead"),  age = c(62.2, 55.3, 73.7, 46.3),  date = as.Date(    c("2003-01-01",       "2010-04-01",

Authors:

Max Gordon

Date tagged:

03/29/2016, 19:03

Date published:

03/29/2016, 05:12