Survival Analysis part 4: Cox PH modelling

Here we demonstrate a standard semi-parametric regression method to create a model of harddrive failures. This model can be tested for accuracy and used for prediction.

In the previous post we used a very basic non-parametric model (Kaplan-Meier) to produce what is effectively a simple count of events over time. These Kaplan-Meier models are extremely quick to run but limited by simplicity; whilst we can compare the empirical survival functions amongst different subsets of the data, we can't easily make comparative statements nor make predictive statements for new unseen data.

What we want is a regression model which takes not only the time_to_event and event_outcome features as before, but also a potentially unlimited set of set of exogenous features to predict the survival or hazard rate at each point in time. The Cox Proportional Hazards model (Cox PH) is a such a survival regression model.

Cox Proportional Hazards Modelling

The Cox PH model gives a semi-parametric method of estimating the hazard function $\lambda$ at time $t$ given a baseline hazard that's modified by a set of covariates:

$$\lambda(t|X) = \lambda_0(t)\exp(\beta_1X_1 + \cdots + \beta_pX_p) = \lambda_0(t)\exp(\bf{\beta}\bf{X})$$

where $\lambda_0(t)$ is the non-parametric baseline hazard function and $\bf{\beta}\bf{X}$ is a linear parametric model using features of the individuals, transformed by an exponential function.

The algorithm runs quite quickly, performing iterative partial regression of the time-dependent $\lambda_0(t)$, and the feature-dependent $\exp(\bf{\beta}\bf{X})$, repeated until convergence.

The big strength of Cox PH is to:

  • allow us to compare the relative hazard rates of the different features by comparing the $\beta$ coefficients, and to
  • compute a predicted survival function for new datapoints simply by plugging the details into the model.

One big weakness of the Cox PH is that the $\beta$ values are assumed constant over all time, which is often untrue, more of which later.

Lets continue our investigation into the harddrive failure data, and as ever, the plots and tables in the notebook are accompanied by detailed comments:

To summarise the notebook:

Model specification

The first obvious difference we encounter with this regression method is to transform the row-based dataframe into a design matrix using the patsy package and a model-specification string manufacturer + capacity. This modelspec is simply a GLM1 and the flexible approach afforded by Cox PH lets us declare any modelspec we wish, for example rather than $\beta_0X_0 + \beta_1X_1 + \beta_2X_2$ we could try $\beta_0X_0^2 + \beta_0X_0 + \beta_1X_1 * \beta_2X_2$ etc.

Baseline hazard

The Cox PH model takes a few iterations to converge on a fit and the lifelines package stores several results on the CoxPHFitter object for convenience including the cumulative baseline hazard rate $\Lambda$ and baseline survival rate $S = e^{-\Lambda}$.

That baseline hazard rate increases to 1% at approx 5 years power-on time. This has been set by whatever feature combination was set on the intercept, in this case manufacturer == HGST and capacity == 1.5TB which are quite low-failure values.

Proportional hazards

The baseline is of course, modified by the relevant $exp(\beta\bf{X})$ values and it's here that we can make some interesting comparative observations on the model:

  • An 'average' Seagate drive has approx 22x the hazard rate of an 'average' HGST drive, and approx 12x that of an 'average' WDC drive
  • A 'average' 3TB drive has a hazard rate approx 14x that of a 1.5TB drive and 3x that of a 4TB drive
  • Combining features, our model tells us a Seagate 3TB drive has a hazard rate $(22 + 14 = 36)$x that of the baseline; $22$x larger than an HGST 3TB drive, and $10$x larger than a WDC 3TB drive.

These ratios may seem surprising at first: can a Seagate 3TB drive really be an estimated 22x more likely to fail than an HGST 3TB drive? Look back to the Kaplan-Meier modelling in the previous post and we see the survival functions for 3TB drive as measured by a Kaplan Meier model. If we use the relation $\lambda(t) = -log(S(t))$ and the comparison $\frac{\lambda(t)_{Seagate.3TB}}{\lambda(t)_{HGST.3TB}}$, we see:

  • At 1 year, Seagate 3TB drives are $\frac{log(0.987)}{log(0.997)} = 4.4$x more likely to fail than HGST 3TB.
  • At 2 years, $\frac{log(0.85)}{log(0.994)} = 27$x
  • At 3 years, $\frac{log(0.417)}{log(0.983)} = 51$x

... so the 22x multiplier is a reasonable value. This also neatly demonstrates a weakness of the Cox PH approach: the proportional hazards are modelled as time-invariant, whereas here we see a wide, time-varying range from 4x to 51x.

Model evaluation

We can use a concordance index and k-fold cross validation to evaluate the performance of the model, for more details see the Notebook itself.

We find a mean average concordance of 0.76, which indicates a reasonably accurate model and we could quite happily use it to predict the survival of as-yet-unseen harddrives of the same manufacturers and capacities.

Weaknesses of the Cox PH model

As shown above, we can't always assume that a comparative hazard rate is constant with time, for instance:

  • time-to-first-purchase habits may initially differ greatly between shoppers who arrived at a website via different adverts, but those differences may disappear over time
  • post-surgery, the probability curve of patient recovery differs between the young and the old over time
  • in our harddrive data we see that 3TB Seagate drives have between 4x, 27x and 51x the hazard rate of HGST 3TB drives over the first 3 years of power-on

Despite offering quick calculation and easily interpretable results, the central assumption of constant proportional hazards is a weakness of the Cox PH model. It's always useful to start a survival analysis with Kaplan-Meier and/or Cox PH in order to set a baseline, but we are unlikely to stop with these models and call it complete.

Next steps

For now, we'll pause this series here, since we've covered a lot of ground:

  • In part 1 we described Survival Analysis as a method for measuring time-to-event behaviour and worked though several considerations including the hazard function, half-life and censoring
  • In part 2 we demonstrated the preparation of an analysis environment using scientific Python, and also showed our acquiring, storing and preparing the harddrive failures data
  • In part 3 we showed a set of initial analyses to further understand the data, and ran a few Kaplan-Meier models to discover the survival functions of the whole population and for specific drive manufacturers and capacities
  • And of course above, we've used a Cox PH model to describe the relative impact of different manufacturers & capacities on harddrive failure rates, and we've also tested the model predictions using a standard concordance measure.

When we return to this series we may discuss more complicated regression models which let us make more accurate and reasoned predictions, including:

Aalen Additive Model

This survival regression model extends the Cox PH concept to perform a piecewise linear regression on the cumulative hazard value at each point in time along the x-axis:

$$\lambda(t_i) = \sum\limits_{j=1}^{j=N}\beta_0(t_i) + \beta_j(t_i)x_j$$

where $\beta_j(t_i)$ lets us have a time-varying covariate for feature $x_j$. We can also add normalisation parameters (e.g. L1 or L2 norms) to help smoothen the regressions.

Gaussian Process Regression

Gaussian Process models are a generic Bayesian method to estimating arbitrary functions using exogenous features. As a probabilistic method we gain great flexibility, data interpolation, empirical confidence intervals (credible regions) and the ability to include prior assumptions.

As mentioned earlier, please do get in touch if you'd like named access to the private repository used for this series of blogposts.

  1. We will very likely visit Generalised Linear Models (GLMs) in future articles

Jonathan Sedar

Jon founded Applied AI in 2013. He's a specialist in Bayesian statistics and the application of machine learning to yield insights and profits throughout fintech, banking and insurance sectors.