*In this series of blogposts we'll explain tools & techniques of dealing with time-to-event data, and demonstrate how survival analysis is integral to many business processes.*

Survival analysis is long established within actuarial science and medical research but seems infrequently used in general data science projects. In this series of blogposts:

- We'll introduce survival analysis as a vital technique in any statistician's toolkit

- We'll demonstrate a general approach to undertaking a data science project: accessing, cleaning and storing data, and using a range of open-source analytical tools that enable rapid iteration, data exploration, modelling and visualisation

- We'll use a real-world dataset and seek to both match and improve upon some existing analysis already undertaken and released by a third party.

By the end you should have a better understanding of the theory, some tools and techniques, and hopefully gain some ideas about how survival analysis can be applied to all manner of event-based processes that are often crucial to business operations.

So firstly...

# What is Survival Analysis?

Wikipedia defines survival analysis as:

> a branch of statistics that deals with analysis of time duration until one or more events happen, such as death in biological organisms and failure in mechanical systems.

We might, for example, expect an actuary to try to predict what proportion of the general population will survive past a certain age^{1}. She might furthermore want to know the rate of change of survival with time (the **hazard function**), and the characteristics of individuals that most influence their survival.

More generally, we can use survival analysis to model the expected **time-to-event** for a wide variety of situations:

- **User shopping behaviour** e.g. the elapsed time between a user subscribing to an online shopping service and ordering their first product
- **Crop yields and harvesting** e.g. the duration between seeding a field and the majority of the crops being ready for harvest
- **Radioactive half-life** e.g. the time until half the atoms in a luminous blob^{2} of tritium have decayed into helium-3
- **Hardware failure rates** e.g. the expected lifetime for a piece of machinery before component failure
- **Customer subscription persistence** e.g. the expected time for a customer to remain subscribed to a cellphone service before churning

# Illustrating the basics

Imagine we have a fleet of haulage trucks and we're particularly interested in the elapsed time between purchase and first maintenance event (aka repairs aka servicing). We could use this analysis to:

- Identify which manufacturers and models of trucks require the least repair and favour buying those again in future

- Identify contributing factors leading to trucks needing earlier repairs and try to mitigate them

- Anticipate likely spikes of activity for fleet repair during the year and ensure funds are available in advance

## Sketching a survival curve

We observe:

- The duration of our study is measured from time of purchase until the end of the second year (24 months). This is a **relative time** and may begin at a different calendar date for each truck

- The survival function measures the proportion of trucks that remain unserviced at each point in time. It drops quickly and then flattens out; it crosses the 50% line at 10 months, meaning that by 10 months we can expect 50% of all the trucks in the fleet to have had their first service

- About 36% of trucks remain unserviced at the end of the first 24 months; conversely, about 64% will have a service event during their first 24 months

## Two fundamental measurements

### The survival function $S(t)$

The survival function, $S(t)$, of an individual is the probability that they survive until at least time $t$.

$$S(t) = \Pr(T > t)$$

where $t$ is a time of interest and $T$ is the time of event.

The survival curve is non-increasing (once the event has occurred for an individual, it can't un-occur) and is limited to the range $[0,1]$. Note that the event might not happen within our period of study, which we call **right-censoring**. This happens in the above example where, for 36% of trucks, all we know is that their first service happens some time after 24 months.
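To make the definition concrete, here's a minimal sketch using hypothetical, fully observed first-service times (ignoring censoring for now): the empirical survival function is just the fraction of individuals whose event time exceeds $t$:

```python
# Empirical survival function S(t) = Pr(T > t) for fully observed event times.
# Hypothetical first-service times (months) for ten trucks; no censoring yet.
service_months = [3, 5, 7, 8, 10, 10, 14, 19, 21, 23]

def empirical_survival(times, t):
    """Fraction of individuals whose event time is strictly greater than t."""
    return sum(1 for T in times if T > t) / len(times)

print(empirical_survival(service_months, 0))   # 1.0 - everyone survives past t=0
print(empirical_survival(service_months, 10))  # 0.4 - four of ten survive past 10 months
```

Once some trucks are right-censored this simple counting no longer works, which is exactly why the estimators discussed later exist.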

### The hazard function $\lambda(t)$

The hazard function $\lambda(t)$ is a related measure, telling us the instantaneous rate at which the event occurs at time $t$, given that the individual has survived up to time $t$:

$$\lambda(t) = \lim_{\delta t \rightarrow 0} \frac{\Pr(t \leq T < t + \delta t \ | \ T > t)}{\delta t}$$

With some maths^{3} we can work back to the Survival function:

$$S(t) = \exp\left(- \int\limits_{0}^{t} \lambda(u)\,du\right)$$

The hazard function $\lambda(t)$ can be estimated non-parametrically, so we can fit a pattern of events that is not necessarily monotonic.

# Other measurements and considerations

## The cumulative hazard function

An alternative representation of the time-to-event behaviour is the cumulative hazard function $\Lambda(t)$, which is essentially the summing of the hazard function over time, and is used by some models for its greater stability. We can show:

$$\Lambda(t) = \int\limits_{0}^{t} \lambda(u)du = -\log S(t)$$

... and the simple relation of $\Lambda(t)$ to the survival function $S(t)$ is a nice property, exploited in particular by the Cox Proportional Hazards and Aalen Additive models, which we'll demonstrate later.

Stating it slightly differently, we can relate the attributes of the individuals to their survival curve:

$$S(t) = e^{-\sum_i \Lambda_i(t)}$$

This powerful approach is known as **Survival Regression**. In the trucks example, we might want to know the relative impact of engine size, hours of service, geographical regions driven etc upon the time from first purchase to first service.
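As a quick numerical sanity check of the identity $\Lambda(t) = -\log S(t)$, consider a constant hazard $\lambda$ (the exponential case), where $\Lambda(t) = \lambda t$ and $S(t) = e^{-\lambda t}$. A minimal sketch with a hypothetical hazard value:

```python
import math

# For a constant hazard lam, the cumulative hazard is Lambda(t) = lam * t
# and the survival function is S(t) = exp(-lam * t).
lam = 0.05   # hypothetical hazard: 5% of surviving trucks serviced per month

def cumulative_hazard(t, lam=lam):
    """Numerical integral of the (constant) hazard from 0 to t."""
    steps = 10_000
    dt = t / steps
    return sum(lam * dt for _ in range(steps))

def survival(t, lam=lam):
    return math.exp(-lam * t)

t = 24.0
print(cumulative_hazard(t))    # ~1.2, i.e. lam * t
print(-math.log(survival(t)))  # 1.2 - matches Lambda(t) = -log S(t)
```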

## The half-life

We saw the half-life of truck repair illustrated above (the orange arrows). Most people are familiar with this measure and it's exactly as it says on the tin:

- select a group of individuals - our fleet of trucks - and measure how long it takes for the event of interest to occur.
- once the event of interest - the first service repair - occurs for half of the population, the elapsed period is known as the half-life.
- note that we can't state exactly which truck or exactly when, just work on the aggregate values.

There's nothing particularly special about the half-life, and we might be interested in the time taken for e.g. 25% of the trucks to come in for first service, or 75% or 90% etc.
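To make this concrete, here's a small sketch (with hypothetical survival-curve values echoing the example above) that reads off the time at which the curve first drops to a given survival fraction - 0.5 for the half-life, but equally 0.75 or 0.25:

```python
# Hypothetical survival curve: (month, estimated S(t)) pairs, non-increasing.
curve = [(0, 1.00), (3, 0.90), (6, 0.70), (10, 0.50),
         (14, 0.45), (18, 0.40), (24, 0.36)]

def survival_quantile(curve, fraction):
    """First time at which S(t) has dropped to `fraction` or below (None if never)."""
    for t, s in curve:
        if s <= fraction:
            return t
    return None

print(survival_quantile(curve, 0.50))  # 10   - the half-life from the example above
print(survival_quantile(curve, 0.75))  # 6    - 25% of trucks serviced by month 6
print(survival_quantile(curve, 0.25))  # None - never reached within the study
```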

## Censoring and truncation

We're measuring time-to-event in the real world, so there are practical constraints on the period of study and on how to treat individuals that fall outside that period. Censoring is when the event of interest (repair, first sale, etc) occurs outside the study period, and truncation is due to the study design.

The following discussion continues our hypothetical truck maintenance study:

- We imagine that our data source is a set of database extracts taken at the start of 2014 for a 3-year period from mid-2010 to mid-2013, and this is all of the data we could extract from the company database

- It may be that we have very little data about service-intervals longer than 24 months, so despite the study period covering 36 months, when we calculate survival curves we decide to only look at the first 24 months of a truck's life

- All trucks remain in daily operation through end-2014; none are sold or scrapped.

We observe:

- Truck A was purchased at start-2011 and first serviced 21 months later. It has a first-service-period or 'survival' of 21 months

- Truck B was also purchased at start-2011 but first serviced 9 months later. This 9-month survival is much shorter than for Truck A: perhaps it has a different manufacturer or was driven differently. The various survival models let us explore these factors in different ways

- Truck C was first serviced in Feb 2012, but purchased prior to the start of our study period. This **left-truncation** is a consequence of our decision to start the study period at mid-2010, and we may mitigate it by adjusting the start of the study period or adjusting the lifetime value of the truck

- Truck D was purchased in Feb 2010 and serviced soon afterwards - both prior to the study period - so we did not observe the service event during the study. In fact, if we're strict about our data sampling then we wouldn't even know about this event from our database extract. The need to account for such **left-censoring** is rarely encountered in practice

- Truck E was purchased in Jun 2012 and by the end of the study in mid-2013 it had still not gone for first service. This **right-censoring** is very common in survival analysis

- Truck F does not appear at all in the purchases database and the first time we learn it exists is from the record of its first maintenance service during 2012. We are unlikely to encounter such **right-truncation** in practice, since we're dealing with well-kept database records for purchases.
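Under the study design above (window mid-2010 to mid-2013), a small sketch shows how a truck's purchase and first-service dates map onto these categories. The dates are hypothetical stand-ins, loosely following trucks A-E, using fractional years for simplicity (right-truncation, as for Truck F, can't be detected this way because the purchase record itself is missing):

```python
# Study window in fractional years (mid-2010 to mid-2013).
STUDY_START, STUDY_END = 2010.5, 2013.5

def classify(purchased, serviced):
    """Label one truck's observation relative to the study window.
    `serviced` is None if no first service was ever recorded."""
    if purchased < STUDY_START:
        if serviced is not None and serviced < STUDY_START:
            return "left-censored"   # event happened before we started watching
        return "left-truncated"      # entered the study part-way through its life
    if serviced is None or serviced > STUDY_END:
        return "right-censored"      # no event observed by the end of the study
    return "observed"

print(classify(2011.0, 2012.75))  # observed        (like Truck A)
print(classify(2010.0, 2012.1))   # left-truncated  (like Truck C)
print(classify(2010.1, 2010.3))   # left-censored   (like Truck D)
print(classify(2012.5, None))     # right-censored  (like Truck E)
```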

Censoring and truncation differ from one analysis to the next and it's always vitally important to understand the limitations of the study and state the heuristics used. Generally, one can expect to deal often with right-censoring, occasionally with left-truncation, and very rarely with left-censoring or right-truncation.

# What models can we use?

The very simplest survival models are really just tables of event counts: non-parametric, easily computed and a good place to begin modelling to check assumptions, data quality and end-user requirements etc.

### Kaplan-Meier Model

This model gives us a maximum-likelihood estimate of the survival function $\hat S(t)$ like that shown in the first diagram above.

$$\hat S(t) = \prod\limits_{t_i \leq t} \frac{n_i - d_i}{n_i}$$

where $d_i$ and $n_i$ are respectively the count of 'death' events and individuals at risk at time $t_i$.

The cumulative product gives us a non-increasing curve where we can read off, at any timestep during the study, the estimated probability of survival from the start to that timestep. We can also compute the estimated survival time or median survival time (half-life) as shown above.
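A minimal from-scratch sketch of the estimator, using hypothetical right-censored truck data (in practice a library such as lifelines would also give confidence intervals):

```python
# Kaplan-Meier estimate from (time, observed) pairs: observed=False means
# right-censored - we only know the event happens some time after `time`.
data = [(3, True), (5, True), (7, False), (8, True), (10, True),
        (14, True), (19, False), (21, True), (24, False), (24, False)]

def kaplan_meier(data):
    """Return [(t_i, S_hat(t_i))] at each distinct event time."""
    s, curve = 1.0, []
    for t in sorted({t for t, obs in data if obs}):
        d = sum(1 for ti, obs in data if ti == t and obs)  # events at time t
        n = sum(1 for ti, _ in data if ti >= t)            # still at risk at t
        s *= (n - d) / n
        curve.append((t, s))
    return curve

for t, s in kaplan_meier(data):
    print(f"month {t:>2}: S(t) = {s:.3f}")
```

Note how the censored individuals still contribute to the at-risk counts $n_i$ up to their censoring times, which is exactly how the estimator uses partial information.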

### Nelson-Aalen Model

A close alternative method is the Nelson-Aalen model, which estimates the cumulative hazard function $\Lambda(t) = -\log S(t)$ and is more stable since it has a summation form:

$$\hat \Lambda(t) = \sum\limits_{t_i\leq t} \frac{d_i}{n_i}$$

where again, $d_i$ and $n_i$ are respectively the count of 'death' events and individuals at risk at time $t_i$.
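The summation form makes a from-scratch sketch even shorter; this uses the same hypothetical right-censored data shape as before:

```python
import math

# Nelson-Aalen estimate of the cumulative hazard from (time, observed) pairs,
# where observed=False marks a right-censored individual.
data = [(3, True), (5, True), (7, False), (8, True), (10, True),
        (14, True), (19, False), (21, True), (24, False), (24, False)]

def nelson_aalen(data):
    """Return [(t_i, Lambda_hat(t_i))] at each distinct event time."""
    cum, curve = 0.0, []
    for t in sorted({t for t, obs in data if obs}):
        d = sum(1 for ti, obs in data if ti == t and obs)  # events at time t
        n = sum(1 for ti, _ in data if ti >= t)            # at risk at time t
        cum += d / n
        curve.append((t, cum))
    return curve

for t, cum in nelson_aalen(data):
    # exp(-Lambda_hat) gives a survival estimate close to Kaplan-Meier's
    print(f"month {t:>2}: Lambda(t) = {cum:.3f}, exp(-Lambda) = {math.exp(-cum):.3f}")
```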

This approach of estimating the hazard function is the basis for many more methods, including:

### Cox Proportional Hazards Model

Having computed the survival function^{4} for a population, the logical next step is to understand the effects of different characteristics of the individuals. In our truck example above, we might want to know whether maintenance periods are affected more or less by mileage, or by types of roads driven, or the manufacturer, model or load-capacity of truck etc.

The Cox PH model gives a semi-parametric method of estimating the hazard function at time $t$ given a baseline hazard that's modified by a set of covariates:

$$\lambda(t|X) = \lambda_0(t)\exp(\beta_1X_1 + \cdots + \beta_pX_p) = \lambda_0(t)\exp(\bf{\beta}\bf{X})$$

where $\lambda_0(t)$ is the non-parametric baseline hazard function and $\bf{\beta}\bf{X}$ is a linear parametric model using features of the individuals, transformed by an exponential function.

We can now make comparative statements such as *"the hazard rate for trucks from manufacturer A is 2x greater than for manufacturer B"*, or *"trucks that drive 100 fewer km per week tend to have their first service up to 6 months later"* etc.
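Such statements come from exponentiating fitted coefficients: a covariate with coefficient $\beta$ multiplies the baseline hazard by $\exp(\beta)$ per unit increase. A sketch with hypothetical coefficient values (illustrative only, not fitted to any data):

```python
import math

# Hypothetical Cox PH coefficients for the truck example (illustrative only).
beta = {
    "manufacturer_A": 0.693,   # indicator covariate: 1 if manufacturer A, else 0
    "weekly_km_100s": 0.105,   # mileage, in units of 100 km per week
}

def hazard_ratio(coef, delta=1.0):
    """Multiplicative change in hazard for a `delta` increase in one covariate."""
    return math.exp(coef * delta)

# Trucks from manufacturer A have ~2x the baseline hazard...
print(hazard_ratio(beta["manufacturer_A"]))  # ~2.0
# ...and each extra 100 km/week multiplies the hazard by ~1.11
print(hazard_ratio(beta["weekly_km_100s"]))  # ~1.11
```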

### There's many more models...

Survival analysis has been developed by many hands over many years and there's an embarrassment of riches in the literature, including:

- Parametric / Accelerated Failure Time models, using:
  - Exponential
  - Gompertz
  - Weibull
  - Log-logistic link functions
- Aalen Additive model
- Bayesian inferential models

... but we'll save the detail on those for specific examples in future posts in this series.

# Recap

It takes a surprising amount of detail to explain the basics of survival analysis, so thank you for reading this far! As noted above, time-to-event analyses are very widely applicable to all sorts of real-world behaviours - not just studies of lifespan in actuarial or medical science.

In the rest of this series we'll use a publicly available dataset to demonstrate implementing a survival model, interpreting the results, and we'll try to learn something along the way. Continued in part 2

1. As we noted previously, it was life-insurers who were among the first to employ strong data analysis in their profession. ↩

2. If you have an old watch (pre-2000) with "T SWISS MADE T" somewhere on the dial, the T indicates that a luminous Tritium-containing compound is painted onto the hands and dial. You'll probably also sadly notice that it's completely failing to glow in the dark, because the radioactive half-life of ~12.4 years means there's very little Tritium left now. ↩

3. For a thorough but easily-digested derivation of the survival and hazard functions, see the Wikipedia page. ↩

4. The survival, hazard and cumulative hazard functions are often mentioned seemingly-interchangeably in casual explanations of survival analysis, since usually if you have one function you can derive the other(s). In our experience it's best to present results to general audiences as a computed survival curve, regardless of the model used. ↩