*We now have a clean, prepared, real-world dataset on the failures of thousands of harddrives, so let's see what we can learn from a basic survival analysis.*

In the previous post we found and prepared some interesting real-world data suited for a survival analysis: the records of activity and failures for >40,000 harddrives, open-sourced and made freely available in Feb 2015 by BackBlaze, an online file backup and storage company.

In this post we will firstly explore the data further, and then run Kaplan-Meier modelling to understand the survival rates for the harddrives.

# Initial Analyses

Typically, we would view the basic counts & distributions as part of the initial data prep, but we have broken the work out here for clarity and to demonstrate some more of the possibilities for simple interactive data exploration and visualisation offered by the `pandas` and `seaborn` packages.

As ever, the plots and tables in the notebook are accompanied by detailed comments.

To summarise the initial analyses notebook:

- Seagate and HGST make up the vast majority (96%) of drives in the study, mostly with 3TB and 4TB capacities (a combined 55% of total), and there are some class imbalances which may make direct comparisons trickier.
- WDC drives have been in use for much less time than the others, and Seagate drives seem to have experienced a large number of failures around 20,000 hours (approx. 2.3 years).
- In the final scatterplot we see that in particular, Seagate 3TB drives showed a comparatively large count of failures, so we may expect them to show a shorter survival lifetime.

# Kaplan-Meier Modeling

This model gives us a maximum-likelihood estimate of the survival function:

$$\hat S(t) = \prod\limits_{t_i < t} \frac{n_i - d_i}{n_i}$$

where $d_i$ and $n_i$ are respectively the count of 'death' events and the count of individuals at risk at timestep $i$.

The cumulative product gives us a non-increasing curve where we can read off, at any timestep during the study, the estimated probability of survival from the start to that timestep. We can also compute the estimated survival time or median survival time (halflife).
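To make the cumulative product concrete, here is a minimal pure-Python sketch of the estimator. It is not how `lifelines` implements it; the function name `kaplan_meier` and the toy data are made up for illustration. Each row of the result holds $t_i$, $n_i$, $d_i$ and the running product just after $t_i$.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t_i, n_i, d_i, S_i)] for each distinct time with a failure.

    durations: observed time per individual (failure or censoring time)
    events:    True if a failure was observed, False if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    table = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # multiply in the fraction of the risk set that survives this step
            survival *= (at_risk - d) / at_risk
            table.append((t, at_risk, d, survival))
        # failures and censored individuals both drop out of the risk set
        at_risk -= exits[t]
    return table

# Hypothetical toy data (NOT the harddrive dataset): time observed, and
# whether a failure was seen (False means the record was censored)
hours = [5, 8, 8, 12, 15, 15, 20, 20]
failed = [True, True, False, True, False, True, False, False]

for t, n, d, s in kaplan_meier(hours, failed):
    print(f"t={t:>2}  n_i={n}  d_i={d}  S={s:.3f}")
```

Note how censored individuals never trigger a step in the curve, but they do shrink $n_i$ for every later step, which is how the estimator uses partial information from drives that had not failed when observation ended.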

Let's apply the Kaplan-Meier model to our harddrive failure data, at an overall level, and also look at subsets by manufacturer and capacity.

Observations from Kaplan-Meier modelling notebook:

- Overall, the population of harddrives shows very low failure rates (long durations for time-to-event), with approx. 1.4% failure after 1 year of power-on and 23% failure after 5 years of power-on.
- When separating the dataset by manufacturer, we see overall that Seagate drives demonstrate a far larger failure rate than HGST and WDC: approx. 40% failure at 5 years of power-on.
- When separating by capacity too, we see a much clearer picture:
    - The high failure rate for Seagate seems to be mainly in their 2TB and 3TB drives, with the latter having a measured halflife of approx. 23,000 hours or 2.6 years.
    - As discussed in the Notebook, most of this dataset has so few failures that we don't see a measured halflife, so it's comparatively unusual to measure a halflife at all for the Seagate drives.
    - The Seagate drives fare better in the 4TB and 6TB capacities, and show similarly good survival to HGST and WDC.

- When we isolate to the 3TB drives the comparison between drive failures becomes especially clear:
    - Seagate starts out as well as the others, with 1.3% failure at 1 year, actually faring roughly 3x better than WDC's 3.6% failure.
    - The Seagate drives then begin to drop quickly and by the 2 year mark have 15% failure, which compares badly to HGST's 0.6% and WDC's 6.4%.
    - At the 3 year mark we only have data for HGST and Seagate (WDC drives were not in use for long enough) and the comparison is poor: Seagate has experienced 58.3% failure compared to 1.7% for HGST, roughly a 35:1 ratio of failures.
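The failure percentages and halflives quoted above are all read off the fitted survival curve. A minimal sketch of those two lookups is shown here, assuming the curve is available as a list of `(power-on hours, survival probability)` steps. The helper names and both curves are hypothetical and are not the Backblaze results.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours

def survival_at(curve, t):
    """Step-function lookup: product over steps with t_i < t (1.0 before any step)."""
    s = 1.0
    for event_time, value in curve:
        if event_time >= t:
            break
        s = value
    return s

def halflife(curve):
    """Time of the first step that takes S to 0.5 or below, or None if none does."""
    for event_time, value in curve:
        if value <= 0.5:
            return event_time
    return None

# Hypothetical curves for illustration only
fragile = [(6000, 0.98), (15000, 0.85), (23000, 0.50), (26000, 0.42)]
robust = [(9000, 0.995), (20000, 0.985), (30000, 0.98)]

for name, curve in [("fragile", fragile), ("robust", robust)]:
    failed_2y = 1 - survival_at(curve, 2 * HOURS_PER_YEAR)
    print(f"{name}: failure at 2 years = {failed_2y:.1%}, halflife = {halflife(curve)}")
```

The `None` result for the second curve mirrors the situation described above: when fewer than half of a group has failed by the end of observation, the curve never crosses 0.5 and no halflife can be measured.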

## Outputs & Capabilities of Kaplan-Meier modelling

As demonstrated, the `lifelines` package makes it very simple to run a Kaplan-Meier model:

```
import lifelines as sa  # assumption: `sa` in this snippet is an alias for lifelines

## create & fit the model to our dataframe df (runs in <1sec)
km = sa.KaplanMeierFitter()
km.fit(durations=df['maxhours'], event_observed=df['failed'])
## km object now contains the calculated survival function for plotting etc
```

... but this is a non-parametric, non-generalising model which simply 'remembers' all the input data to create a survival function. Without a parametric form we can't use the model to predict new, unseen data, and it's hard for us to quantify the relative impact upon survival of a harddrive coming from e.g. Seagate vs HGST, or being e.g. 4TB vs 3TB.

### Next Steps

What we really need is a way to determine the proportional impact of different factors in the dataset. In part 4 of this series we will look at the Cox Proportional Hazards model, which does just that.