We've reviewed the basic theory of survival analysis and discussed why it's a useful technique; now let's acquire, explore and prepare a real dataset for analysis.
As mentioned in the previous post in this series, survival analysis is long established within actuarial science and medical research but seems infrequently used in general data science projects. In this post we'll acquire a small, suitable dataset and prepare it for further analysis - demonstrating some general tools & techniques along the way.
Let's find an interesting dataset
The quality and volume of open data - that which is freely available for reuse and analysis - has grown over recent years, and if you look carefully, there's likely an interesting dataset that's ideal for your latest algorithm or visualisation¹.
In rough order of increasing data complexity and reducing cleanliness, good places to look for such data include:
- The UCI Machine Learning Repository - home to a few hundred classic datasets
- Competitions on Kaggle - for better or worse these can be riddled with real-world data quality problems, and beware of licensing issues
- The Wikimedia Foundation research team - lots of info on browsing and article content
- Opened datasets from CrowdFlower, a data-enrichment company
- Government-curated open datasets from e.g. the UK, the EU, and USA - highly varying data quality throughout
- This large list of publicly available datasets on Github
- ... and there are a huge number of requests, discussions and links on the /r/datasets subreddit.
After a solid search, I came upon a rich dataset of hard drive failure data, open sourced and made freely available by BackBlaze, an online file backup and storage company, in Feb 2015.
The full dataset comprises nearly 20 months of logfiles describing the daily uptime and failure status of >40,000 hard drives installed at the company over the period April 2013 to Dec 2014 inclusive. The dataset contains information about the hard drive model and capacity, the dates of operation and failure, and up to 40 different S.M.A.R.T. stats about each drive, potentially making for quite a rich survival analysis.
What's even more interesting is that BackBlaze have undertaken and released their own analyses of hard drive reliability, so we have something to compare our results against.
Preparing an analysis environment
We'll use an interactive Python environment for analysis and a Git repository for source control. As mentioned in a previous post, Python has become a default choice for many data scientists for most general work, thanks to its wide range of libraries, high-quality software engineering and increasingly excellent visualisation tools; Git is of course a standard choice for distributed version control, vital for maintaining and sharing code.
For now, I've decided to host the code and analyses from this short investigation on a Bitbucket repository. If you'd like named access then please feel free to get in touch.
Detail on Python
Python is available in many flavours from many sources, and we currently like to use the Anaconda distribution from Continuum Analytics, largely because of the ease of use of their `conda` package & virtual environment manager.

Here's a snippet at my OSX Terminal to create a new Python 3.4 virtual environment called `survivalenv`, including most of the packages I want to use:

```shell
MBP:survival jon$ conda create -n survivalenv python=3.4 --file requirements_conda
```
Inside the file `requirements_conda` is a list of packages I want to install into this virtual environment, including:
- ipython, ipython-notebook, ipython-qtconsole - for an interactive Python environment with notebooks and console
- numpy, scipy, matplotlib - the standard set of Python libraries for scientific computing
- pandas - the now-standard library for handling tabular data
- scikit-learn - the now-standard library for general machine-learning
- seaborn - a newish library capable of really beautiful data visualisations
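For completeness, `requirements_conda` is just a plain-text file with one package specification per line (conda ignores lines starting with `#`). Based on the list above, it might look something like this sketch:

```
# requirements_conda - packages for the survival analysis environment
ipython
ipython-notebook
ipython-qtconsole
numpy
scipy
matplotlib
pandas
scikit-learn
seaborn
```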
After initialising this environment, we'll also install the main package that we'll be using later in this survival analysis: lifelines by Cam Davidson-Pilon of "Bayesian Methods for Hackers" fame. Lifelines contains a variety of efficient, well-documented algorithms for survival analysis that were previously only available in R libraries, making it ideally suited to our small-data task.
Acquiring and storing the data
BackBlaze have made this first step really easy for us - full instructions are at their main dataset blogpost here. They've not only provided cleaned logfiles for the entire period, but also a small set of SQL files to create a sqlite database with basic counts of failure rates as reported in their study.
The raw data totals ~17.5 million rows of daily log data (one row per drive per day), and the uncompressed sqlite database is approx 3GB in size.
The BackBlaze study is not especially detailed (hence our investigation) and the SQL scripts only produce basic summaries. We need a table of counts of days of operation for each drive, so let's quickly execute a chunk of new SQL against the sqlite DB:
```sql
-- prepare_survival.sql
.echo on
drop table if exists drive_survival;
create table drive_survival as
select
    serial_number as diskid
    ,model as model
    ,capacity_bytes as capacitybytes
    ,min(date) as mindate
    ,max(date) as maxdate
    ,count(date) as nrecords
    ,min(smart_9_raw) as minhours
    ,max(smart_9_raw) as maxhours
    ,sum(failure) as failed
from drive_stats
group by serial_number, model, capacity_bytes;
.echo off
```
Now the new table `drive_survival` has, for each individual drive throughout the study:

- the drive manufacturer and capacity
- the first and final dates of operation
- the number of days on which it appears (this may not equal the elapsed days between first and final dates)
- the first and final values of `smart_9_raw`, the running total of hours in operation
- a flag of whether the drive failed or survived the whole way through
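With the summary table in place, pulling it into pandas for exploration is straightforward. A minimal sketch, assuming the sqlite database file is called `drive_stats.db` (adjust the path to wherever you built yours):

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_drive_survival(db_path: str) -> pd.DataFrame:
    """Read the drive_survival summary table into a DataFrame."""
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql("select * from drive_survival", conn)


# df = load_drive_survival("drive_stats.db")
```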
Initial observations, data cleaning and feature engineering
Any statistician or technology article will tell you that data preparation takes the majority of time on a data science project; an experienced statistician will recognise that this is a vitally important part of the process, helping to further understand both the problem to be answered and the nature of the insights possible.
Unless the data source is an extremely well-administered database with well-maintained documentation, testing and support, then it's inevitable that the analyst will need to spend time interrogating the dataset, understanding patterns, correcting erroneous values, explaining and interpolating missing entries etc. Julia Evans has a great blogpost elaborating more on how this is a critical part of the process.
When we approach a new project we always run the dataset (or a representative sample) through an initial preparation and investigation. In the following frame is an IPython Notebook, rendered to HTML and hosted from a static webserver on Amazon S3. The HTML rendering is so that we can share it here using our blogging platform, but ordinarily we might simply run the notebook on a server so that we can write live code and notes side by side in a reproducible, living, versioned report.
To summarise the initial data prep notebook:
- After excluding erroneous data and some drives belonging to under-represented manufacturers and odd capacities, we have records for
- We have massive left-truncation, with a large volume of hard drives already in daily operation at the first date of the study. This means that a drive's lifetime can't simply be calculated as the duration between its first and final dates of operation.
- Fortunately we have access to the S.M.A.R.T. 9 statistic, which is a running total of hours of operation. We can use this to calculate a lifetime of operational hours.
- We have two clean features to use for predicting lifespan: manufacturer and capacity. The maximum observed lifespan is approx 48,000 hours / 2,000 days / 5.5 years, which may lead to some useful real-world insights.
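To make the left-truncation point concrete, here's one plausible way to encode each drive for survival analysis using the SMART 9 readings: the final reading gives the total power-on hours (the duration), the first reading gives the hours already accrued when the drive entered the study window, and the `failed` column gives the event flag. Column names follow the SQL table above; this is a sketch of the idea, not the final prep code:

```python
import pandas as pd


def to_survival_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Derive duration, entry and event columns from drive_survival."""
    out = pd.DataFrame(index=df.index)
    # SMART 9 is a running total of power-on hours, so the final
    # reading is the drive's total observed lifetime in hours...
    out["duration_hours"] = df["maxhours"]
    # ...and the first reading is the age at which the drive entered
    # the study window (non-zero for left-truncated drives).
    out["entry_hours"] = df["minhours"]
    # failed sums the daily failure flags: truthy if the drive failed
    # during the study, zero if it was right-censored at study end.
    out["event"] = df["failed"] > 0
    return out
```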
Now that the data is acquired, stored, cleaned and well-understood, we're ready to undertake some survival analysis in part 3 of this series.
Practicing data scientists / statisticians will be familiar with the classic datasets used to benchmark algorithmic performance and demonstrate core concepts, such as Iris (multivariate classification), Twenty Newsgroups (natural language processing), Handwritten Digits (supervised multivariate classification), and EEG Database (time-series indexing). ↩