Bayesian inference bridges the gap between white-box model introspection and black-box predictive performance. This technical series describes some methods using PyMC3, an inferential framework in Python.
Preamble
I find it useful to think of practical data science as spanning a continuum between traditional statistics and machine learning.
On the left hand side, the tools and techniques of the Statistician | On the right hand side, the tools and techniques of the Informatician |
---|---|
Simple feature engineering, simple classifications/regressions and introspectable 'white-box' models easily expressed in mathematical notation. | Computer systems to store, process and retrieve information, complex data processing and 'black-box' models to learn from data. |
Models are intentionally simpler and may have closed-form expressions easier to compute. They may be created according to known physical laws and/or scientific assumptions about the outside world. | The models may also have a simple form, but typically require heavy computation and have high overall complexity due to combinatorics, non-linear transforms, local optimisations, stochasticity and ensemble effects. |
(Generalized) Linear Models, Linear Compression (PCA / SVD), Decision Trees etc. | Random Forests, Support Vector Machines, (Deep) Neural Nets, Non-linear transforms etc. |
To illustrate:
We might use tools from traditional statistics when creating a time-to-event model because it's useful to know exactly how a particular attribute affects survival. e.g. "How long can I expect a hardrive from Manufacturer A to last compared to B?"
We might use tools from machine learning when clustering data because we want to make unbiased predictions and let the data speak for itself. e.g. "Are these datapoints somehow similar and can I use that to predict future data?"
The strength of the 'Statisticians approach' is the ability to learn about the effects of different inputs upon the model & results - defining behavioural rules for later manipulation.
The strength of the 'Informaticians approach' is to make fewer assumptions about the exact form of the model and let the data do the talking - making predictions about future data.
The best of both worlds: Bayesian Inference
Somewhere towards the middle of the continuum lies Bayesian Inference.
In a nutshell, this is the application of Bayesian probability theory to create powerful white-box models which require quite heavy black-box computation. The benefits are model introspection, high-quality predictions, hypothesis testing and model evaluation.
Background
We've discussed Bayesian analysis in a previous series of articles, but to recap:
We use Bayes Theorem in a principled way, to relate the posterior probability $P(H|D)$ of a hypothesis $H$ given evidence of data $D$, to the likelihood $P(D|H)$ of the data under that hypothesis and the prior probability $P(H)$ of that hypothesis being correct. The denominator $P(D)$ is the marginal likelihood which is the same for all hypotheses and can be ignored.
$$P(H|D) = \frac{P(D|H) P(H)}{P(D)}$$
To belabour the point, this is a very natural way of setting and evaluating expectations in the real world: we assume an initial hypothesis (the coffee looks hot), gather data (the coffee is actually cold), and update our beliefs (coffee that looks hot might actually be cold). It is different from Frequentist statistics in many ways, the largest of which is that Bayesian statistics allows for prior knowledge, and it also makes predictions based on the data available, rather than extrapolating beliefs to infinite data collection^{1}.
Great, but how do we actually use this for modelling?
What do we want from a model? A function that maps input data to create an approximation of some other data.
Lets define:
- $X$ is a vector of $n$ input datapoints $x_{1}, x_{2} ... x_{n}$ which may be each described by a single value, or several features (dimension $d$)
- $y$ is a vector of $n$ values to be predicted, one per datapoint in $X$
- $\theta$ is a vector of parameters size $d$, which define the distribution of each datapoint $x$ i.e $x \sim p(x|\theta)$
Then we declare a model specification wherein:
- $p(\theta)$ is the prior distribution or belief of $\theta$ before any data is observed
- $p(X|\theta)$ is the likelihood of the observed data given the parameters $\theta$
- $p(X)$ is the marginal likelihood which we ignore as above
- The posterior distribution of the parameters $p(\theta|X)$ is:
$$p(\theta|X) \propto p(X|\theta) p(\theta)$$
Now in order to learn the correct values (or more properly, the distributions) for $\theta$ we need to evaluate the posterior predictive distribution for values $y$ and in the continuous case, this takes the form:
$$p(y|X) = \int_{\theta} p(\theta|X) p(y|\theta) d\theta$$
Unfortunately, calculating the integral over all values of $\theta$ is a serious challenge. In some cases we can use conjugate priors which means the likelihood and the prior have the same functional form, and thus the data directly updates the parameters $\theta$.
However this isn't very generalisable, and multidimensional integration is hard. This is about as far as we can get without some method for estimating the parameter values.
MCMC Sampling
In the early nineties, an efficient technique for efficiently estimating the parameter values - Markov-Chain Monte-Carlo (MCMC) sampling - was brought into the mainstream by researchers including Alan Gelfand^{2} and Adrian Smith.
The general principle is to:
- assume a functional form (probability distribution) for parameters $\theta$ according to hyperparameters $\alpha$
- 'sample' this joint probability distribution to get a vector of single values for $\theta$ across $d$
- compute the posterior predictions $\hat{y}$ for observed datapoints and compare to the ground truth $y$ using a loss function
- update the joint $\theta$ distributions according to the new density of samples
- repeat the sampling and evaluation many more times (typically 100x to 10,000x), seeking to take progressively better samples of $\theta$ according to minimising the loss function, and arriving at convergence where any new samples don't improve the loss
- the MCMC sampling takes the form of a Markov-Chain (the position of step $n+1$ is dependent only upon the position of step $n$, and otherwise independent of all other steps) which moves around the joint distribution in a semi-random manner, the distance and direction decided according to specific rules of the sampling method, including randomness (the Monte-Carlo aspect), gradient-seeking and momentum.
This approach requires a lot of computational power, and some clever maths to ensure the sampling converges and the loss function is minimised well. There's a potted history of the development of MCMC which is great reading, and mentions some of the first software to be widely available from 1991: BUGS (Bayesian inference Using Gibbs Sampling).
This blog series focusses on a fairly new software library to perform MCMC in pure Python: PyMC3. Gladly, it is open source and under active development from a set of well-qualified, dedicated volunteers.
PyMC3 is an iteration upon the prior PyMC2, and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers using the Theano library (of deep-learning fame) for gradient computation. Of particular interest is that it includes the Non U-Turn Sampler (NUTS) developed recently by Hoffman & Gelman in 2014, which is only otherwise available in STAN.
STAN is a major^{3} open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. STAN also has HMC and NUTS samplers, and recently, Variational Inference - which is a very efficient way to approximate the joint probability distribution. Models are specified in a custom syntax and compiled to C++ but it has has interfaces to Python, R etc. to make life easier. Since Mick Cooney is a fan, we'll almost certainly discuss it in future blogposts.
Ordinary Least Squares Regression with PyMC3
Finally we're ready to demonstrate Bayesian Inference using PyMC3. This is a potentially huge topic, and I'll leave plenty for future posts. For now, let's start really simple with a tiny dataset and the intention to find a linear model that explains it.
Theory of a basic Linear Regression Model:
Stated in Frequentist syntax:
$$\bf{y} = \beta^{T} \bf{x} + \bf{\epsilon}$$
... where for datapoint $i \in n$:
$y_{i}$ = output feature value
$x_{i}$ = input feature values length $j$
$\beta$ = coefs = $[1, \beta_{j}]$
$\epsilon_{i} \sim N(0,\sigma^2)$ = independent Gaussian noise on the measurement error
For ordinary least squares (OLS) regression, we can quite conveniently solve this (find the $\beta$ values) using the Maximum Likelihood Estimate (MLE) which has a closed analytical form:
$$\sum_{i=1}^{n}(y_{i} - x_{i}^{T}\beta)x_{i}^{T} = 0$$
$$\beta_{MLE} = (\bf{X}^{T}\bf{X})^{-1}\bf{X}^{T}\bf{y}$$
For non-OLS regression, finding $\beta_{MLE}$ is an optimisation problem, which I won't get into here.
Stated slightly differently in a Bayesian syntax:
$$\bf{y} \sim \mathcal{N}(\beta^{T} \bf{x},\sigma^{2})$$
... where for datapoint $i \in n$:
$y_{i}$ is a sample from a $\mathcal{N}$ormal distribution defined by
mean $\mu = \beta^{T} x_{i}$ and variance $\sigma^{2}$
This syntax^{4} indicates the concept that the realised datapoints are just samples from a distribution of many possible values, itself described by a linear model with variance.
In order to solve this model formulation, we can express the likelihood as the probability of $y_{i}$ given the model parameters:
$$P(y_{i} | \beta,x_{i},\sigma^2) = \mathcal{N}(\beta^{T} x_{i}, \sigma^{2})$$
and maximise the overall likelihood $\mathcal{L}$ using:
$$\mathcal{L} = \prod_{i=1}^{n}{P(y_{i} | \beta,x_{i},\sigma^2)}$$
Usually one would maximise the log-likelihood $log\mathcal{L}$ which is convenient to calculate since it becomes a summation, and compresses the otherwise potentially large likelihood values. Relating this back to the Frequentist MLE, we're trying to find the values of $\beta$ which minimise the first-derivative of the likelihood:
$$\beta_{MLE} = \overset{argmin}{\beta} = \frac{1}{2} \; \sum_{i=1}^{n}(y_{i} - x_{i}^{T}\beta)^{2}$$
Enough maths for now! Lets create a toy dataset and fit it using firstly a Frequentist OLS model, and then a Bayesian OLS model.
Create a very simple dataset and plot it:
I'll use the usual suspects: numpy
and pandas
to create a dataset, and the ever-excellent seaborn
library for plotting.
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="darkgrid", palette="muted")
rndst = np.random.RandomState(0)
def generate_data(n=20, a=1, b=1, c=0, latent_error_y=10):
'''
Create a toy dataset based on a very simple linear model
that we might imagine is a noisy physical process
Model form: y ~ a + bx + cx^2 + e
'''
## create linear or quadratic model
df = pd.DataFrame({'x':rndst.choice(np.arange(100),n,replace=False)})
df['y'] = a + b*(df['x']) + c*(df['x'])**2
## add latent error noise
df['y'] += rndst.normal(0,latent_error_y,n)
return df
df = generate_data(a=5, b=2, latent_error_y=30)
g = sns.lmplot(x='x', y='y', data=df, fit_reg=False
,size=6, scatter_kws={'alpha':0.8, 's':60})
## NOTE: `lmplot()` will fit and plot a lin. reg. line by default.
## Not used here, but can greatly help data exploration in practice.
Observe
- We have 20 datapoints described by a random value in x and a linear relation to y plus some Gaussian noise:
- The parameters $\beta$ of our data are $[5, 2]$
- The variance $\sigma^{2}$ of the noise in the data is $30$
- The true 'clean' or non-noisy value of y is overplotted in green
Create and fit a Frequentist OLS model
I'll use the Python statsmodels
library which has nice output similar to R's lm
and glm
functions. Also using patsy
for symbolic model specification.
import patsy as pt
import statsmodels.api as sm
## first, encode model specification as design matrices
fml = 'y ~ 1 + x'
(mx_en, mx_ex) = pt.dmatrices(fml, df, return_type='dataframe'
,NA_action='raise')
## fit OLS model and print results
smfit = sm.OLS(endog=mx_en, exog=mx_ex, hasconst=True).fit()
print(smfit.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.864
Model: OLS Adj. R-squared: 0.857
Method: Least Squares F-statistic: 114.6
Date: Tue, 09 Feb 2016 Prob (F-statistic): 3.10e-09
Time: 13:16:59 Log-Likelihood: -92.440
No. Observations: 20 AIC: 188.9
Df Residuals: 18 BIC: 190.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -2.0366 11.614 -0.175 0.863 -26.437 22.364
x 1.9984 0.187 10.705 0.000 1.606 2.391
==============================================================================
Omnibus: 0.384 Durbin-Watson: 3.042
Prob(Omnibus): 0.825 Jarque-Bera (JB): 0.520
Skew: 0.125 Prob(JB): 0.771
Kurtosis: 2.251 Cond. No. 125.
==============================================================================
Lets evaluate the model
1. Model Parameters
The statsmodels
OLS class computes a menagerie of test statistics, but we're most interested in the estimates for our $\beta$, expressed as:
- Intercept: $\mu=-2.04, \sigma=11.61$
- x: $\mu=2.00, \sigma=0.19$
The Frequentist OLS has made a fairly good estimate of x, but the Intercept is quite imprecise under a linear model.
2. Model prediction
Lets plot the posterior prediction:
## NOTE: I'll use `seaborn`'s `lmplot` again.
## This will actually run its own regression, but it uses statsmodels OLS,
## so while strictly speaking this won't plot exactly the same prediction
## as the results above, it will be more than close enough for our purposes.
g = sns.lmplot(x='x', y='y', data=df, fit_reg=True
,size=6, scatter_kws={'alpha':0.8, 's':60})
Create and fit a Bayesian OLS model
Here, finally we'll use pymc3
to define a model specification and take samples of $\beta$ from the joint probability distribution and minimise the loss of the log-likelihood evaluated against the posterior predictive distribution.
import pymc3 as pm
from scipy.optimize import fmin_powell
with pm.Model() as mdl_ols:
## Use GLM submodule for simplified patsy-like model spec
## Use Normal likelihood (which uses HalfCauchy for error prior)
pm.glm.glm('y ~ 1 + x', df, family=pm.glm.families.Normal())
## find MAP using Powell optimization
start_MAP = pm.find_MAP(fmin=fmin_powell, disp=True)
## take samples using NUTS
trc_ols = pm.sample(2000, start=start_MAP, step=pm.NUTS())
Applied log-transform to sd and added transformed sd_log to model.
Optimization terminated successfully.
Current function value: 126.077720
Iterations: 6
Function evaluations: 221
[-----------------100%-----------------] 2000 of 2000 complete in 3.4 sec
There's a lot of detail in the above:
- We created a
PyMC3
model usingpatsy
model specification syntax for convenience. In the background this created a model and a set of 'free' and 'observed' variables in atheano
compute graph. Compared to STAN, the model specification is written in pure Python, and no external C++ code is compiled. - We used an optimizer from
scipy
to find the Max A-Posteriori (MAP) estimate of the joint probability distribution, and thus a good starting point for the sampler. - We then used the NUTS sampler to take 2000 samples from the joint probability distribution of $\beta$ and converge via evaluating and iteratively minimising the loss on the posterior predictive distribution $\bf{\hat{y}}$ with respect to the true values $\bf{y}$.
The results of the above are in the form of a 'trace' for each $\beta$ parameter, each 2000 samples long. The first few traces are likely to be poor estimates of the parameter values, because the sampling has not yet converged or 'burned-in'. We'll ignore the first 1000 samples as 'burn-in', and only consider the final 1000.
Lets view the traces:
ax = pm.traceplot(trc_ols[-1000:], figsize=(12,len(trc_ols.varnames)*1.5),
lines={k: v['mean'] for k, v in pm.df_summary(trc_ols[-1000:]).iterrows()})
We can use these traces in two very powerful ways:
The traces form a marginal distribution on each parameter (shown in the left-hand facets), letting us declare distributional statistics directly from the data with no assumptions about their functional form. We can quote the mean, variance, credible regions (CR) etc, and interrogate the full distribution to help determine & better understand unusual behaviours of the parameter values: skew, kurtosis, multi-modalities etc.
The raw trace values (shown in the right-hand facets) can be used to create 1000 posterior predictions of $y$ for any value of $x$, again without any assumptions about the functional form, and again, we can quote any distributional statistics we like from that distribution. The traces also offer a diagnostic check on the convergence (or lack thereof) of the model.
1. Model parameters
Lets take a look at the parameter values, using a convenience function:
print(pm.df_summary(trc_ols[-1000:]))
mean sd mc_error hpd_5 hpd_95
Intercept -1.367470 13.444980 0.658985 -25.138135 26.942143
x 1.989229 0.215509 0.010461 1.572763 2.432014
sd_log 3.305677 0.186010 0.012743 2.975143 3.683130
sd 27.758150 5.457890 0.374594 19.563791 39.734219
We're most interested in the estimates for our $\beta$, expressed as:
- Intercept: $\mu=-1.37, \sigma=13.44$
- x: $\mu=1.99, \sigma=0.22$
- sd (the Gaussian noise): $\mu=27.76, \sigma=5.46$
The Bayesian OLS has made very similar estimates of the parameter values as the Frequentist OLS. This is good to see, since both linear models are actually very similar.
Note we now also have an estimate of the Gaussian noise parameter, which is quite close to the true value of $30$.
We also have real intervals a.k.a credible regions (CR) on the estimates of the parameter values. In the Frequentist OLS model the 'confidence intervals' are created by fitting a Normal distribution over the point-estimate MLE values of each parameter. In the Bayesian inferential method, we can simply use the distribution of samples (after convergence) to learn the uncertainty in the parameter values.
2. Model posterior prediction
import matplotlib.pyplot as plt
def plot_posterior_cr(mdl, trc, rawdata, xlims, npoints=1000):
'''
Convenience fn: plot the posterior predictions from mdl given trcs
'''
## extract traces
trc_mu = pm.trace_to_dataframe(trc)[['Intercept','x']]
trc_sd = pm.trace_to_dataframe(trc)['sd']
## recreate the likelihood
x = np.linspace(xlims[0], xlims[1], npoints).reshape((npoints,1))
X = x ** np.ones((npoints,2)) * np.arange(2)
like_mu = np.dot(X,trc_mu.T)
like_sd = np.tile(trc_sd.T,(npoints,1))
like = np.random.normal(like_mu, like_sd)
## Calculate credible regions and plot over the datapoints
dfp = pd.DataFrame(np.percentile(like,[2.5, 25, 50, 75, 97.5], axis=1).T
,columns=['025','250','500','750','975'])
dfp['x'] = x
pal = sns.color_palette('Purples')
f, ax1d = plt.subplots(1,1, figsize=(7,7))
ax1d.fill_between(dfp['x'], dfp['025'], dfp['975'], alpha=0.5
,color=pal[1], label='CR 95%')
ax1d.fill_between(dfp['x'], dfp['250'], dfp['750'], alpha=0.4
,color=pal[4], label='CR 50%')
ax1d.plot(dfp['x'], dfp['500'], alpha=0.5, color=pal[5], label='Median')
_ = plt.legend()
_ = ax1d.set_xlim(xlims)
_ = sns.regplot(x='x', y='y', data=rawdata, fit_reg=False
,scatter_kws={'alpha':0.8,'s':80, 'lw':2,'edgecolor':'w'}, ax=ax1d)
xlims = (df['x'].min() - np.ptp(df['x'])/10
,df['x'].max() + np.ptp(df['x'])/10)
plot_posterior_cr(mdl_ols, trc_ols, df, xlims)
In the above code, we've used the parameter trace values to generate new predictions $\bf{\hat{y}}$ according to the model specification. We've then measured credible regions over that distribution and overplotted a line of fit and credible regions CR (in purple) on the original datapoints. The two credible regions shown are at 50% and 95%, meaning that we expect 50% of the datapoints to fall within the 50% CR and likewise for the 95% CR. You can see by eye that this is about correct.
Next steps
This was something of a whirlwind tour, but we've only just scratched the surface. My intention has been to start slowly from the basics, and demonstrate the differences and advantages of Bayesian inference vs Frequentist statistics - at least for very simple OLS Linear Regression.
In the next blogpost I'll use a real dataset and demonstrate the flexibility of Bayesian inference - and PyMC3 in particular - regarding model specification and validation.
Finally
I'd like to mention a new meetup group in London that I'm helping to organise alongside Markus Gesmann: it's called Bayesian Mixer, and will be dedicated to open discussion of Bayesian statistics in a normal presentation & social format on a semi-regular basis.
The first meeting is the evening of Friday 12 Feb in Old Street, London and you can read much more (and sign up) here on Markus' blog.
Jake Vanderplas has an excellent blog series on Bayesian vs Frequentist statistics. Essential reading. ↩
With whom we've had the pleasure of working, albeit briefly. Hello if you're reading! This appears to be the paper that kickstarted it all. ↩
You could say STAN is the gold-standard at present, although there's several frameworks available (JAGS, STAN, PyMC2/3) and specific samplers (emcee) and they all have various tradeoffs. I really like the ease-of-use of PyMC3, and there's a handful of comparisons available online to give you a background. ↩
Hat-tip to Thomas Wiecki for his blogposts on this very subject, from which I've borrowed one or two things. ↩