In this second article of the technical series, we demonstrate the flexible syntax of PyMC3 through regularized linear modelling of car emissions data, together with model evaluation.

In the previous blogpost I described where I see Bayesian inference fitting into the data scientist's toolkit, and covered some of the theory with a worked example of linear regression on a dummy dataset.

In this post I'll focus on a little more theory and a few more features of PyMC3 which allow for feature selection and model validation. In the final post I'll look into hierarchical linear models.

Of course, Bayesian inference is a huge topic and this short series of three blogposts is hardly the best medium for conveying its full complexity: I won't cover more than a few things here. However, they will hopefully serve as a reference, with links to help you, the reader, go off and do your own research.

A full reading list for Bayesian inference is worthy of an article by itself, and in fact that's exactly what Alex Etz et al. have very recently published (online and in print) in a comprehensive overview of 36 papers and books. Looks like a good place to start.

Curiously, they don't mention several books in the canon, so I'll note some here:

• Doing Bayesian Data Analysis, John Kruschke. Future classic, very applied and practical, more info online, see also this notebook containing PyMC3 ports of all the code examples
• Data Analysis using Regression and Multilevel / Hierarchical Models, Gelman & Hill. This one is quite applied and practical, more info online
• Bayesian Data Analysis, Gelman et al. This is very comprehensive with a steep learning curve, more info online
• Information Theory, Inference, and Learning Algorithms, David MacKay. Getting a little old now, but it covers a lot of ground and is very well written, available for free download

For PyMC3-related material, you might want to read:

If I've missed anything above, let me know and I'll add to the list.1

# Let's find an interesting dataset

In a throwback to a previous technical series on survival analysis I'm looking for data again.

We're currently quite interested in road traffic and vehicle insurance, so I've dug into the UK VCA (Vehicle Type Approval) to find their Car Fuel and Emissions Information for August 2015. The raw dataset is available for direct download and is small but varied enough for our use here.

I will investigate the car emissions data from the point-of-view of the Volkswagen Emissions Scandal which seems to have meaningfully damaged their sales. Perhaps we can find unusual results in the emissions data for Volkswagen.

# Regression, Feature Selection and Model Evaluation

For brevity, I'll skip the Data Engineering steps (Acquisition, Cleaning, Sampling, Feature & Item Engineering, Quality Control) and launch right into...

1. Data exploration and an OLS Linear Regression
2. Basic feature selection using Lasso Regression - a good excuse to demonstrate the ease of changing model specifications
3. A further discussion of constraints, demonstrated with a Ridge Regression
4. Basic model evaluation and comparison using the built-in Deviance Information Criterion (DIC) and Posterior Predictive Checks
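To make the link between steps 2 and 3 concrete: in the Bayesian framing, the Lasso's L1 penalty falls out of placing a Laplace prior on each coefficient, and the Ridge's L2 penalty out of a Normal prior, because the negative log-density of each prior is (up to an additive constant) the corresponding penalty term. A minimal numpy sketch of that correspondence (function names are mine, not from the Notebook):

```python
import numpy as np

def neg_log_laplace(w, b=1.0):
    # Laplace(0, b): -log p(w) = |w| / b + const  ->  the L1 penalty
    return np.abs(w) / b

def neg_log_normal(w, sigma=1.0):
    # Normal(0, sigma): -log p(w) = w**2 / (2 * sigma**2) + const  ->  the L2 penalty
    return w ** 2 / (2 * sigma ** 2)

w = np.array([-2.0, 0.0, 3.0])
print(neg_log_laplace(w))  # [2.  0.  3. ] - grows linearly, pulls small coefficients to zero
print(neg_log_normal(w))   # [2.  0.  4.5] - grows quadratically, shrinks large coefficients hardest
```

This is why switching a PyMC3 model from Lasso-like to Ridge-like behaviour amounts to swapping the prior distribution on the coefficients - a one-line change in the model specification.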

# Summary & Next Steps

That was rather a long Notebook! I set out to determine if there's anything strange about Volkswagen's NOx emissions results, and I'll freely admit to getting somewhat distracted by demonstrating some of the flexibility of PyMC3 and methods for evaluating results.

I covered quite a lot of ground:

• Quick data exploration, inc 1d, 2d and Nd distribution plotting
• OLS Linear Regression, comparing Frequentist and Bayesian models
• Lasso L1 Regularized Linear Regression, comparing Frequentist and Bayesian models
• Ridge L2 Regularized Linear Regression, comparing Frequentist and Bayesian models
• Naive Model Comparison, inc DIC and Posterior Predictive Checks (MSE, R2 and KS-test)
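As a rough illustration of the Posterior Predictive Check metrics listed above, the same statistics can be computed directly with numpy and scipy. Here `y_rep` is a synthetic stand-in for draws from a fitted model's posterior predictive, so the numbers themselves are meaningless - only the mechanics are shown:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-ins: observed targets and one replicated dataset from the
# (hypothetical) posterior predictive of a fitted model
y_obs = rng.normal(loc=5.0, scale=1.0, size=200)
y_rep = y_obs + rng.normal(scale=0.5, size=200)

# Mean squared error between observed and replicated values
mse = np.mean((y_obs - y_rep) ** 2)

# Coefficient of determination (R^2)
ss_res = np.sum((y_obs - y_rep) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Two-sample KS test: does the replicated distribution match the observed one?
ks_stat, p_value = stats.ks_2samp(y_obs, y_rep)

print(f"MSE={mse:.3f}, R2={r2:.3f}, KS={ks_stat:.3f} (p={p_value:.3f})")
```

In practice you would compute these across many posterior predictive draws rather than a single replicate, and compare the resulting distributions between candidate models.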

I don't think we're at all ready to pass judgement on whether Volkswagen's emissions data looks odd, but there are two interesting clues in the Notebook:

• The first is that the feature-value mfr_owner_is_vw[T.True] was selected by the Lasso model (both the Frequentist and Bayesian versions), indicating that it has a non-negligible effect
• The second is that in the Ridge model (both versions) the coefficient for this same feature-value receives a value quite far above zero, and indeed the vast majority (>>95%) of its posterior distribution lies above zero
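That ">>95% above zero" style of statement can be read straight off the posterior samples. A sketch with a synthetic trace - in the real Notebook the samples would come from the fitted Ridge model's trace, and the variable name here is illustrative:

```python
import numpy as np

# Stand-in for the posterior samples of the mfr_owner_is_vw[T.True]
# coefficient; in practice these come from the fitted model's trace
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.2, size=4000)

# Fraction of posterior mass above zero
prob_positive = np.mean(samples > 0)
print(f"P(coef > 0) ~ {prob_positive:.3f}")
```

This kind of direct probability statement ("the coefficient is positive with probability p") is one of the nicest practical payoffs of working with posterior samples rather than point estimates.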

In the next blogpost and Notebook, I'll demonstrate using hierarchical (a.k.a. mixed, a.k.a. partially-pooled) models to directly compare the NOx emissions amongst all the manufacturers, and the parent companies too.

# Finally

In the previous blogpost I mentioned a new meetup group in London called Bayesian Mixer which I'm helping to organise alongside Markus Gesmann. On Friday 12th Feb we hosted our first meeting upstairs at The Artillery Arms near Old St, with two short talks from Markus and myself2, and great conversation amongst the ~20 attendees about tools, techniques and interesting Bayesian statistics - plus just plain old socialising. Markus wrote up more on his blog here.

It was such a success that we've decided to create a community here on meetup.com to help us organise and publicise future events: the next is on Friday 15th April in Central London. Please feel free to join up and come along for open discussion of Bayesian statistics in a presentation & social format on a semi-regular basis. We hope the group will be a useful place for academics, practitioners and learners alike, and we're particularly interested to hear stories about real-world projects, worked examples, and new or unconventional techniques.

1. Shameless plug for this very blog. Mick Cooney in particular writes some great articles.

2. Markus spoke about assessing risk with small datasets ("Experience vs Data") - a version of which he also presented at R in Insurance last year. I spoke about Robust Linear Regression with Outlier Detection, based on a notebook I recently created for the PyMC3 examples folder.