9 Questions To Determine If You Have A Good Data Science Ecosystem

It's 2016 and most financial services companies are at least starting to implement a data science capability. Here are nine questions to gauge the maturity of yours.

Applied AI has been going for over three years now, and it's gratifying to see the financial services industry catch up with the idea of integrating data science into strategic decision making and day-to-day operations. Some organisations may already be quite advanced, while others - particularly in our focus industry of insurance - tend to be just getting started.

Previously, I've written here about our Data Science Maturity Model, and how to Deliver Value Throughout the Analytical Process. This blogpost is in the same vein, designed to elucidate what the core of a well-functioning data science capability can look like.[1]

Over the years, we have found that three critical processes lie at the heart of every data science project: data curation, machine learning, and business integration. These are high-level, and admittedly simplified names, but to my mind they're broadly distinct and complementary. Well-integrated, these logical steps form the core of a high-performance data science capability which looks quite different to business-as-usual analytics, reporting and even actuarial statistics.

Let's take a look through the three stages (data curation, machine learning, and business integration) and ask three critical questions at each. You may want to keep a running score for your in-house data science capability as you go.[2]

Data Curation

A fancy name for the process of making the right data available for modelling and maintaining it well. The adage of garbage-in-garbage-out holds especially true in data science projects, and good data is vital.

Key Questions:

  1. Do you have a centralised, up-to-date, traceable, documented repository for structured text, tabular & image datasets?
  2. Do you augment your data with public datasets to keep up with competitors and gain an edge?
  3. Can you update, maintain and optimise your primary data sources to allow for high risk/reward POC projects?

Machine Learning

As the discipline of data science has grown, so has the number of names for the associated activities of analysis and prediction. We all know that naming things is hard, but lately I see the terms "artificial intelligence", "machine intelligence", "statistical modelling", "robotic process automation", "cognitive computing" and combinations of "supervised" / "unsupervised" / "reinforcement" and "deep" "learning" used almost interchangeably in products and services marketing.

Some of these terms are very specific disciplines[3], others are pure snake oil, and it can be hard for non-techies to know the difference.

At the core, we're talking about learning from data, wherein a machine (aka computer or model) is trained upon one dataset to predict values in another. This is the empirical practice at the heart of statistics. We can use the final predictions, or simply the learned parameters within the model, to infer real-world behaviours. Hence I prefer the long-established general term "machine learning".
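To make "learning from data" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy figures (vehicle age against claim cost) are invented, and the helper names fit_line and predict are hypothetical. It uses ordinary least squares on a straight line, which is only one simple technique among many, but it shows both uses mentioned above: predicting values for unseen data, and reading the learned parameters directly.

```python
# Illustrative sketch only: the data and function names are invented.

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(params, xs):
    """Apply the learned parameters to new inputs."""
    intercept, slope = params
    return [intercept + slope * x for x in xs]


# Hypothetical training data: vehicle age in years vs. average claim cost.
train_age = [1.0, 2.0, 3.0, 4.0, 5.0]
train_cost = [120.0, 140.0, 160.0, 180.0, 200.0]

# "Training": learn the parameters from one dataset.
params = fit_line(train_age, train_cost)

# "Prediction": apply them to values the model has not seen.
new_age = [6.0, 7.0]
predictions = predict(params, new_age)

# The learned slope is itself an insight: estimated extra cost per year of age.
print("intercept, slope:", params)
print("predictions:", predictions)
```

A real model would have many more inputs, noisy data, and some estimate of uncertainty, but the shape of the exercise is the same.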

Key Questions:

  1. Do you use sophisticated statistical techniques, good software development practices and research-grade, open-source software to create reliable, accurate models?
  2. Do you document and share knowledge with your team to become a technical centre of excellence?
  3. Do you regularly validate, test, review and maintain your data pipelines, software and models to mitigate risk and allow for audit?
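As a rough idea of what the third question can mean in practice, the sketch below runs a few automated checks on a batch of records before it reaches a model. It is a simplified assumption of mine rather than a description of any particular toolkit: the field names (policy_id, driver_age, premium), the age limits and the validate_rows helper are all hypothetical.

```python
# Illustrative sketch only: field names, limits and data are invented.

def validate_rows(rows, required, ranges):
    """Return a list of (row_index, problem) tuples; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        # Required fields must be present and not None.
        for field in required:
            if row.get(field) is None:
                problems.append((i, f"missing {field}"))
        # Numeric fields, when present, must fall inside their allowed range.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((i, f"{field} out of range: {value}"))
    return problems


rows = [
    {"policy_id": "P1", "driver_age": 34, "premium": 420.0},
    {"policy_id": "P2", "driver_age": 17, "premium": None},
    {"policy_id": "P3", "driver_age": 230, "premium": 510.0},
]

problems = validate_rows(
    rows,
    required=["policy_id", "premium"],
    ranges={"driver_age": (16, 110)},
)

# Keeping the list of problems, rather than silently dropping rows,
# leaves a record that can be reviewed later.
for index, problem in problems:
    print(f"row {index}: {problem}")
```

Checks like these are cheap to run on every data refresh, and the recorded problems give reviewers and auditors something concrete to look at.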

Business Integration

A large amount of conventional business analysis lives and dies within spreadsheets and presentation documents. Expensive dashboards depend on unstable data pipelines. Huge data warehouses and "lakes" are so complicated they're barely utilised. Business integration is hard.

Key Questions:

  1. Do you have a clear path from model inference and predictions to the extrapolation of business actions and impacts?
  2. Do you regularly communicate results with non-technical stakeholders via engaging dashboards and visualisations?
  3. Can you fully integrate an automated, on-demand prediction service with live business systems?

In Review

How did you score? If you're a regular reader of this blog, chances are you did quite well. If not, maybe the questions are food for thought.

Our view is that spreadsheets, ad-hoc scripts and legacy systems are not the answer. We really want our clients to use an integrated approach to create high-quality analyses and fit-for-purpose prediction engines within a modular ecosystem.

To be even more explicit, this means: minimal usage of Excel, zero usage of VBA, zero usage of MS Access, and careful, minimal use of proprietary analytics software, particularly legacy systems like SAS and SPSS. Modern, open-source technologies are your friend.

Our answers for the above are:

  • On Data Curation: our clients typically possess a variety of datasets in separate repositories, at different levels of maturity & ownership: we join the dots and blend with external data to leverage the value within.

  • On Machine Learning: our skills in machine learning, particularly Bayesian statistics, let us create explainable models covering vital aspects including pricing, risk, reserving, marketing and customer lifecycle prediction.

  • On Business Integration: we can productionise such work to provide valuable audit, automation, testing and abstracted business integration through on-demand microservice architectures and interactive dashboards.

As always, please feel free to comment below, and you can read more and request case studies at our main website www.applied.ai

  1. I've noticed one or two similar publications from earlier in the year, including this ebook from Booz Allen which I found interesting if a little wordy. Not enough pictures.

  2. Apologies for the clickbait title! I blame (legendary Bayesian stats guru) Andrew Gelman.

  3. The variety of names for things is likely due to the confluence of disciplines as we have described before, melding: computer science, computer vision, linguistics, statistics, operational research, robotics, business intelligence and more.

Jonathan Sedar

Jon founded Applied AI in 2013. He's a specialist in Bayesian statistics and the application of machine learning to yield insights and profits throughout fintech, banking and insurance sectors.