The term 'data science' has been around for about five years, accompanied by many explanations, discussions and occasional breathless over-excitement in the technology and business press. What is it, where did it come from, and who's using it today?
Data science is a wide-ranging and rapidly evolving mathematical discipline with a general theme of helping people understand more about a situation, predict real-world actions and identify patterns in data. It involves descriptive mathematical analysis, statistical modelling, rapid iterative experimentation, agile systems development and high-quality communication.
The term 'data science' is a useful shortcut to describe the recent confluence and evolution of several previously distinct disciplines[1], made possible by the increasing availability of data, the growing sophistication of high-quality open-source software, decreasing costs of hardware and data processing, intense academic research and massive commercial and industrial interest.
The main aspects include:
Computer science (artificial intelligence, iterative software development)
Designing and creating novel software and hardware systems to create, store and efficiently analyse large volumes of data. Running advanced algorithmic processing to discover and disseminate patterns in data and predict new events.
Statistical science (experimentation and analysis, predictive modelling)
Researching new mathematical theory, designing algorithms and employing a scientific method to test them experimentally. Applying statistical theory to analyse past events or artifacts and predict future actions.
Graphic design and storytelling (data visualisation, exploration and interaction)
Distilling hypotheses, insights and predictions into cogent arguments and communicating them effectively to a variety of casual, technical or managerial audiences. Bringing the data to life and making the science accessible.
Leadership and domain expertise (industry experience, project management)
We often find that a data science task, project or programme is the first of its kind within an organisation. It takes strong will, management buy-in, clear vision and solid communication to take on what's often a tough technical challenge with potentially huge benefits to the business.
"The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades." - Hal Varian, Chief Economist, Google.
The data scientist is best brought into the core of the business, working closely with technical, operational and leadership teams to help improve decision making and critical business functions.
As an individual, or more likely a small team, they will have considerable skill in rapid and powerful software engineering, know advanced statistics, communicate effectively and have deep subject-matter expertise.
It's potentially a very varied, highly skilled role, and the remit of a data scientist may cover, for example:
- a day summarising the effectiveness of a marketing campaign
- a week acquiring and cleaning a small dataset on competitor activity
- a month researching, building and testing a statistical model to help optimise a particular business function
- a year helping to build a massive-scale recommendation engine at the heart of an advertising platform.
The 'data science Venn diagram' by Drew Conway (of DataKind, IA Ventures, Project Florida and more) is a lighthearted but surprisingly accurate summary of the skills required and regularly employed by a data scientist during the course of their work. In a discussion several years later, Drew reflected that the diagram is still very relevant and highlighted the additional vital importance of strong communication.[2]
"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician." - Josh Wills, Director of Data Science, Cloudera.
The first industry to really make use of (and thus help define) data science has been the internet-oriented technology sector, making key progress in search and advertising (Google, Facebook), communications (Twitter, Skype), entertainment (Netflix, YouTube) and consumer retail (Amazon, eBay).
These companies and others have improved the state of the art in recommendation engines, natural language processing, data compression, game-theoretic auctions, massive psychological experiments, human-computer interaction, campaign analysis, user profiling and more.
Naturally, statistical modelling and data analysis are critical, core capabilities for the typically more conservative pharmaceutical, telecoms and financial sectors too. These companies have been conducting drug discovery, network optimisation and predictive modelling for a long time, and to assume they aren't familiar with large-scale statistical data analysis would be foolish.
However, the sheer abundance of new technologies, tools and techniques available today should not be underestimated: it makes possible all sorts of high-value analysis and modelling that simply wasn't practical in years past.
"Today’s advanced analytics in insurance pushes far beyond the boundaries of traditional actuarial science... While the impetus to invest in analytics has never been greater for insurance companies, the challenges of capturing business value should not be underestimated." - McKinsey & Co., Unleashing the Value of Advanced Analytics in Insurance.
The insurance sector in particular is all about risk modelling and data analysis, and so is a natural fit for a data science approach.
Now is the time for insurance companies to take advantage of the past five years of rapid development in the data science discipline to make wide-ranging improvements throughout their businesses: improving customer satisfaction, meeting compliance requirements, reducing risks and increasing revenues.
We'll write specifically about the opportunities for the insurance industry to make best use of data science in future posts.
[1] The field of data science has grown to such an extent that dedicated books are starting to appear, with first-generation data scientists passing down their knowledge and experiences, for example the Data Science Handbook.