How to Build a Data Science Business Function

Like any collaborative business effort involving research & development, a data science function should be built carefully so as to make the best use of available expertise and technologies.

Data science is a broad discipline and companies have differing data analysis requirements. These can vary from one-off, scenario-specific modelling exercises, through regular analyses such as measuring the effectiveness of an advertising campaign, right through to live online predictive modelling of user actions. To achieve these goals, the bedrock of a data science function is a combination of great people, good software and solid business processes. In this post I want to set out how we typically approach the building of a data science team.

At Applied AI most of our project work combines bespoke data analysis with lightweight high-level software built to interpret and explore the results. In our experience it is possible to achieve high-quality data insights and effect business change using the client's existing data without needing gigantic data sets. This can be achieved by using various techniques such as intelligent problem definition, representative sub-sampling and effective experimentation.1
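The representative sub-sampling mentioned above can be sketched in a few lines. This is illustrative only - the customer table, the region field and the 10% fraction are all hypothetical - but it shows how stratified random sampling keeps a smaller dataset representative of the whole without needing gigantic data sets:

```python
# Illustrative sketch: stratified random sub-sampling over a hypothetical
# customer table, preserving the relative sizes of each stratum.
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=42):
    """Sample `fraction` of rows from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 1,000 customers spread across three regions
customers = [{"id": i, "region": ["UK", "US", "EU"][i % 3]} for i in range(1000)]
subset = stratified_sample(customers, key="region", fraction=0.1)
```

Because each stratum is sampled at the same rate, the subset mirrors the regional mix of the full table - a useful property when the analysis will be broken down by that same field.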

The data science function in an organisation may be one person or many, spanning locations and project durations, mixing people in and out as skills and availability dictate. Projects should be designed to achieve a particular aim, making full use of available expertise, existing facilities and the latest technologies.

Setting up the team

We've seen that the practising data scientist will generally use a wide variety of tools to:

  • acquire, manipulate, store and access data efficiently
  • design surveys and scientific experiments to test hypotheses
  • undertake statistically valid analyses
  • implement high-quality, optimised predictive models
  • derive and communicate actionable insights

The activities above require diverse skills covering database management, software engineering, statistical analysis, machine learning, graphic design, business experience, ethics, social responsibility, domain knowledge and communication. However, the days of simply hiring a single, unicorn-like 'full-stack' data scientist to solve all our problems are pretty much gone - if they ever really existed.

Whilst drafting this post I came across a handful of articles from the likes of HBR and Computerworld somewhat labouring the point that the data science business function requires a full team. I completely agree, noting that it's always best to hire iteratively and size the team according to project scope and management buy-in.

For a start:

The team needs to be small, agile and focussed: 2-6 data scientists is ample, and they should be proven generalists, team-players and pragmatists able to cope with vague requirements, messy data and high failure rates (see below for project considerations). I like Forbes' opinion that the first hire(s) should "help get three things ready: your data; a clear problem to be solved; and a process to evaluate the business impact of any new solution".

  • Such highly-skilled people can be hard to find, but if you concentrate on the 'science' part of data science, it's possible to consider many candidates working in experimental science, industrial research & development and high-tech engineering - all disciplines which require a creative approach to learning from data
  • Ideally, some of these people will have experience within the company or at least strong experience in the industry - don't underestimate domain knowledge
  • The team also needs a well-respected sponsor within the organisation to help overcome failures, advertise successes and gain general buy-in from the board.

As the team grows:

The projects are likely to shift from one-off experiments into producing system-critical software, business process reengineering, tailored marketing & advertising, and reports for senior-management. Thus the team must grow and specialise accordingly, hiring for example:

  • data engineers / database administrators to source, clean and store the data - making it accessible, reliable, reusable and well-documented
  • computer scientists and software engineers to help scale algorithms to larger data sets, implement business rules or develop analytical applications
  • specialist statisticians and mathematicians to improve experiments and fine-tune algorithms
  • interface / graphic designers to help communicate insights
  • technical project managers to help organise the teams and deliverables
  • experienced technical people from other parts of the business - e.g. marketing or finance - who have organisational and domain knowledge

The steady state:

The data science function may grow to a significant size within the organisation, operating as a service to other departments and/or creating core features of the product and business processes.

The company may want to appoint a Chief Data Officer (CDO), leading the whole data science function and bridging the gap between the executive, financial, information and marketing leaders at board level.

Defining and operating projects

As I've mentioned before, a lot of practical data science looks like software engineering, and happily, there's a huge number of articles and established techniques discussing how to best manage these creative but technical projects.2 Our opinion is that any piece of research or development likely to last more than a few days and/or involve more than one person should be classified as a project, and should have:

  • A primary sponsor and a project leader, respectively responsible for commissioning and delivering the project. These may often be one and the same person, but it’s important to recognise the roles and thereby ensure that the project is well managed and the results used in the business

  • A well-defined goal that is specific, measurable, achievable, realistic and time-bound (SMART), and a written & agreed specification document, no matter how concise

  • Regular progress meetings throughout the project to validate and update the plan and scope, with full and frank communication between major stakeholders

  • Knowledge sharing upon completion - sharing lessons learned is very important but we often see this given low priority; formalised past learnings are incredibly useful for influencing future projects

  • Also consider maintaining a basic RACI matrix (responsible, accountable, consulted, informed) and a risks & issues register, so that problems can be addressed and resolved methodically.
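As a toy illustration of the RACI idea - the activities and roles below are entirely hypothetical - the matrix can be as simple as a lookup from each activity to the roles filling the four RACI positions:

```python
# Illustrative only: a toy RACI matrix for a hypothetical data science project,
# mapping each activity to who is Responsible, Accountable, Consulted, Informed.
raci = {
    "define success criteria": {"R": "project leader", "A": "sponsor",
                                "C": "domain experts", "I": "board"},
    "build predictive model":  {"R": "data scientist", "A": "project leader",
                                "C": "data engineer",  "I": "sponsor"},
    "deploy to production":    {"R": "software engineer", "A": "project leader",
                                "C": "data scientist",    "I": "operations"},
}

def accountable_for(activity, matrix):
    """Exactly one role should be accountable for each activity."""
    return matrix[activity]["A"]

print(accountable_for("build predictive model", raci))
```

Keeping exactly one 'A' per activity is the whole point: when an issue from the register needs resolving, there is never any doubt about whose call it is.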

Ensuring effective communication

Regular communication and team member visibility are also important, helping to ensure that the project stays on track and issues are spotted early. Useful communication techniques include:

  • Daily stand-up meetings strictly limited to 10 minutes or less, where each team member shares their immediate activities and issues with the rest of the team.3

  • An up-to-date communal task schedule - which team members are working on which activities and when: the Kanban methodology for visual project management demonstrates this well.

  • Simplified and centralised communications technology - try to move written discussions away from email and towards wikis, message boards, and group chat tools such as Slack

  • Protected time and space for data scientists / software engineers to get into a productive flow state, free from meetings and interruptions.

How to improve the Data Science Venn Diagram

According to its creator Drew Conway, the Data Science Venn Diagram could be updated for the modern day by adding communication as a separately defined, vital skill.

Systematising the data pipeline and analyses

Recently, I've noticed more articles and research papers written about systematising the data science function, such as the importance of automating workflows and dealing with technical debt when creating machine learning systems. This is a good sign: it means that data science really is maturing into an everyday, humdrum business function - and there are plenty of best practices out there for us all to follow, including:

  • Understand and map the data 'pipeline' - the path from raw data sources to refined insights, and the human and machine consumers of this information - this will help reduce redundant analyses, identify fragile data and standardise processes

  • Stop when the models are good enough - avoid diminishing returns and overcomplexity - get to v1.0 quickly and iterate thereafter according to the needs of the information consumer

  • Encourage a systematic, shared approach to the creation of all machine learning tools and analyses - with proper source control and documentation, code reviews, 'lunch and learn' seminar sessions, and regular refactoring of algorithms, applications and data preparation scripts where appropriate.
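As a minimal sketch of the pipeline-mapping point above - every stage name here is hypothetical - a simple dependency graph makes it easy to see which raw sources each insight ultimately relies on, which in turn helps spot redundant analyses and fragile single-source data:

```python
# Illustrative only: a toy data 'pipeline' map as stage -> upstream dependencies.
pipeline = {
    "raw_sales_db": [],
    "raw_web_logs": [],
    "cleaned_sales": ["raw_sales_db"],
    "sessionised_logs": ["raw_web_logs"],
    "campaign_report": ["cleaned_sales", "sessionised_logs"],
}

def upstream(stage, graph):
    """All stages a given output ultimately depends on, found recursively."""
    deps = set()
    for parent in graph[stage]:
        deps.add(parent)
        deps |= upstream(parent, graph)
    return deps

print(sorted(upstream("campaign_report", pipeline)))
```

Even a toy map like this answers practical questions quickly - for example, which reports break if a raw source changes schema, and which intermediate datasets are rebuilt by more than one team.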

In Summary

  • Start with a small team of capable generalists and work hard to define the business problems & success criteria, set timescales and understand & access the available data
  • Allow for and embrace failure, give data scientists time and space to research and experiment
  • Require a corporate sponsor with clout and encourage strong communication with the rest of the business
  • Specialise when necessary, automate where possible and embed into an ongoing cycle of development, maintenance and support.

We've covered a lot of ground here, so we'll likely revisit some of these topics in future posts and really get into the details.

  1. Regarding large datasets: currently, anything more than a terabyte (1 TB = 1000 GB = $10^{12}$ bytes) would be considered quite large, although petabyte-scale datasets are becoming more common (1 PB = 1000 TB). A terabyte can hold, for example, a collection of 250 million documents of roughly 500 words each, encoded as ASCII-format raw text files (~4 KB each).

  2. A brief primer reading list on software engineering and managing software development projects - drawing on articles, books and seasoned experience - ought to include: Joel Spolsky; Michael Lopp; Jeff Atwood; J.D. Meier; Steve Yegge; Seth Godin and Scott Berkun. I've linked to a single post from each, but they've all written extensively and covered many interesting topics - take a read through.

  3. Be careful with such meetings; keep the attendance specific to each team, keep the size small (<10 people) and avoid interrupting the day. I've seen such meetings become a useless protracted session for daily gossip and also used with the ulterior motive of enforcing employee 'core hours'.

Jonathan Sedar

Jon founded Applied AI in 2013. He's a specialist in Bayesian statistics and the application of machine learning to yield insights and profits throughout fintech, banking and insurance sectors.