The practicing data scientist will be familiar with a wide range of software for scientific programming, data acquisition, storage & carpentry, lightweight application development, and visualisation. Above all, agile iteration, proper source control and good communication are vital.
As we outlined in the previous post, the discipline of data science exists at a confluence of applied mathematics, software engineering, information visualisation, storytelling and domain expertise. A lot of one's time will be spent at a computer and a lot of that time will be spent writing code, so it's critical to use the best tools available: powerful hardware, modern software technologies and established engineering methodologies.
Let's take a quick, high-level tour of some of the technical considerations:
The core equipment
Fast, flexible data analysis starts with excellent software. Over the past ten years, R and Python have become two of the most important core technologies in the data science toolbox[1]. Both are open-source programming languages with huge ecosystems of supporting code libraries and packages for data processing and statistical modelling.
R has grown organically from the statistics community and is widely used and praised for its rich set of industry-standard algorithms, publication-quality plotting, rapid prototyping and native functional programming. Whilst powerful, it is perhaps best thought of as an interactive environment for statistics with a high-level programming language attached. There's almost a tradition within academia of releasing an implementation of one's novel algorithm as a new R package which, coupled with R's inherently muddled syntax and culture of poor documentation, can make for a daunting initiation for newcomers and regular frustration for software engineers.
Python is a very popular general-purpose, high-level programming language with syntax that's considered intuitive and aesthetic, but a runtime that can be slow compared to compiled languages such as C++ and Java. The creation in 2005 of NumPy, a library for very fast numerical array and matrix computation, spurred the use of Python within the computer science and machine learning communities, who might traditionally have used MATLAB or C. In the years since, a wealth of best-in-class open-source libraries has been developed for data manipulation, efficient computation and statistical modelling. Coupled with Python's tradition of excellent documentation, well-maintained open-source code, strong developer communities, consistent syntax and an ethos of 'batteries included', this has made Python an increasingly default choice of key software for data scientists.
Numerical processing has of course been around for many years and there's a whole suite of legacy environments and languages available, including MATLAB, SPSS, Stata and SAS. These closed-source tools commonly have expensive licensing, surprisingly conservative development cycles[2] and reduced functionality when compared to open-source alternatives. The high economic barriers to entry limit the size of the user base, leading to fewer contributors, a smaller community and a reduced sense of ownership for practitioners.
A handful of large companies are further undermining the case for closed-source software by packaging, customising and selling enterprise-ready distributions of the above open-source tools, bundled with their own technical support, consulting and library extensions. Two such companies are Revolution Analytics (maker of Revolution R), recently acquired by Microsoft, and Continuum Analytics, which continues to make major contributions to the Python community.
Finally, just to mention MS Excel: we've all been through the pain of trying to use spreadsheets for something too complex. It's initially very tempting to 'use what you know', and businesses also often rely on Excel files as a primary datastore for accounts, marketing, reports and more. To put it simply, spreadsheets are the wrong tool for data analysis & predictive modelling and should be avoided wherever possible. In more detail, spreadsheets are a poor choice because:
- spreadsheets are stored in binary format and can't be easily used with source control to provide critical audit and versioning
- calculations are written individually to cells and not visible in bulk, thereby encouraging accidental bugs and making code maintenance and review very difficult
- calculations are performed upon specific cells, and this lack of variable substitution makes it difficult to test code using dummy data
- code and data are stored in the same file, risking the loss of everything in the event of a runtime or user error
- advanced users will typically end up writing VBA code once calculations become sufficiently complex. VBA is a procedural macro-scripting language very ill-suited to numerical computation and software engineering; it has a whole host of issues and is no longer actively developed by Microsoft. Resorting to VBA is a sure sign that your spreadsheet is too complicated and that it's worth investing time to tackle the problem properly.
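To make the points about variable substitution and testing concrete, here is a minimal Python sketch. It is not taken from any particular spreadsheet: the function name `gross_margin` and the figures are hypothetical. The calculation lives in one named function, separate from the data, so it can be reviewed in one place, stored as plain text under source control, and checked against dummy data.

```python
def gross_margin(revenue, cost):
    """Return gross margin as a fraction of revenue.

    Hypothetical example: in a spreadsheet this would be a formula such as
    =(B2-C2)/B2 copied down a column, one cell at a time.
    """
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return (revenue - cost) / revenue


# Dummy data, kept separate from the calculation itself.
dummy_rows = [
    {"revenue": 200.0, "cost": 150.0},
    {"revenue": 80.0, "cost": 20.0},
]

margins = [gross_margin(row["revenue"], row["cost"]) for row in dummy_rows]

# A simple check that can be re-run after every change to the code.
assert margins == [0.25, 0.75]
```

Swapping `dummy_rows` for real data requires no change to the calculation, which is exactly what cell-bound formulas make difficult.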
I just don't care about proprietary software. It's not "evil" or "immoral," it just doesn't matter. I think that Open Source can do better ... it's just a superior way of working together and generating code. - Linus Torvalds, Interview on GPLv2, 2007
How does Big Data fit into this?
"Big data" is often mentioned alongside data science and while there are certainly technical crossovers and shared goals, it's important to treat the subjects separately. The ability to store and efficiently manipulate a huge amount of data is incredibly useful, but 'big data analytics' often concerns itself simply with providing counts of events rather than statistical modelling, e.g. counting sales volume of an item by location by date, or counting page requests on a website.
NoSQL storage and map-reduce data processing have been around for a long time now; there are many ways to do it and many tools available. Hadoop, HBase, Cassandra, MongoDB, Redis, Riak, Redshift, BigQuery, Mahout, Spark etc. all have worthwhile use cases depending on the nature and volume of data to be stored and processed, and we won't go into them here.
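The counting workloads mentioned above are simple to express at small scale. As a rough sketch using only the Python standard library, the snippet below tallies sales volume by item, location and date. The records and field layout are made up for illustration; a map-reduce job performs the same per-key aggregation, only spread across many machines.

```python
from collections import Counter

# Hypothetical sales events: (item, location, date, units sold).
sales = [
    ("widget", "London", "2015-03-01", 3),
    ("widget", "London", "2015-03-01", 2),
    ("widget", "Leeds", "2015-03-01", 1),
    ("gadget", "London", "2015-03-02", 4),
]

# Sum the units sold for each (item, location, date) key.
volume = Counter()
for item, location, date, units in sales:
    volume[(item, location, date)] += units

for key, total in sorted(volume.items()):
    print(key, total)
```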
In a recent talk, Wes McKinney observed that the Python data science ecosystem still doesn't have a great story to tell about 'big data' and there's very much a need to interface well with high performance big data systems. We agree, but it's worthwhile to remember that one can gain deep insight and develop highly predictive models with only small-medium datasets. Intelligent surveying, balanced subsampling, advanced modelling and even simple human communication can often help solve the business issue without requiring us to process terabytes of mean averages.
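As one illustration of balanced subsampling, the sketch below randomly downsamples every class to the size of the smallest one, using only the Python standard library. The function `balanced_subsample` and the customer records are hypothetical, and this is just one simple scheme among several, not a prescription.

```python
import random
from collections import defaultdict


def balanced_subsample(rows, label_of, seed=0):
    """Return a subsample with the same number of rows from every class.

    Each class is randomly downsampled to the size of the smallest class.
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    n_per_class = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable

    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], n_per_class))
    return sample


# Hypothetical imbalanced data: 5 churners among 100 customers.
customers = [{"id": i, "churned": i < 5} for i in range(100)]
subset = balanced_subsample(customers, label_of=lambda row: row["churned"])
print(len(subset))
```

Here a 100-row dataset with a 5% minority class shrinks to 10 rows, 5 from each class, which is often enough to prototype a model before deciding whether heavier infrastructure is needed.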
"One only needs two tools in life: WD-40 to make things go, and duct tape to make them stop." - G. Weilacher
A lot of data science looks like software engineering
Statistical data processing can take a lot of horsepower, but will certainly also require a great deal of thought and human-computer interaction. Fortunately we now live in a world where memory and storage are fast and cheap, processors are multi-core and large high-resolution displays are available. Getting the right tools for the job is essential.
The data analytics function within any company needs to have excellent desktop hardware[3], and capital expenditure here will be rewarded many times over in improved speed and sophistication of computation, breadth of analysis possible, and depth of knowledge gained.
Smaller datasets and simpler algorithms may not pose difficulties when using your well-specced local machine, but when dealing with larger datasets[4] or complex models, it's wise to consider separate server hardware. As noted above, RAM and processing power are quite cheap these days, so building a powerful in-house server is reasonable. External cloud-based servers are worth considering for their ability to scale on demand, reducing capital outlay for short-term projects. That said, holding certain data outside the corporate firewall often requires a layer of legal arrangements and regulations that may make it unviable. We'll write about our approach to massive and efficient data anonymisation in a future blog post.
Source control has long been an integral part of software engineering and is naturally of vital importance in data science. As teams grow and models are increasingly implemented in production systems rather than one-off analyses, proper source control is critical to provide code versioning, code review, auditability, continuous integration testing and more. Even for one-off analyses undertaken by just one person, these standard working methodologies will preserve the code and may help greatly on the next project[5].
Distributed version control tools like Git and Mercurial are the way to go; they're powerful, widely supported and easy to integrate into the development process.
"The key word in 'data science' is not data, it's science" Jeff Leek, SimplyStatistics.org, 2013
Know your toolset
Good tools for data science provide a framework for discovering new insights and solving problems that were previously out of reach.
To recap the technical considerations:
- use open source tools with a strong community and solid implementations of basic and cutting-edge analytical techniques
- maintain a well-organised, version controlled code base with issue tracking and wikis
- strive for repeatability, code review, testing and audit
- use powerful local machines and consider scalable hardware where possible
- iterate quickly and try to simplify the problem before throwing more processing at it.
We'll no doubt elaborate on the above in future posts about the technical aspects of running a data science department and certainly when discussing particular examples of our work. For now, thanks for reading!
[1] Honourable mention must go to Julia and Clojure as alternative languages/environments for general statistical programming. Julia is a relatively new language with a MATLAB-like syntax and many optimisations, including just-in-time compilation, that make it incredibly fast for numeric computation. Clojure is a Lisp-like functional language that can run closely alongside Java on the JVM and thus scale easily to massive data processing tasks. They have found niches in the mathematical and computer science communities respectively, but both are still young and don't have anywhere near the breadth of packages and community uptake of R and Python.
[3] At Applied AI we're set up with hardware and software to allow us to easily work from anywhere with an internet connection. As an aside, we almost exclusively use software-as-a-service (SaaS) tools for internal business operations and, where suitable, use cloud-based virtual servers for on-demand data processing and modelling. This helps to keep the company lean and flexible, and we'll write more about all that in future.
[4] As a rule of thumb, it's reasonable to consider using a server when the dataset to be processed grows larger than 1/3 of the available memory in your machine. For example, a laptop may have 12GB free after the OS, so the rule-of-thumb 4GB would roughly correspond to a dataset of 40M rows x 12 numeric features, with each array element represented as a double-precision 8-byte float. This is not a particularly large dataset, and it's quite easy for the user or an algorithm to create copies or transformations of the data during processing, thereby consuming memory.
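The arithmetic behind this rule of thumb can be sketched in a few lines of Python. The helper names `dataset_size_gb` and `fits_rule_of_thumb` are ours, invented for illustration, and the estimate assumes a dense table of 8-byte floats with 1GB taken as 10^9 bytes; it ignores any overhead from the container holding the data.

```python
BYTES_PER_FLOAT64 = 8  # double-precision float


def dataset_size_gb(n_rows, n_features, bytes_per_value=BYTES_PER_FLOAT64):
    """Approximate in-memory size of a dense numeric table, in GB (10**9 bytes)."""
    return n_rows * n_features * bytes_per_value / 1e9


def fits_rule_of_thumb(n_rows, n_features, free_memory_gb, fraction=1 / 3):
    """True if the dataset is no larger than `fraction` of free memory."""
    return dataset_size_gb(n_rows, n_features) <= free_memory_gb * fraction


# The example from the text: 40M rows x 12 features on a laptop with 12GB free.
size = dataset_size_gb(40_000_000, 12)
print(round(size, 2))                               # 3.84
print(fits_rule_of_thumb(40_000_000, 12, 12))       # True: just under the 4GB threshold
print(fits_rule_of_thumb(50_000_000, 12, 12))       # False: 4.8GB exceeds it
```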