Visualising data aids intuition and understanding, but high-dimensional datasets can be hard to display. Here we demonstrate techniques to tackle the issue.
When presented with a new dataset, it's standard practice for an analyst to summarise and plot the raw data in order to understand its distributions, value ranges, clusters and behaviours.
In fact, visualising information is vital throughout the entire analysis process - from those first investigations of the raw data, through data preparation, to interpreting the outputs of machine learning algorithms and presenting insights to the project team, sponsors and senior stakeholders. However, real-world datasets are often complicated, and those with high dimensionality are especially difficult to summarise and plot.
In this technical blog post I will demonstrate a recently developed manifold learning technique called t-Distributed Stochastic Neighbour Embedding (t-SNE), which allows us to map high-dimensional data to a low-dimensional representation for plotting. [1] Along the way I will also demonstrate techniques for plotting and summarising high-dimensional data, and for reducing the size of the dataset.
Analysis and reporting using the IPython Notebook
As noted in a previous post, we are big fans of 'reproducible reporting', i.e. combining code, analysis and commentary in a single document that can be executed by a third party to reproduce the results of the analysis.
I've created such a reproducible report in the IPython (Jupyter) Notebook rendered and embedded below. Please read through each section for an explanation of the various techniques and analyses and why I've used them. If you want to try it yourself, please feel free to clone and download the repo for the Notebook, accessible here on Bitbucket.
In the Notebook, we sourced and prepared a small but high-dimensional dataset and demonstrated:
- The difficulty of visualising the raw data and distributions in high dimensions
- The benefits of using feature reduction to detect and remove degeneracy and reduce redundancy
- The difficulty of visualising even the prepared data at 50 dimensions, because it simply has too many dimensions for straightforward 2D and 3D plotting
- The benefit of using t-SNE to create a low-dimensional representation of the high-dimensional data, and how the local structure is preserved such that class clusters can be easily seen
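The workflow summarised in the list above can be sketched in a few lines with scikit-learn. This is a minimal illustration rather than the code from the Notebook: the digits dataset is a stand-in for the real data, PCA is only one possible choice of feature reduction, and the values of `n_components` and `perplexity` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in dataset (an assumption, not the data used in the Notebook):
# 8x8 digit images flattened to 64 features, with class labels in y.
digits = load_digits()
X, y = digits.data[:300], digits.target[:300]

# Step 1: feature reduction. PCA is an illustrative choice here; it folds
# correlated or constant features into a smaller set of components.
pca = PCA(n_components=30, random_state=0)
X_reduced = pca.fit_transform(X)

# Step 2: map the reduced data to 2 dimensions with t-SNE for plotting.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X_reduced)

print(X_reduced.shape, X_embedded.shape)
```

Each row of `X_embedded` is a 2D coordinate, so a scatter plot coloured by `y` is a natural way to check whether the class clusters separate.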
t-SNE is a powerful technique that makes it simple and intuitive to visualise structure in high-dimensional data, and it is a great part of the everyday toolkit for data science projects. The standard implementation scales poorly at $O(n^2)$, but a more recent development using a Barnes-Hut approximation reduces this to $O(n \log n)$, and we have successfully and easily used t-SNE in a handful of projects with datasets far larger than the one here. For more information, please do read through Laurens van der Maaten's homepage, which includes explanations and links to the relevant academic papers and tech talks.
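In scikit-learn, the choice between the exact $O(n^2)$ computation and the Barnes-Hut approximation is exposed through the `method` parameter of `TSNE`. The sketch below runs both on a small synthetic dataset; the data and parameter values are assumptions chosen only to keep the example quick.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic dataset (an assumption for illustration only).
X, _ = make_blobs(n_samples=150, n_features=20, centers=4, random_state=0)

# Exact gradient computation: cost grows quadratically with sample count.
exact = TSNE(n_components=2, method="exact", random_state=0)

# Barnes-Hut approximation: `angle` trades accuracy for speed.
approx = TSNE(n_components=2, method="barnes_hut", angle=0.5, random_state=0)

emb_exact = exact.fit_transform(X)
emb_approx = approx.fit_transform(X)

print(emb_exact.shape, emb_approx.shape)
```

On a dataset this small the difference in run time is negligible; the approximation matters once the number of samples grows into the thousands.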
[1] As noted on the relevant scikit-learn page, manifold learning is a general name for non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many datasets is only artificially high and that the data can be mapped to a lower-dimensional manifold without large loss of information. Several other options for manifold learning exist, of course, and it may be that e.g. Locally Linear Embedding (LLE) or Isomap suits your data better than t-SNE.
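For readers who want to try the alternatives mentioned in this note, scikit-learn exposes Isomap and LLE through the same `fit_transform` interface as t-SNE. The following is a minimal sketch on a synthetic swiss-roll dataset; the data and the `n_neighbors` value are assumptions, not recommendations.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Synthetic 3D dataset lying on a rolled-up 2D sheet (illustrative only).
X, t = make_swiss_roll(n_samples=400, random_state=0)

# Isomap: preserves geodesic distances along the neighbourhood graph.
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# LLE: preserves each point's reconstruction from its nearest neighbours.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print(X_isomap.shape, X_lle.shape)
```

Plotting each embedding coloured by `t` gives a quick visual check of which method best unrolls the sheet.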