Feature Reduction using SVD

In this technical article we explain why and how to use Singular Value Decomposition (SVD) for feature reduction: making large datasets more compact whilst preserving information.

It's a common occurrence in data analysis to encounter a dataset that almost certainly contains some useful information, but also far too much low-quality information. Isolating the signal from the noise can be achieved in a variety of ways, including heuristic filtering, anomaly detection, and agnostic data shaping, a.k.a. feature reduction.

This latter technique is especially relevant to record-based data, where an item or event is recorded with many pieces of information, and it's likely that some of these features1 are not needed for prediction. This blogpost focusses on a technique called Singular Value Decomposition (SVD), which is often used for feature reduction.

As a starter-for-ten, let's imagine we have an environmental dataset containing a wide variety of measurements of land and plant life across a large national park.

  • If our task is basic navigation for hiking, we really only need two numbers: latitude and longitude. We can ignore the other features through simple heuristics (and why did we bring such a large dataset on a hike?)
  • However, what if we're now trying to classify chunks of land into various usage types? Here's where we would want to make use of all the available information recorded about an area; potentially hundreds of features.2

If, for example, we wanted to create a model to distinguish scrubland from forest, which features should we use from a dataset including:

  • height above sea level
  • slope angle
  • sunlight hours
  • wind speeds
  • air temperature
  • soil types
  • rock types
  • density of various plant types
  • density of various animal types
  • and so on.

Good feature reduction will ensure we use as much information as possible, whilst keeping our datasets manageable and our models simple.

Approaches for feature reduction include:

  • Using LASSO (L1-regularised regression) to perform constrained optimisation, automatically dropping features from a linear model3. This, of course, requires that we define a linear model, which may not be possible or appropriate
  • Using a Decision Tree (or a variant tree-like structure) to identify the features most often used for splitting the classes. This again is goal-seeking and requires that some of our data have a classification label or a sought value
  • Using a variance-maximising method such as Principal Component Analysis (PCA). This transforms the data into a new set of orthogonal components, ensuring that the first component aligns with the maximum variance in the dataset, each subsequent component aligns with the next-largest orthogonal variance, and so on.

PCA is a useful way to 'compress' data: the output components behave just like normal dimensions, but each is an orthogonal linear combination of the original features, preserving decreasing amounts of the original variance. Thus we can create an approximate description of the original dataset using only the first few principal components.

The conventional method for calculating PCA requires us to compute the full covariance matrix, so it carries a high memory cost and can be numerically unstable. Happily, it's possible to perform PCA using SVD, a purely numerical technique based on simple linear algebra, which is more stable and can be optimised for speed.
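That equivalence is easy to check numerically. The following minimal sketch (random toy data; all names are illustrative, not from the notebook) computes the component variances both ways, via the covariance matrix and via SVD of the centred data, and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # toy dataset: 100 samples, 5 features
Xc = X - X.mean(axis=0)                # centre each feature (required for PCA)

# Route 1: classical PCA via the full covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: PCA via SVD, never forming the covariance matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (Xc.shape[0] - 1)    # singular values -> component variances

print(np.allclose(eigvals, svd_vals))  # True: both routes agree
```

The right singular vectors (rows of `Vt`) match the covariance eigenvectors up to sign, which is why the SVD route can replace the covariance computation entirely.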

SVD is relatively straightforward, but the linear algebra can be a little tricky to fully grok. We use it quite often on projects and find it incredibly useful, so what follows here is a background of the linear algebra and a couple of worked examples to give you some intuition.

SVD Theory and Worked Examples using the Jupyter Notebook

As in previous posts, I've written up this technical post in the form of a Jupyter Notebook rendered and embedded below. Please read through each section for a thorough explanation of the various techniques and analyses, and why I've used them.

If you want to try it yourself, please feel free to clone the repo for the Notebook, accessible here on Bitbucket.

In Review

SVD is not a new technique and is classically used in several fields including signal compression and information retrieval. It is, however, very useful for all sorts of analysis projects where reducing data volumes is required.

We've stepped through worked examples in detail to show how we can decompose an arbitrary matrix $A$ with $m$ rows and $n$ columns into a set of related matrices:

$$A_{mn} = U_{mm} S_{mn} V_{nn}^{T}$$

We can use reduced versions of these matrices to recreate an approximated version of $A$, using fewer features $k < n$:

$$A_{mk} = U_{mm} S_{mk} V_{kk}^{T}$$

... this feature-reduced version $A_{mk}$ retains as much variance as possible from the original dataset $A$, whilst using fewer features. Thus we can choose to store less data and use simpler models.
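The truncation above can be sketched in a few lines of numpy (toy data; the rank `k` and matrix sizes here are arbitrary choices for illustration). Keeping only the top-$k$ singular triplets gives both the rank-$k$ approximation and the compact $m \times k$ representation, and the reconstruction error is exactly the energy in the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))            # arbitrary m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

# The compact representation itself: m rows, k columns
scores = U[:, :k] * s[:k]

# Frobenius-norm error equals the energy in the discarded singular values
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt((s[k:]**2).sum())))  # True
```

This error identity (the Eckart-Young theorem) is what makes truncated SVD the best possible rank-$k$ approximation in the least-squares sense.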

We've also shown how we can use functions in numpy and scipy to do all the heavy lifting for us. These functions are well tested and robustly engineered, and can efficiently handle many situations, including very large, sparse matrices.
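For the sparse case in particular, scipy provides a solver that computes only the leading singular triplets without ever densifying the matrix. A minimal sketch (the matrix size, density, and `k` are arbitrary illustrative values):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A large, mostly-empty matrix: 1000 x 200 with ~1% non-zero entries
A = sparse_random(1000, 200, density=0.01, format='csr', random_state=1)

# svds computes only the k largest singular triplets, never densifying A
U, s, Vt = svds(A, k=10)

# Note: unlike np.linalg.svd, svds returns singular values in ascending order
print(U.shape, s.shape, Vt.shape)   # (1000, 10) (10,) (10, 200)
```

For dense data of modest size, `np.linalg.svd` is the simpler choice; `svds` earns its keep when the full decomposition would not fit in memory.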

After looking into the fundamentals, we demonstrated using SVD for something slightly more interesting - though still small and simple to understand - compressing the information in the classic iris dataset to allow scatter plotting and simplified class distinction.
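That iris compression can be sketched as follows; I'm assuming scikit-learn's bundled copy of the dataset here purely for convenience (the notebook may load it differently). Projecting the centred data onto the first two right singular vectors gives 2-D coordinates suitable for a scatter plot:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                   # 150 samples x 4 features
Xc = X - X.mean(axis=0)                # centre before decomposing

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two right singular vectors -> 2-D scatter coordinates
X2 = Xc @ Vt[:2].T                     # shape (150, 2)

# Fraction of total variance retained by the two components
retained = (s[:2]**2).sum() / (s**2).sum()
print(retained > 0.9)                  # True: two components keep most of it
```

Even compressed from four features to two, the classes remain largely separable in the scatter plot, which is the point of the exercise.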

SVD is a powerful technique and I often find it very useful when investigating a new dataset or creating new models. Try it for yourself and feel free to get in touch to discuss anything in more detail.

  1. Like many, I tend to use the terms 'feature' and 'dimension' interchangeably. I take both to mean original measurements of some datapoint, which may or may not co-vary.

  2. See our earlier blogpost on Visualising High-Dimensional Data for a worked example using exactly this kind of land-classification data.

  3. LASSO can be a really useful technique and is still very much in use, see for example this recent paper.

Banner picture by Scot Campbell, used under Creative Commons 2.0 license.

Jonathan Sedar

Jon founded Applied AI in 2013. He's a specialist in Bayesian statistics and the application of machine learning to yield insights and profits throughout fintech, banking and insurance sectors.