*In the final article of this series we investigate the behaviours of different distribution distance metrics to let us automatically determine the scale of the change.*

In the previous articles we proposed and discussed a fairly simple model for measuring and monitoring the effect of a hypothetical business process change. The scenario has us monitoring the conversion rate of phonecalls into sales as a binomial process and using statistical inference to estimate the underlying conversion rate in the data, $\theta$.

We used Bayesian analysis to track the change in distribution of $\theta$ and discussed the need for measuring differences between two distributions, settling on a fairly simple metric of measuring the common area between two distributions as a way to quantify their 'closeness'.

There's a lot of prior work in this topic, but it can be unintuitive. The fundamentals of concepts like statistical distance and f-divergence are subtle, and I have a strong preference for modelling defensively (keep it simple and understandable before getting more complicated). That said, I'm reminded of the old chestnut "a week of hard work can save you an hour of thought"^{1} so now is a good time to dive in!^{2}

# A brief aside on the maths

I want to take this chance to make a comment on doing the maths using formulae, derivations and analysis versus doing it via computation.

Using good formulaic mathematical expressions is 'correct' and generally makes things much clearer, but can be a hurdle to understanding. Mathematics is a hugely powerful tool, and one I still love immensely, but there is no point in using it if you cannot keep track of what is going on.

In many of these cases, the different distributions have their own formulas for calculating the statistical distances between two distributions based on their parameter values, but that then becomes one more formula to learn.

The beauty of the computational approach is none of the above is required, and the calculations work for arbitrary distributions assuming they have the same support. Of course, the dependency of the measures to the parameters can be unclear, usually requiring a process of evaluation and plotting to tease things out.

I tend to use both approaches but often start computationally as it is often much quicker, less prone to error for me personally (my algebraic manipulation skills are not what they were) and I develop intuition faster.

I will readily concede that this may just be a crutch, and am certain that other people will prefer doing it differently.

# Statistical Distances and Probability Distributions

As noted on Wikipedia, a *metric* or *distance* is defined to be a function that defines a distance between each pair of elements of a set:

$$d: X × X \rightarrow \bf{R}^+$$

... where $\bf{R}^+$ is the set of positive real numbers.

The *distance* has the following properties:

- $d(x, y) \geq \space 0 \forall x, y \in X$
*(non-negativity)* - $d(x, y) = 0 \space iff \space x = y \forall x, y \in X$
*(identity of indiscernables)* - $d(x, y) = d(y, x) \space \forall x, y \in X$
*(symmetry)* - $d(x, z) \leq d(x, y) + d(y, z) \space \forall x, y, z \in X$
*(triangle inequality)*

... note that conditions (1) and (2) together produce *positive definiteness*

These concepts are quite technical, but very important, and mean that we are using values that correspond to our intuitive notions of 'distance'. This way, we can ground the abstract concept of distances between probability distributions in something we can understand.

We don't have to use measures that obey *all* of the above properties - many don't obey symmetry and that can be perfectly acceptable - but most measures I have come across are at least *positive definite*, obeying conditions (1) and (2) above.

## The 'Common Area' measure

This simple measure can be made using the trapezoidal area underneath the two distributions, and without going into too much detail it seems to be on a sound footing:

- It is non-negative
- If the two distributions are identical, it will have a value of 1
- Symmetry is observed, as $min(x, y) = min(y, x)$
- The triangle inequality is a bit more complex
^{3}.

Note that we'll use a slight redefinition to $1 - common_area$ so that as distributions get less similar, the value increases from 0 to 1.

## KL-Divergence

Kullback-Leibler divergence aka KL-divergence is also known as *relative entropy* or *information gain* and is part of a larger family of functions on distributions collectively referred to as *f-divergences*. It is defined:

$$D_{KL}(P||Q) = \int p(x) \ln \frac{p(x)}{q(x)} dx$$

$D_{KL}$ is not symmetric and does not obey the triangle inequality but is nevertheless a commonly used measure of difference. It is best thought of as the amount of information lost when the distribution $Q$ is used to approximate $P$.

An alternative interpretation is that it measures the amount of additional information required to 'correct' the creation of a signal from $P$ when using $Q$. Thought of this way, it makes sense that this value would not be symmetric in $P$ and $Q$.

## The Hellinger Distance

The Hellinger distance aka the Bhattacharyya distance is defined:

$$H^2(P, Q) = 1 - \int \sqrt{p(x) q(x)} dx$$

Again, this hits the limit 1 when the distributions are totally dissimilar, and usefully, has a closed-form expression for the Beta distribution as well, so that, for $B_1(\alpha_1, \beta_2)$ and $B_2(\alpha_2, \beta_2)$ we have:

$$H^2(B_1, B_2) = 1 - \frac{B(\frac{1}{2}(\alpha_1 + \alpha_2), \frac{1}{2}(\beta_1 + \beta_2))}{\sqrt{B(\alpha_1, \beta_1)B(\alpha_2, \beta_2)}}$$

# Evaluating Distances Between Distributions

Now that we have our measures of distance, we will look at how these values change in a controlled way. I want to see how these values behave as we move around the distributions, and to paraphrase the great Eldon Tyrell from Bladerunner: "I want to see a negative before I provide you with a positive".

We will use a Beta distribution with the $(\mu,K)$ parameterisation, so that we have $B(\mu K, (1 - \mu) K)$. Note $\mu$ is the maximum likelihood estimate for $\theta$, whilst $K$ represents the strength of belief in the estimate, with higher numbers meaning stronger beliefs and hence narrower densities around $\mu$.

We will fix $\mu$ at 0.1, but try $K$ at 6,000 and 7,000, representing the change in distributions after about two months worth of additional data (recall that we assumed about 500 calls per month). We expect that to be quite narrow, so will then try it for 12,000 calls, representing a full year of data with unchanged $\mu$.

```
### Create data for a small change, two additional months
mu <- 0.1;
K1 <- 6000;
K2 <- 7000;
x_seq <- seq(0.05, 0.15, by = 0.0001);
Beta1 <- dbeta(x_seq, mu * K1, (1 - mu) * K1);
Beta2 <- dbeta(x_seq, mu * K2, (1 - mu) * K2);
staticmu_1_plot <- qplot(x_seq, Beta1, geom='line', xlab=expression(theta)
,ylab='Probability Density',ylim=c(0, 150)) +
geom_line(aes(y=Beta2), colour='red') +
geom_area(aes(x=x_seq, y=pmin(Beta1, Beta2))
,fill='grey', alpha=0.5);
### Now suppose we add a full year's worth of data
K3 <- 12000;
Beta3 <- dbeta(x_seq, mu * K3, (1 - mu) * K3);
staticmu_2_plot <- qplot(x_seq, Beta1, geom='line', xlab=expression(theta)
,ylab='Probability Density', ylim=c(0, 150)) +
geom_line(aes(y=Beta3), colour = 'red') +
geom_area(aes(x=x_seq, y=pmin(Beta1, Beta3))
, fill='grey', alpha=0.5);
```

Observe the common area (shaded) is getting smaller, but not by a huge amount since the central point is not moving.

To make this a little more rigorous, we can calculate the three distance metrics for these three distributions. Since Beta3 is 'more dissimilar' to Beta1 than Beta2 is, we expect to see $d(Beta1, Beta2) < d(Beta1, Beta3)$:

```
### Create data for small and large changes
x_seq <- seq(0, 1, by = 0.0001);
mu <- 0.1;
K1 <- 6000;
K2 <- 7000;
K3 <- 12000;
Beta1 <- dbeta(x_seq, mu * K1, (1 - mu) * K1);
Beta2 <- dbeta(x_seq, mu * K2, (1 - mu) * K2);
Beta3 <- dbeta(x_seq, mu * K3, (1 - mu) * K3);
print(calculate_metrics(x_seq, Beta1, Beta1));
commonarea hellinger kl
4.44089e-16 4.44089e-16 0.00000e+00
print(calculate_metrics(x_seq, Beta1, Beta2));
commonarea hellinger kl
0.03729146 0.00148335 0.00626134
print(calculate_metrics(x_seq, Beta1, Beta3));
commonarea hellinger kl
0.1660940 0.0290278 0.1534966
```

Observe the results for `calculate_metrics`

:

Sanity check: the distance from Beta1 to Beta1 is zero in all three cases (disregarding slight rounding errors in the floating point arithmetic).

We see the expected behaviour: Beta3 is 'further' from Beta1 in all measures than Beta2 is from Beta1.

We have something more too: a sense of size for these metrics when we use our data to measure the distances.

Now let's try something a little different, what if we move $\mu$ from 0.10 to 0.11?

```
## NOTE: In order to draw a fair comparison to our data,
## the first 6,000 values are taken from the original value of mu,
## and then the later values are based on the updated value.
x_seq <- seq(0, 1, by = 0.0001);
mu1 <- 0.1;
mu2 <- 0.11;
K1 <- 6000;
K2 <- 7000;
K3 <- 12000;
Beta1 <- dbeta(x_seq, mu1 * K1, (1 - mu1) * K1);
Beta2 <- dbeta(x_seq, (mu1 * K1) + (mu2 * (K2 - K1))
,((1 - mu1) * K1 + ((1 - mu2) * (K2 - K1))));
Beta3 <- dbeta(x_seq, (mu1 * K1) + (mu2 * (K3 - K1))
,((1 - mu1) * K1 + ((1 - mu2) * (K3 - K1))));
beta_distance_011_plot <- qplot(x_seq[x_seq >= 0.075 & x_seq <= 0.125]
,Beta1[x_seq >= 0.075 & x_seq <= 0.125]
,geom='line') +
geom_line(aes(y=Beta2[x_seq >= 0.075 & x_seq <= 0.125])
,colour='red') +
geom_line(aes(y=Beta3[x_seq >= 0.075 & x_seq <= 0.125])
,colour='blue');
ggsave(beta_distance_011_plot, file='beta_distance_011_plot.png'
,width=14, height=10);
print(calculate_metrics(x_seq, Beta1, Beta1));
commonarea hellinger kl
4.44089e-16 4.44089e-16 0.00000e+00
print(calculate_metrics(x_seq, Beta1, Beta2));
commonarea hellinger kl
0.1552227 0.0197388 0.0864029
print(calculate_metrics(x_seq, Beta1, Beta3));
commonarea hellinger kl
0.559182 0.262379 1.818925
```

As expected, we are getting much larger differences across all three metrics.

For a bigger change in $\mu$, we would expect yet bigger changes:

```
x_seq <- seq(0, 1, by = 0.0001);
mu1 <- 0.10;
mu2 <- 0.15;
K1 <- 6000;
K2 <- 7000;
K3 <- 12000;
Beta1 <- dbeta(x_seq, (mu1 * K1),((1 - mu1) * K1));
Beta2 <- dbeta(x_seq, (mu1 * K1) + (mu2 * (K2 - K1))
,((1 - mu1) * K1 + ((1 - mu2) * (K2 - K1))));
Beta3 <- dbeta(x_seq, (mu1 * K1) + (mu2 * (K3 - K1))
,((1 - mu1) * K1 + ((1 - mu2) * (K3 - K1))));
beta_distance_015_plot <- qplot(x_seq[x_seq >= 0.075 & x_seq <= 0.150]
,Beta1[x_seq >= 0.075 & x_seq <= 0.150]
,geom = 'line') +
geom_line(aes(y = Beta2[x_seq >= 0.075 & x_seq <= 0.150])
,colour='red') +
geom_line(aes(y=Beta3[x_seq >= 0.075 & x_seq <= 0.150])
,colour='blue')
ggsave(beta_distance_015_plot, file = 'beta_distance_015_plot.png'
,width=14, height=10);
print(calculate_metrics(x_seq, Beta1, Beta1));
commonarea hellinger kl
4.44089e-16 4.44089e-16 0.00000e+00
print(calculate_metrics(x_seq, Beta1, Beta2));
commonarea hellinger kl
0.655281 0.360161 1.956383
print(calculate_metrics(x_seq, Beta1, Beta3));
commonarea hellinger kl
0.999673 0.998076 39.199406
```

Since it looks like I am going to be cutting and pasting this code a little more, I have wrapped the above functionality into a library routine `generate_beta_comparison_data()`

that takes the various `mu`

and `K`

values and produces the data.

This makes it much more convenient to look at all these values together, as well as also investigating a change from $\mu = 0.40$ to $\mu = 0.45$:

```
mu1_vec <- c(0.10, 0.10, 0.40, 0.40);
mu2_vec <- c(0.11, 0.15, 0.45, 0.50);
# Generate data for each combination of mu1, mu2 and do transformations
# to create a data.table
comparison_lst <- mapply(generate_beta_comparison_data, mu1_vec
,mu2_vec, SIMPLIFY=FALSE);
comparison_data_dt <- data.table(mu1=mu1_vec, mu2=mu2_vec
,t(sapply(comparison_lst
,function(x) c(dist12=x$dist12,dist13=x$dist13))));
comparison_metric_plot <- qplot(sprintf("(%4.2f, %4.2f)", mu1, mu2), value
,data = melt(comparison_data_dt, c("mu1", "mu2"))
,xlab = expression(paste("(", mu[1], ", ", mu[2], ")"))
,ylab = 'Distance Metric') +
facet_wrap(~ variable, scales='free') + expand_limits(y = 0);
ggsave(comparison_metric_plot, file='comparison_metric_plot.png'
,width=14, height=10);
```

Now that we have started to develop a feel for how this all works, we can now move on to working with our data.

# Evaluating Differences in the Posterior Distributions

We are now at a point where we can put this to work on our generated data `stochastic_rate_count_data_dt`

. As in the previous article, we start with a beta prior with $\mu$ and $K$ approximately equal to what we calculated prior to the change, $\mu = 0.1$ and $K = 6000$ (assuming a full year's worth of data so our prior is not too strong), and then we watch the posterior beta month to month and see how the metrics evolve.

```
postchange_data_dt <- stochastic_rate_count_data_dt[
rate_date >= as.Date('2014-01-01')
,list(rate_date
,a=cumsum(conversion_count)
,b=cumsum(month_count)-cumsum(conversion_count))];
postchange_plot_data_dt <- melt(postchange_data_dt[
,data.table(t(calculate_postchange_metrics(
init_mu=0.1, init_K=6000, a, b)))
,by=rate_date], 'rate_date');
postchange_plot <- qplot(rate_date, value, data=postchange_plot_data_dt
, geom='line', xlab='Date', ylab='metric') +
facet_wrap(~ variable, scale='free');
ggsave(postchange_plot, file='postchange_metric_plot.png'
, width=14, height=10);
```

Observe that it is pretty clear that something has changed, just by a simple plot of how the distance metrics are behaving.

## Realtime aka 'online' evaluation

Doing this in real-time, with sales data coming in month by month, is a little trickier, and I would be reluctant to reduce the answer to a binary outcome ("yes, it is now better"), but this approach could be taken if desired.^{4}

To produce such a binary statement, I would first do a more comprehensive exploration of the distance metrics, and develop a threshold beyond which I would consider the underlying value of $\mu$ to be considered 'changed', and once the metrics cross the threshold, declare success. A more cautious approach could involve requiring a few consecutive months (3 seems reasonable) with the metric breaching the threshold before being satisfied.

That said, my strong personal preference is to look at it on a more nuanced basis, tracking the data in real time, looking at how the metrics are behaving, and finally deciding what changes, if any, have occurred.

In our data, our plots seem to support a change in $\mu$, and the evidence is strong after a few months, one important behaviour being that the metrics are increasing in value as time passes.

## Measuring changes with $\mu = 0.40$

Finally, I would like to see how the metrics behave for a smaller relative change in the value of $\mu$, so we will look at the data for a transition from $\mu_1 = 0.40$ to $\mu_2 = 0.45$.

Recall that while the change was the same in value, the relative change was much smaller in the case, and was well within the standard deviation of the change in $\mu$ from month to month.

How do our metrics look in this case?

```
highbase_postchange_data_dt <- highbase_count_data_dt[
rate_date >= as.Date('2014-01-01')
,list(rate_date, a=cumsum(conversion_count)
,b=cumsum(month_count)-cumsum(conversion_count))];
highbase_postchange_plot_data_dt <- melt(highbase_postchange_data_dt[
,data.table(t(calculate_postchange_metrics(
init_mu=0.4, init_K=6000, a, b)))
,by=rate_date], 'rate_date');
highbase_postchange_plot <- qplot(rate_date, value
,data=highbase_postchange_plot_data_dt
,geom='line', xlab='Date', ylab='metric') +
facet_wrap(~ variable, scale='free') +
expand_limits(y=0);
ggsave(highbase_postchange_plot, file='highbase_postchange_metric_plot.png'
,width=14, height=10);
```

Again, this is much stronger than I expected! The data is noisier, and is no longer an increasing function of time, but the trend is very strong and unambiguous. It appears that even with relatively small changes in $\mu$, this approach does a fine job of filtering out noise.

# Wrapping Up and Future Work

In this series I have only really scratched the surface of this topic, but here is a good place to stop: most of the key concepts have been introduced and discussed. Working on a real world dataset would be very interesting, using our new intuition for behaviour under known conditions. I am very curious to see how the noise of real data manifests in the model.

In all of this work, we have not strayed from using simple conjugate distributions: there was no need to do otherwise, and it has the considerable benefit of making the calculation of the posterior effectively instantaneous.

Relaxing this to allow for hierarchical models say, tracking different conversion rates for different call types, for example, or for salespersons or call centres, we need to use a sampler to calculate the posterior, such as Stan, JAGS or PyMC.

If we stick with our simpler model of a single conversion rate, we could work more on the how we 'decay' our data. Rather than recursively calculating the posterior by using previous posteriors as the new prior, we could start with a fixed prior that is uninformed, and use a rolling window of data points to calculate our posterior.

We started this approach when we used a fixed $\mu_0 = 0.1$ and $K = 6000$ but extensions to this approach would be interesting.

# Final Thoughts

This blog series started as just a quick conversation with a colleague that stuck in my head as interesting. A simple model, yet you can push it quite far and obtain some very interesting insights.

There is much more to be discussed here, and I will probably return to this topic in the future, but for the moment, this is a good place to stop.

As always, if you have any comments, criticisms or corrections, please let me know! I would also be very grateful if you know of any datasets that would be amenable to this approach - I would love to see how well it works.

This seems to be attributed to the father of Greg Wilson, a noted software developer, although I've seen this is several forms including "a week of engineering can save an hour of hard maths" etc. ↩

Not to mention the fact that I love teaching new stuff to people as that then forces me to learn it rather than adding it to the ever-growing list of "stuff I really should read and use some time" ↩

I have done a little bit of work on looking at the triangle inequality for this - see Appendix. It is interesting, and would be useful, but it also seems a little unnecessary for what we are trying to do. If I am wrong on this point, please let me know; I am not precious about my modelling and prefer to be corrected than incorrect. ↩

I generally dislike absolute statements that binary outcomes imply, especially in situations like this, as the real world is usually much more continuous and uncooperative towards such things. ↩

# Appendix: The Common Area Metric, the Beta Distribution and the Triangle Inequality

Here I'll quickly discuss a technical aside on the validity of the common area metric for the Beta distribution, and whether or not it obeys the triangle inequality. Please note this doesn't constitute a rigorous proof, but should give us good grounding and intuition.

For this discussion I am defining the common area metric $A_{xy}$ between two beta distributions $B_x$ and $B_y$ as:

$$ A_{xy} = 1 - \int^1_0 min(B_1, B_2) d\theta $$

For the triangle inequality to hold, we need $A_{12} + A_{23} \geq A_{13}$:

$$ A_{12} + A_{23} = 1 - \int^1_0 min(B_1, B_2) d\theta + 1 - \int^1_0 min(B_2, B_3) d\theta $$

Since the sum of two integrals is the integral of the sum, we have:

$$ A_{12} + A_{23} = 2 - \int^1_0 \left( min(B_1, B_2) + min(B_2, B_3) \right) d\theta $$

For $A_{13}$, we have:

$$ A_{13} = 1 - \int^1_0 min(B_1, B_3) d\theta $$.

So, for the triangle inequality to hold, we need $A_{12} + A_{23} \geq A_{13}$, and so, $A_{12} + A_{23} - A_{13} \geq 0$. Thus, we have:

$$\begin{eqnarray} A_{12} + A_{23} - A_{13} &=& 2 - \int^1_0 \left( min(B_1, B_2) + min(B_2, B_3) \right) d\theta - 1 + \int^1_0 min(B_1, B_3) d\theta \\ &=& 1 - \int^1_0 \left( min(B_1, B_2) + min(B_2, B_3) - min(B_1, B_3) \right) d\theta \\ \end{eqnarray}$$

So, the triangle equality holds if the integral on the last line is bounded above by 1. My intuition from geometrical reasoning is that it does hold, and we now have a basis from which to prove this in greater detail if required.