In this series of articles we'll discuss the use of probabilistic modelling to help efficiently evaluate changes to business processes and the success of marketing campaigns.

A common business problem is to evaluate the effect of a change of process. A bottleneck has been identified, a process has been improved, or some needless work has been eliminated and you now expect that things are going to get better.

So how do you measure such a thing? If at all possible it's good to be able to quantify the improvement - ideally in real-time - so that the feedback loop can be accelerated. Bayesian statistics provides a very natural approach.

Discussion of the problem

For the sake of simplicity, I'm going to assume the process change we're making has a binary outcome1. Let us presume the problem is one of sales call conversions: we are trying to measurably improve the conversion rate of sales calls from initial contact to product sale. Whilst there's potential for added complexity - accounting for the type, size, location of the sale etc - we will start with a simple model. We don't need to get any more sophisticated in order to allow for ample complexity and nuance!

I have a small pet peeve about articles of this nature, which I'll try to address here. Most of these techniques and methods are very general and when discussing them the use-case is often skipped as "it doesn't matter, this approach works for any case where the above conditions apply." While technically true, this attitude always irritated me for two reasons:

• Firstly, it ignores the most important aspect of any modelling problem: domain knowledge. Different problems may suggest similar approaches, but the various idiosyncrasies of the problem will affect things in often subtle ways, which I don't like to ignore

• Secondly, if the illustration or demonstration of a technique or model is more general but lacking in specifics, it is more difficult for me to add that specificity than it is to just switch contexts.

To that end, I am going to assume that we have internal systems that monitor our sales calls, tracking the conversion rate of those calls into actual sales. Our data is summarised monthly, and so our data is in the form of a time series of numeric values that range from 0 to 1.

At a known point in time, the internal processes were improved, eliminating a bottleneck that allowed the firm to respond faster to requests for quotes and we now wish to implement a method for tracking how effective that change is at converting sales calls.2

Note that this problem is subtly different from change-point analysis since we have no ambiguity about the location of the change - we know exactly when it occurred.

Generating the data

There is one major hurdle with illustrating any example scenario - lack of realistic data. However, with computation so cheap, that's a simple to deal with: we'll just generate our own. In fact, when dealing with an unknown problem or technique, it can be very useful to generate your own data with known parameters in order to test properly3.

Again, we start simple. We assume our process has a single probability before the change point, and a new probability level after it. We also assume that a constant number of new calls comes through each month; we'll further extend this to a random number of new calls and investigate how this additional level of randomness affects the analysis.

Our initial conversion rate is modelled as a normal distribution with mean $\mu_0$ and standard deviation $\sigma_0$. After the change point, the mean is $\mu_1$ and standard deviation $\sigma_1$. To start with, we will assume that only the mean conversion rate changes: $\mu_0 \neq \mu_1$ and $\sigma_0 = \sigma_1$. We will create a function in R to generate the synthetic data and then check the output to ensure it looks the same.

Bayesian Analysis and conjugate distributions

Bayesian analysis extends the concept of Bayes' Rule, which is a way of calculating conditional probability. Suppose we have two events $A$ and $B$, and we want to know the probability of $A$ occurring, given that we know that $B$ has occurred:

$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

The Bayesian approach is fashionable of late, due to the recent convergence of improved theory and computational power.

In 'classical' statistics4 we perform inference mainly by producing point estimates of the values (or parameters) of interest, and calculate standard errors around that point estimate. Confidence intervals are produced, and tests of statistical significance supplied for good measure.

The Bayesian approach is different. For inference we start with a 'prior distribution' for the values, use a likelihood model to calculate probabilities for the observed data, and then combine the two to produce a 'posterior distribution' for the parameter.

In Bayesian data analysis we extend this concept to estimating the value of an unknown parameter $\theta$. We know we have the observed data $D$, and we can state a hypothetical likelihood $p(D|\theta)$ of seeing this data given the value of $\theta$. What we want is the other way around, $p(\theta|D)$, the probability of getting a value of $\theta$ given the data $D$ - our posterior distribution. In the continuous case, this becomes an integral:

$$p(\theta | D) = \int p(D | \theta) p(\theta) d\theta.$$

In the above equation, we integrate the prior distribution $p(\theta)$ and the likelihood $p(D | \theta)$ to find the posterior distribution $p(\theta | D)$. The Bayesian approach may initially seem odd but is a very natural way of estimating probability, taking into account your prior belief of an event and the observed data in a balanced way.

Calculating the integral is non-trivial, and we'll certainly post more on the subject of sampling and estimating the posterior distribution in future. An accompanying standard approach to make the integrals more tractable is to ensure that the posterior distribution $p(\theta | D)$ has the same functional form as the prior $p(\theta)$; the data simply updates the parameters of the distribution.

Before we continue, let's ask the question: why use the Bayesian approach?

I like it for a few reasons:

• By setting a prior distribution, you have a lot of freedom to capture your initial presumptions for the value of the parameter.
• For simple models, the use of conjugate priors for the likelihood make the calculations straightforward.
• The output is the posterior distribution, meaning you naturally obtain estimates for the uncertainty of the results of your analysis. Calculating standard errors are not necessary.

The Beta distribution

In our process change example, we use a standard binomial model: each sales phone call is considered a 'trial', and we will estimate the underlying probability of the phone call being converted into a sale. This probability is denoted $p(\theta)$ and we will make $n$ phone calls per month.

If we notate a successful phone call conversion as $y=1$ and unsuccessful as $y=0$, our likelihood $p(y|\theta)$ of seeing a successful conversion given $\theta$ is:

$$p(y|\theta) = \theta^y (1 - \theta)^{1-y}$$

It's then reasonably simple to extend the above for $n$ calls with $k$ successes:

$$p(k|\theta) = \left(\frac{n}{k}\right) \theta^k (1 - \theta)^{n-k}$$

The Beta distribution is the conjugate prior for the binomial likelihood. It is parametrised by two values $\alpha$ and $\beta$, corresponding roughly to the count of successes and failures respectively5.

So, for a given Beta prior, $B(\alpha, \beta)$, and data that consists of $n$ trials with $k$ successes, our posterior distribution for $\theta$ is

$$p(\theta|D) = B(\alpha + k, \beta + n - k)$$

Most of the well-known probability models have conjugate distributions.

Bayesian monitoring of the conversion rates

The initial approach assumes a constant rate of phone calls each month, so in each month we have a constant count of binomial trials. We also start with a constant success rate each side of the change point. Thus, the only source of randomness comes from the outcome of the sequence of binomial trials. We then use this data to see how good our approach is in 'rediscovering' our synthetic data.

If we use a beta distribution as our prior it is conjugate to the likelihood, which makes it simply a matter of updating the $\alpha$ and $\beta$ terms with our success and fail counts to get the posterior.

In practice, we would do this every month, but this makes the plots quite unwieldy, so we will aggregate up to yearly data and plot that.

NOTE: All of the code in this article series is available in a git repository on Bitbucket. Please do ask us for access.

Let's look at some code:

set_conversion_rate_data_dt  <- generate_process_rates(mu0 = 0.10, mu1 = 0.15,
sd0 = 0, sd1 = 0);
set_conversion_count_data_dt <- generate_counts(set_conversion_rate_data_dt,
month_count = 500)

qplot(as.Date(rate_date), rate, data = set_conversion_count_data_dt,
geom = 'line', ylim = c(0, 0.2),
xlab = 'Date', ylab = 'Set Conversion Rate')


set_conversion_yearly_data_dt <- generate_yearly_data(set_conversion_count_data_dt)

qplot(theta, prob.dens, data = set_conversion_yearly_data_dt,
geom = 'line', colour = data_year, xlim = c(0.05, 0.15),
xlab = expression(theta), ylab = "Probability Density")


Looking at this plot, it seems pretty clear that for the first few years our posterior value for $\theta$ converge on $0.10$. This is the value we set when creating the data, so the model passes the first sanity check. For 2014 and 2015 (for which we only have 3 months of data) we can see the posterior curves for $\theta$ immediately starting to move towards $0.15$, the other value we set for the synthetic data, so the second sanity check is passed.

Adding extra noise to the data

Unfortunately, real life is nowhere near as neat as above, so the first thing we will try is adding randomness to the count of phone calls made each month. We would normally experiment with a range of different distributions, but for brevity, but I will start with a Poisson distribution6 with $\lambda = 500$:

monthly_count_rate_data_dt  <- generate_process_rates(mu0 = 0.10, mu1 = 0.15,
sd0 = 0, sd1 = 0);

monthly_count_count_data_dt <- generate_counts(rate_data_dt,
month_count = rpois(dim(rate_data_dt)[1], lambda = 500));

qplot(as.Date(rate_date), conversion_rate, data = count_data_dt,
geom = 'line', ylim = c(0, 0.2),
xlab = "Date", ylab = "Conversion Rate")


As you can see from the conversion rates, they are now a random signal around the mean levels but it is pretty obvious that the rate has changed. Plotting the posterior beta, we see a similar pattern as before, with the first group of years being centred around $0.10$ and then moving up towards $0.15$ after 2014.

yearly_data_dt <- generate_yearly_data(count_data_dt)

qplot(theta, prob.dens, data = yearly_data_dt, geom = 'line',
colour = data_year, xlim = c(0.05, 0.15),
xlab = expression(theta), ylab = "Probability Density")


Next steps

In the next article we are going to extend this model further, adding the random underlying conversion rate to the data generation process.

We also want to investigate how the time structure plays out: how does the size of the change (not to mention changes in variance) affect the time taken to see the change?

We also need to discuss the appropriateness of using all the data before the change in determining our prior distribution for $\theta$.

1. Most articles deal with binary outcomes and coin tosses since they are simple to grasp and reasonably simple to analyse mathematically, but to paraphrase a friend I once discussed this with, I do intend to graduate from coins to dice sometime soon.

2. The initial draft of this article did not specify the nature of the internal change made since "it didn't really matter". I am aware of the irony of such thinking given my above point...

3. This will be the topic of a future post.

4. The Bayesian approach and interpretation has been around for as long as probability theory itself, but was largely ignored until very recently due to the computational difficulties in using it well.

5. The Beta distribution is a conjugate prior for many distributions within the exponential family of distributions, and is frequently used in Bayesian statistics.

6. The Poisson distribution is a common default choice when modelling time-to-event data when the events are expected to occur independently and with a normal-distributed (Gaussian) average rate.

Cover image by Matt Buck