In the second article of this series we continue our technical discussion on using Bayesian analysis to probabilistically estimate the effect of a business process change.
In the previous article we established the problem of how to measure and monitor the effect of a change in business process. Our example is to track the conversion rate of sales calls into sales.
Starting with a very simple model and using a standard Bayesian updating approach, we iterated on it by adding extra levels of random noise, and monitored what effect this had on the posterior distributions for $\theta$, the conversion rate.
In all of the above treatment, we assumed a fixed conversion rate that had a discrete jump across the change, with the randomness in the data coming from the sample from the binomial trials, and then from the random count of sales calls made each month.
Before we continue, it is worth pursuing a quick digression into modelling this variation. Looking at the observed conversion rates, there is already randomness in the series, we are not getting two horizontal lines with a jump at the change point, but rather a single series that clearly shows two different mean values either side of that point.
So, from a conceptual point of view, is there any value in modelling another layer of randomness on top of this? Put another way, our planned next step is to first sample an underlying $\theta_i$ for each month $i$, then use that $\theta_i$ as the probability of the $n_i$ binomial trials for that month (and remember we also have a random process for the $n_i$'s). Is there any reason to do this? Would it not be simpler to somehow roll that randomness (if required) into either the binomial sampler, or the call count sampler?
These are all good questions, and the answers are all rather subjective.
I can say that it is probably worth trying to add more noise to our data. Our generated data for the conversion rates at the end of part 1 looked far too clean to me, so I think it is definitely worth trying to add more noise.
Also, I think it is worthwhile trying to 'decompose' the randomness into different layers. It makes it easier to understand, and it makes the problem more tractable in terms of data generation. We could probably do some investigation into using more sophisticated and generalised probability distributions that allow for more degrees of randomness, but that also adds a lot of complexity, reduces my ability to apply intuition, and incurs search costs too, requiring time to investigate and understand such distributions.
But the short answer to all of the above is: I do not know - this is why being able to design and evaluate demonstrations like this is so important!
Using a Stochastic Conversion Rate
So, we can re-use the earlier code, but this time with non-zero variance for the 'true' conversion rate:
set.seed(42); stochastic_rate_data_dt <- generate_process_rates(mu0 = 0.10, mu1 = 0.15, sd0 = 0.02, sd1 = 0.02); stochastic_count_data_dt <- generate_counts(stochastic_rate_data_dt, month_count = rpois(dim(stochastic_rate_data_dt), lambda = 500)); qplot(as.Date(rate_date), conversion_rate, data = stochastic_count_data_dt, geom = 'line', ylim = c(0, 0.2), xlab = "Date", ylab = "Conversion Rate");
So, as you can probably tell from the new plot of the conversion rate, there is a lot more variation, so much so that it is much less clear where the change occurs.
Like before, we then aggregate the data yearly and then plot the posterior betas:
stochastic_yearly_data_dt <- generate_yearly_data(stochastic_count_data_dt); qplot(theta, prob.dens, data = stochastic_yearly_data_dt, geom = 'line', colour = data_year, xlim = c(0.05, 0.15), xlab = expression(theta), ylab = "Probability Density");
We can see that the output of this is much noisier: the progression of the posterior betas from $0.10$ to $0.15$ after Jan 2014 is much less pronounced than before. This is not surprising, I would expect that additional randomness in the data reduces the clarity of the analysis.
Taking a Step Back
So we have some data, and we used that to create a sequence of Beta distributions tracking how our inference on $\theta$ has evolved over time.
However, at this point, we are going to drastically simplify our approach. Instead of building more complex Bayesian models, what does the yearly conversion rate look like?
It is easy to do, data.table makes it easy to create these summaries:
stochastic_empirical_rate_dt <- stochastic_rate_count_data_dt[, list(conversion = sum(conversion_count), month = sum(month_count), rate = sum(conversion_count) / sum(month_count)), by = list(year = as.numeric(format(rate_date, "%Y")))]; stochastic_empirical_rate_plot <- qplot(year, rate, data = stochastic_yearly_rate_dt, geom = 'line', ylim = c(0, 0.2));
Looked at this way, the change in underlying conversion rate seems much more clear cut, so why the discrepancy?
There is no contradiction. Think about what we were doing. Before the change, we were accumulating more and more data for a conversion rate of $0.10$ - and each year we were using the posterior distribution as the prior for the following year.
As a result, the prior was getting stronger and stronger, so after the change the new data is not able to affect the posterior as much. This latter fact on its own would be a dreadfully poor reason to take a new approach. In most cases, a strong prior is that way for a reason, requiring a bigger body of evidence (new data) to move it.
But in this case, is it right to use all the accumulated data as the prior for $\theta$ after the change point? The whole point of this issue is that something has changed, so it appears wrong to have such strong assumptions about the value of $\theta$.
Worse, none of this is due to the addition of randomness. It was a flawed idea from the beginning to use all the data prior to the change as an accumulated prior for inference on $\theta$.
That said, it is probably wrong to ignore all that data.
... So what to do?
What if, rather than rolling the data forward and using it to create the prior we will use to assess the effect of the change point, we instead use a different prior?
Determining the prior for the post-changepoint data
Going back to the data we see:
> stochastic_rate_count_data_dt[rate_date < as.Date('2014-01-01'), list(converted = sum(conversion_count), calls = sum(month_count), rate = sum(conversion_count) / sum(month_count))]; converted calls rate 1: 2381 23862 0.0997821
So, looking at all the data points before 2014, we see there were just under 24,000 calls, and just under 2,400 were converted, giving an overall conversion rate of something close to $0.10 == 10\%$.
Rather than using all the data, we will stick with the assumption of the value not changing, but with a weak prior and then see how it evolves as data comes in.
An alternative way of parametrising the Beta distribution is $B(\mu K, (1 - \mu) K)$ instead of $B(a, b)$ - so think of it in terms of the underlying rate $\mu$ and the number of trials that was performed $K$. The higher the value of $K$, the stronger our belief in $\mu$. In contrast, the $B(a, b)$ means we make $a + b$ trials, with $a$ successes: $a = \mu K$ and $b = (1 - \mu) K$.1
With that in mind, our post change prior is going to have $\mu = 0.0997$ (the average rate from our data) and we now have a choice for different values of $K$.
Firstly, I am going to try the average number of calls per year as my value for $K$. Since this data was generated from a Poisson distribution for each month with a mean value of 500, we would expect this number to be about 6,000, but it is good to do a sanity check and ensure this is the case:
> stochastic_rate_count_data_dt[rate_date < as.Date('2014-01-01'), list(converted = sum(conversion_count), calls = sum(month_count), rate = sum(conversion_count) / sum(month_count)), by = list(year = as.numeric(format(rate_date, "%Y")))][, mean(calls)];  5965.5
Using 6,000 seems pretty reasonable.
x_seq <- seq(0.075, 0.125, by = 0.0001); mu <- 0.0097; K <- 6000; prechange_prior_plot <- qplot(x_seq, dbeta(x_seq, (mu * K), (1 - mu) * K), geom = 'line', xlab = expression(theta), ylab = 'Probability Density'); ggsave(prechange_prior_plot, file = 'prechange_prior_plot.png', height = 14, width = 10);
Visualising this distribution, we see that most of the probability is concentrated between $0.09$ and $0.11$, so this seems to match our intuition.
Using this prior we can then restart the process of estimating our posterior using successive monthly data and see how quickly we see an effect:
sixmonths_data_dt <- stochastic_rate_count_data_dt[rate_date >= as.Date('2014-01-01') & rate_date <= as.Date('2014-06-30'), list(rate_date, a = cumsum(conversion_count), b = cumsum(month_count) - cumsum(conversion_count))]; sixmonths_newparams_dt <- sixmonths_data_dt[, list(rate_date, new_a = (mu * K) + a, new_b = ((1 - mu) * K) + b)]; sixmonths_plotdata_dt <- sixmonths_newparams_dt[, generate_beta_plot_data(new_a, new_b), by = rate_date]; # Create a character column for the date to help with plotting sixmonths_plotdata_dt[, plotdate := format(rate_date, '%Y%m%d')]; sixmonths_plot <- qplot(theta, prob.dens, data = sixmonths_plotdata_dt[theta >= 0.05 & theta <= 0.15], geom = 'line', colour = plotdate, xlab = expression(theta), ylab = 'Probability Density'); ggsave(sixmonths_plot, file = 'sixmonths_posterior.png', width = 14, height = 10);
The plots of the posterior distributions makes it very clear that the effect of the change becomes apparent almost immediately, and is moving towards $0.15$, as we would hope.
Much more promising, but there still is the nagging issue of how this behaves as the values change: our current set up does involve a very large relative change in signal. The conversion rate is moving from $0.10$ to $0.15$, which is a change of 50%.
What would happen if we started at a conversion rate of $0.40$ and moved to $0.45$; would such a change be much harder to detect?
Also, while all this is interesting, how realistic is this analysis?
We have the benefit of knowing our answers, so we can at least interpret what we are seeing somewhat coherently. How do we use any of this in the real world where we just have a USB key containing a bunch of CSV files with column titles like "n_Tom_Th"?
Well, the big part of the last question is an aspect of this I am not dealing with at all. Data cleaning is a crucial aspect of what we do, but it is so big and specific it will probably get a series all to itself at some point2, but the other questions we can try to work on.
Investigating Size of Change Effects
So what if the original underlying rate was $0.40$ and moved to $0.45$? Will the change be as easy to spot as before? Since it is easy to generate the data with the routines we have written, we can do that and see what happens:
highbase_rate_data_dt <- generate_process_rates(mu0 = 0.40, mu1 = 0.45, sd0 = 0.08, sd1 = 0.08); highbase_count_data_dt <- generate_counts(highbase_rate_data_dt, month_count = rpois(dim(highbase_rate_data_dt), 500)); plot_dt <- melt(highbase_count_data_dt[, list(rate_date = as.Date(rate_date), underlying_rate, conversion_rate)], id.vars = 'rate_date', variable.name = 'rate_type'); highbase_rate_plot <- qplot(rate_date, value, data = plot_dt, geom = 'line', ylim = c(0, 0.8), colour = rate_type, xlab = 'Date', ylab = 'Stochastic Conversion Rate'); ggsave(highbase_rate_plot, file = 'highbase_rate.png', width = 14, height = 10);
Note that I also upped the size of the standard deviation so that the amount of noise in the data relative to the original value stays the same.
So this simple change makes the picture much less clear!
Thinking about it, while disappointing, it is not surprising. The change in mean has gone from 2.5 standard deviations away to being less than 1 standard deviation3 so we would expect the underlying change to be much more obscured by the variation.
Looking at the line plot, it is now much harder to see any change in the underlying conversion rate. We will press on though, it is our modelling will reveal something our eyes cannot see.
Once again, we take the overall historical conversion rate as our $\mu$, we use a $K$ value of 6,000 (representing about a year's worth of calls, and check how our posterior evolves over time each month.
high_mu <- highbase_count_data_dt[rate_date < as.Date('2014-01-01'), sum(conversion_count) / sum(month_count)]; high_K <- 6000; highbase_data_dt <- highbase_count_data_dt[rate_date >= as.Date('2014-01-01') & rate_date <= as.Date('2014-06-30'), list(rate_date, a = cumsum(conversion_count), b = cumsum(month_count) - cumsum(conversion_count))]; highbase_newparams_dt <- highbase_data_dt[, list(rate_date, new_a = (high_mu * high_K) + a, new_b = ((1 - high_mu) * high_K) + b)]; highbase_plotdata_dt <- highbase_newparams_dt[, generate_beta_plot_data(new_a, new_b), by = rate_date]; highbase_plotdata_dt[, plotdate := format(rate_date, '%Y%m%d')]; highbase_plot <- qplot(theta, prob.dens, data = highbase_plotdata_dt[theta >= 0.35 & theta <= 0.50], geom = 'line', colour = plotdate, xlab = expression(theta), ylab = 'Probability Density'); ggsave(highbase_plot, file = 'highbase_posterior.png', width = 14, height = 10);
Okay, I admit it, I did not expect this! Despite all the noise in the time series, it looks like our Bayesian posterior shows a progression towards a higher $\mu$ value almost immediately. This is reassuring, as I expect a real-world problem to have data at least as messy as this data set, so I am now more confident that this approach could help detect a change.
Time to take the training wheels off and go back to the original problem: We have data on the conversion of sales calls into actual sales, and on Jan 1, a new process was implemented internally to speed up responses to queries. It is expected that our conversion rate will improve, and we want to measure if this expectation is met, and if so, by how much?
It is most likely that such a change will be monitored in real-time, and with no realistic target for the new conversion rate beyond "better", so how do we deal with this?
Since we do know the exact date of the change, and we do have a prior distribution for $\theta$ that we consider reasonable, we can collect the new sales data each month and track the evolution of the posterior, seeing how much it changes.
But how do we measure this change in distribution?
After all, we would expect to see changes just from the random variation in the data, so how do we decide that the difference is large enough to know it is not just due to chance?
If this question seems familiar, it is probably because this is exactly why statistical tests were created, but I am reluctant to go that route. Statistical significance is an important idea, but is not always as useful as you might think4.
We could just use something like the Kolmogorov-Smirnoff test, or perhaps one of the many f-divergence measures, and those have their place, but right now, I want to develop some more intuition for this, so I am going to start simpler: measure the area of the intersection of the two distributions.
At this point, I am not sure if this metric meets the standard requirements of a 'distance measure' - things like the triangle inequality - but it has the benefit of being simple, intuitive and easy to calculate, so I will start with that.
In mathematical terms, if we have two distributions $P(x)$ and $Q(x)$, I propose calculating the value of
$$ \int_0^1 min(P(x), Q(x)) dx. $$
A simple discretised version of this integral will work fine to calculate this value, and this is implemented via a function I have written
common_area_metric(x, P, Q). Having looked at this value, it might be interesting to then compare it to the KS-value for the two distributions as well. Since both values involve areas under the probability distribution function, albeit in slightly different ways, it is likely they yield similar conclusions.
We will consider all of this in more detail in the next article of this series, but for now, I want to see how this value behaves after the change:
base_dens <- highbase_plotdata_dt[plotdate == "20140101", prob_dens]; highbase_commonarea_dt <- highbase_plotdata_dt[, list(common_area = common_area_metric(theta, prob_dens, base_dens)), by = rate_date]; ### We set the limit for the y-axis to be above 1 as rounding ### could result in the value of the metric being slightly above that commonarea_metric_plot <- qplot(rate_date, common_area, data = highbase_commonarea_dt, geom = 'line', ylim = c(0, 1.05), xlab = "Date", ylab = "Common Area"); ggsave(commonarea_metric_plot, file = 'commonarea_metric_plot.png', width = 14, height = 10);
So, at a first glance at least, the common area beneath the curves seems to decrease as we get more data, so it is something we could consider.
First, we added more noise into our data, as our work in Part 1 was a little too clean for comfort - real life problems are typically much messier. We turned our underlying conversion rate into a stochastic variable and observed that the posterior distribution for $\theta$ still shows movement towards the new value.
Second, having realised that using all the data available before the change may not be the right approach, we instead constructed a weaker prior and observed how the posterior changed. This resulted in much better behaviour
Finally, we started thinking about how this problem would need to be dealt with in a real-life scenario, and started discussing how we might measure changes in the distribution of $\theta$.
In the next article we will investigate the behaviour of our measures of differences in distribution, developing some intuition for what kinds of time-dependent behaviours we could expect. That way, when we observe these behaviours in real-time, we will be much better equipped to interpret the results we are seeing.
It also tends to be highly task specific in terms of choices that are made, but I think it is a topic worth exploring all the same. ↩
This does not mean that there is no target of course, but the born-skeptic in me senses that such targets are most likely born from a curious cocktail of smooth-salesman-speak and wishful thinking. ↩
It is also a lot more technical and subtle than it first appears, and I want to give my poor brain a break. ↩