Probabilistic Graphical Models for Fraud Detection - Part 3

We finish our series on Bayesian networks by discussing conditional probability, more complex models, missing data and other real-world issues in their application to insurance modelling.

In the previous article we introduced a very simple model for medical non-disclosure, set up the network in R using the gRain package, then used it to estimate the conditional probability of the necessity of a medical exam, $M$, given various evidence.

We discussed flaws of the model in its current form, observing how changes in declared health status affect the likelihood that a medical exam will discover issues impacting the underwriting decision for the policy.

In this final post, we investigate further problems with the model, focusing on getting better outputs and looking into other potential uses for the model. In particular, we look at ways to deal with missing data and investigate potential iterations and improvements.

The Flow of Conditional Probability

A useful way to aid understanding of Bayesian networks is to consider the flow of conditional probability. The idea is that as we assert evidence - setting nodes on the graph to specific values - we affect the conditional probability of nodes elsewhere on the network.

As always, this is easiest done by visualising the network, so I will show it again:


Readers may wish to refer to the previous article for explanations of the nodes.

So how does this work? We have built a model, so we play with the network and try to understand the behaviour we observe.

In the previous article we took a similar approach, but focused entirely on how such a toy model could be used in practice: setting values for the declared variables of each condition and observing the effect on the conditional probability for $M$.

This time we take a more holistic approach, observing the effect of evidence on all nodes on the network. For brevity here we will focus on just a handful of examples.

We start with the smoking condition, in particular the interplay between $HN$, $TS$ and $DS$ as that is easy to grasp and shows some interesting interactions. Here's the relevant code again from the previous post:

hn <- cptable(~HN  
             ,values = c(0.01, 0.99)
             ,levels = c("Dishonest", "Honest"));

ts <- cptable(~TS  
             ,values = c(0.60, 0.20, 0.20)
             ,levels = c("Nonsmoker", "Quitter", "Smoker"));

ds <- cptable(~DS | HN + TS  
             ,values = c(1.00, 0.00, 0.00  # (HN = D, TS = N)
                        ,1.00, 0.00, 0.00  # (HN = H, TS = N)
                        ,0.50, 0.40, 0.10  # (HN = D, TS = Q)
                        ,0.05, 0.80, 0.15  # (HN = H, TS = Q)
                        ,0.30, 0.40, 0.30  # (HN = D, TS = S)
                        ,0.00, 0.10, 0.90) # (HN = H, TS = S)
             ,levels = c("Nonsmoker", "Quitter", "Smoker"));

For clarity, according to the above CPT definitions, if a person is dishonest and has quit smoking, the probabilities of declaring as a non-smoker, quitter or smoker are $0.50$, $0.40$ and $0.10$ respectively. We can argue these levels[1] later: for the moment we care more about how evidence and conditional probabilities interact than about the values themselves.

We run a query on the network and calculate the two marginal distributions[2] for $TS$ (True Smoker) and $DS$ (Declared Smoker).

> querygrain(underwriting.grain
             ,nodes = c("DS", "TS")
             ,type = "marginal");

$TS
 Nonsmoker   Quitter    Smoker
       0.6       0.2       0.2

$DS
 Nonsmoker   Quitter    Smoker
    0.6115    0.1798    0.2087

As expected, $TS$ is as set in its CPT, but $DS$ is a little more complicated as it also incorporates the probability distribution of $HN$ (Honesty).

What if we have evidence the applicant is dishonest, $HN = \text{Dishonest}$?

 > querygrain(underwriting.grain
           ,nodes = c("DS", "TS")
           ,evidence = list(HN = 'Dishonest')
           ,type = "marginal");

$TS
 Nonsmoker   Quitter    Smoker
       0.6       0.2       0.2

$DS
 Nonsmoker   Quitter    Smoker
      0.76      0.16      0.08

This means that knowing the value of $HN$ (Honesty) does not affect the value of $TS$ (True Smoker) at all, but does shift the probabilities for $DS$ (Declared Smoker) towards the healthier end of the scale - which makes sense.

What if we have evidence the applicant is honest, $HN = \text{Honest}$?

 > querygrain(underwriting.grain
           ,nodes = c("DS", "TS")
           ,evidence = list(HN = 'Honest')
           ,type = "marginal");

$TS
 Nonsmoker   Quitter    Smoker
       0.6       0.2       0.2

$DS
 Nonsmoker   Quitter    Smoker
      0.61      0.18      0.21

Once again $TS$ is unchanged, and there is only a slight shift in the distribution of $DS$ towards the healthier end of the scale - much less than before. This makes sense: 99% of applicants are honest according to the distribution for $HN$, so honesty is already largely accounted for in the marginal distribution of $DS$ even before we set a value for $HN$.
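As a sanity check, we can reproduce this marginal by hand: conditioned on $HN = \text{Honest}$, the marginal for $DS$ is just the $TS$ prior multiplied through the honest rows of the $DS$ CPT defined above.

```r
# prior for TS and the DS | (HN = Honest, TS) rows from the CPT above
p.ts <- c(0.60, 0.20, 0.20)
ds.given.honest <- rbind(c(1.00, 0.00, 0.00)   # TS = Nonsmoker
                        ,c(0.05, 0.80, 0.15)   # TS = Quitter
                        ,c(0.00, 0.10, 0.90)); # TS = Smoker

# P(DS | HN = Honest) = sum over TS of P(TS) * P(DS | HN = Honest, TS)
as.vector(p.ts %*% ds.given.honest)
# 0.61 0.18 0.21 - matching the querygrain() output
```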

It is obvious from the graph that $HN$ and $TS$ are independent, but what about conditional dependence given $DS$? Put another way, if we know the value of $DS$, are they still independent? We can test this using the network: fix a value for $DS$ and observe the effect of $HN$ and $TS$ on each other.

Before we do this, what is the effect of declaring as a Quitter $(DS = \text{Quitter})$ upon $TS$ and $HN$?

 > querygrain(underwriting.grain
           ,nodes = c("TS", "HN")
           ,evidence = list(DS = 'Quitter')
           ,type = "marginal");

$HN
  Dishonest     Honest
 0.00889878 0.99110122

$TS
 Nonsmoker   Quitter    Smoker
  0.000000  0.885428  0.114572

Interesting. There is little effect on $HN$, but the effect on $TS$ is stark. As it stands, if you declare as a Quitter, there is zero probability of $TS = \text{Non-smoker}$. This is a potential problem, arising from the CPT for $DS$. If you look at the probabilities for $DS$, declaring as a Non-smoker is probability $1.0$ if you are a non-smoker, and the probabilities for the other two levels are $0.0$.

Zero probabilities are problematic: once one appears, any product involving it is also zero, and the zero propagates through the model. A first iteration of this model might make these conditional probabilities a little less extreme, which is why smoothing to small but non-zero values is helpful.
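As a sketch of that first iteration, we could soften the hard 0/1 rows of the $DS$ CPT directly. The exact values here are illustrative, not recommendations:

```r
# same CPT as before, with the 0.00/1.00 entries nudged to small non-zero values
# (each row still sums to 1)
ds.soft <- cptable(~DS | HN + TS
                  ,values = c(0.98, 0.01, 0.01  # (HN = D, TS = N)
                             ,0.98, 0.01, 0.01  # (HN = H, TS = N)
                             ,0.50, 0.40, 0.10  # (HN = D, TS = Q)
                             ,0.05, 0.80, 0.15  # (HN = H, TS = Q)
                             ,0.30, 0.40, 0.30  # (HN = D, TS = S)
                             ,0.01, 0.09, 0.90) # (HN = H, TS = S)
                  ,levels = c("Nonsmoker", "Quitter", "Smoker"));
```

The `smooth` argument to `grain()` achieves something similar automatically when CPTs are estimated from data.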

What if we know that $TS = \text{Quitter}$?

 > querygrain(underwriting.grain
           ,nodes = c("HN")
           ,evidence = list(DS = 'Quitter', TS = 'Quitter')
           ,type = "marginal");
$HN
  Dishonest     Honest
 0.00502513 0.99497487

We already had $DS = \text{Quitter}$ so adding the knowledge that $TS = \text{Quitter}$ makes it more likely that the applicant is honest.
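This posterior is simple enough to verify by hand with Bayes' rule, using the prior for $HN$ and the $TS = \text{Quitter}$ rows of the $DS$ CPT:

```r
# P(HN = Dishonest) * P(DS = Quitter | HN = Dishonest, TS = Quitter)
p.d <- 0.01 * 0.40

# P(HN = Honest)    * P(DS = Quitter | HN = Honest,    TS = Quitter)
p.h <- 0.99 * 0.80

# posterior probability of dishonesty
p.d / (p.d + p.h)
# 0.00502513 - matching the querygrain() output
```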

Put another way, $TS$ and $HN$ are conditionally dependent given $DS$. Given that the network is built from conditional dependencies and CPTs, how do we read a Bayesian network to determine conditional dependencies?

Physical analogies are helpful. Imagine that gravity works in the direction of the arrows, with evidence acting as a block in the pipe at that node. To consider the effect of a single node on the others, imagine water being poured into the network at that node. The water (the conditional probability) flows down the pipes as the arrows dictate. However, if a node is blocked by evidence, the water will start to fill up against the direction of the arrow.

Similarly, if the water approaches a node from 'below' (that is, against the arrow direction) evidence will also block it from flowing further upwards. We will show this shortly.

In our previous example, with no evidence on the network, the effect of $HN$ was unable to reach $TS$: it flowed down through $DS$ to $SS$. With evidence on $DS$, however, it flows up the 'pipe' to affect $TS$.
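We can see the pipe opening directly on the network, assuming `underwriting.grain` from the previous post is still in memory:

```r
# no evidence on DS: evidence on HN cannot reach TS, so TS keeps its prior
querygrain(underwriting.grain, nodes = "TS"
          ,evidence = list(HN = "Dishonest"))$TS
# Nonsmoker 0.6, Quitter 0.2, Smoker 0.2

# with DS observed, the same evidence on HN flows up the pipe into TS
querygrain(underwriting.grain, nodes = "TS"
          ,evidence = list(HN = "Dishonest", DS = "Quitter"))$TS
# Nonsmoker 0.0, Quitter 0.5, Smoker 0.5
```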

Creating Data from the Network

The most tedious aspect of creating a Bayesian network is constructing all the CPTs, a process with plenty of potential for error. In many cases this is unavoidable, but we have an alternative in situations with a sufficient amount of complete data: gRain provides functionality to use data alongside a graph specification to construct the Bayesian network automatically.[3]

This also works in reverse: we can use a Bayesian network to generate data, via the simulate() method:

> underwriting.sim.dt <- simulate(underwriting.grain, nsim = 100000);
> setDT(underwriting.sim.dt);
> print(underwriting.sim.dt);

             HN        TS         TB   TH        DS         DB           DH         SS         SB         SH         M
      1: Honest    Smoker       None None    Smoker     Normal         None    Serious NotSerious NotSerious NoMedical
      2: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
      3: Honest Nonsmoker      Obese None Nonsmoker      Obese         None    Serious NotSerious NotSerious   Medical
      4: Honest   Quitter       None None   Quitter     Normal         None NotSerious NotSerious NotSerious NoMedical
      5: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
  99996: Honest   Quitter Overweight None Nonsmoker Overweight         None NotSerious NotSerious NotSerious NoMedical
  99997: Honest   Quitter       None None   Quitter     Normal         None NotSerious NotSerious NotSerious NoMedical
  99998: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
  99999: Honest    Smoker       None None   Quitter Overweight HeartDisease NotSerious NotSerious NotSerious   Medical
 100000: Honest Nonsmoker       None None Nonsmoker     Normal HeartDisease NotSerious NotSerious NotSerious NoMedical

With this data in hand, we can go the other way and use it to recreate the network. The DAG is specified as a list: each entry specifies a node in the network, and a node with parents is given as a character vector with the parent nodes following the defined node.

We could specify the DAG for the underwriting network as follows:

underwriting.dag <- dag(list(  
     "HN", "TS", "TB", "TH"            # parents follow each node below;
    ,c("DS", "HN", "TS")               # structure matches the CPTs defined
    ,c("DB", "HN", "TB")               # in the previous post
    ,c("DH", "HN", "TH")
    ,c("SS", "TS", "DS")
    ,c("SB", "TB", "DB")
    ,c("SH", "TH", "DH")
    ,c("M",  "SS", "SB", "SH")));

To create a network using the DAG and the data, we do the following:

underwriting.sim.grain <- grain(underwriting.dag  
                               ,data   = underwriting.sim.dt
                               ,smooth = 0.1);

The unconditional output for $M$ should be similar for both:

> print(querygrain(underwriting.grain,     nodes = c("M"))$M);
   Medical NoMedical
   0.17793   0.82207
> print(querygrain(underwriting.sim.grain, nodes = c("M"))$M);
   Medical NoMedical
  0.176546  0.823454

So far, so good. What about applicants with a clean bill of health?

 > print(querygrain(underwriting.grain
                   ,nodes = 'M'
                   ,evidence = list(DS = 'Nonsmoker'
                                   ,DB = 'Normal'
                                   ,DH = 'None'))$M);
   Medical NoMedical
   0.14649   0.85351

 > print(querygrain(underwriting.sim.grain
                   ,nodes = 'M'
                   ,evidence = list(DS = 'Nonsmoker'
                                   ,DB = 'Normal'
                                   ,DH = 'None'))$M);
   Medical NoMedical
  0.145173  0.854827

There are slight differences between the two calculations, as expected. Most of this is likely sampling noise, but the smoothing of zero probabilities will also have a small effect.

We could use bootstrapping techniques to estimate this variability and gauge the effect of sampling error on our probabilities. This is straightforward to implement in R but I will leave it be for now.[4]
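A sketch of such a bootstrap, assuming the simulated records are in a data.table (here called `underwriting.sim.dt`) and using the DAG from above: resample the rows with replacement, refit the network each time, and collect the unconditional probability of a medical.

```r
boot.probs <- sapply(seq_len(200), function(i) {
    # resample the simulated records with replacement
    boot.dt    <- underwriting.sim.dt[sample(.N, .N, replace = TRUE)];

    # refit the network on the resampled data
    boot.grain <- grain(underwriting.dag, data = boot.dt, smooth = 0.1);

    querygrain(boot.grain, nodes = "M")$M["Medical"];
});

# a rough 95% interval for P(M = Medical)
quantile(boot.probs, probs = c(0.025, 0.975));
```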

Missing Data

One major issue often encountered in this situation is missing data.

Consider our non-disclosure model. The majority of policy applications are not referred for medical exams, so we never observe the true-status variables $TS$, $TB$ and $TH$. The $HN$ variable is also likely untested and unknown.

We could just reduce our dataset to complete cases, but this means discarding a lot of data and potentially biasing the results: there is no guarantee the incomplete data has the same statistical properties as the complete data.

In this situation, how can we construct the network?

My preferred method is to mirror the structure of the Bayesian network itself: build the network node by node, calculating each CPT from the rows where that node and its parents are all observed. This is time-consuming and tedious for large networks, but it maximises the use of the data: each CPT only needs complete cases for its own small subset of variables, not for the whole record.
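For example, the CPT for $DS$ only needs rows where $DS$, $HN$ and $TS$ are all observed, however patchy the rest of the record is. A sketch, assuming the raw records sit in a hypothetical data.table `policy.dt` with NAs for unobserved values:

```r
# keep only the rows where this node and its parents are observed
ds.complete <- policy.dt[complete.cases(policy.dt[, .(DS, HN, TS)])];

# empirical CPT: P(DS | HN, TS), normalising within each (HN, TS) combination
ds.cpt <- prop.table(xtabs(~ DS + HN + TS, data = ds.complete)
                    ,margin = c(2, 3));
```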

Alternatively, we can use more traditional missing-data techniques such as imputation to complete the dataset, and then construct the Bayesian network as before. Such imputation adds another level of complexity and its own modelling assumptions.

Finally, there is also the area of semi-supervised learning: techniques designed for situations with very small amounts of labelled data in otherwise unlabelled datasets. The idea is to make the best use of the small amount of labelled data to either transduce the missing labels (unsupervised learning / pattern recognition) or induce a result dependent on those labels (supervised learning). This is a very interesting area, but we will not discuss it further here.

Iterating and Improving the Model

Our proposed model is far from perfect. In fact, at this point it is little more than a toy model - helpful for developing familiarity with the approach and for identifying areas to improve upon, but far from a finished product.

Altering the CPT values

First we assume the network structure is acceptable and focus on improving the outputs of the model. We do this by investigating the values set in the CPTs and seeing if they can be improved.

We already mentioned that the unconditional output of the model for $M$ (i.e. the probability of $M = \text{Medical}$ without any other evidence on the network) is a little high at around $18\%$.

Fixing this is not trivial. As discussed, the interaction of the different CPTs makes it tricky to affect output values. This seems annoying but is a feature of the Bayesian network approach, rather than a bug.

First of all, surprising results do not mean wrong results; our intuition can be unreliable. It is best to ask questions of the model and see if the outputs make sense. If they do not, we investigate our data to confirm there really is a problem. If there is, we focus on the CPTs involved in the calculation and check them.

Suppose the unconditional probability for $M$ seems too high, and we have reason to expect a value more like $12\%$. How do we go about tweaking this output?

Looking at the network, the nodes with the most influence on the calculation of $M$ are the CPT for $M$ itself, and the nodes above it that condition that CPT: $SS$, $SB$ and $SH$.[5]

Let's try reducing the probabilities for $M = \text{Medical}$ a little:

m2  <- cptable(~ M | SS + SB + SH  
              ,values = c(0.99, 0.01        # (SS = S, SB = S, SH = S)
                         ,0.80, 0.20        # (SS = N, SB = S, SH = S)
                         ,0.85, 0.15        # (SS = S, SB = N, SH = S)
                         ,0.75, 0.25        # (SS = N, SB = N, SH = S)
                         ,0.80, 0.20        # (SS = S, SB = S, SH = N)
                         ,0.40, 0.60        # (SS = N, SB = S, SH = N)
                         ,0.70, 0.30        # (SS = S, SB = N, SH = N)
                         ,0.05, 0.95)       # (SS = N, SB = N, SH = N)
              ,levels = c("Medical", "NoMedical"));

underwriting.iteration.grain <- grain(compileCPT(list(hn  
                                                     ,ts, tb, th
                                                     ,ds, db, dh
                                                     ,ss, sb, sh
                                                     ,m2)));

querygrain(underwriting.iteration.grain, nodes = c("M"))$M;  
   Medical NoMedical
  0.133279  0.866721

We see that slightly reducing the strength of the effect of the 'seriousness' variables on $M$ brings the probability down to just over $13\%$. Of course, this does not mean we should reduce those probabilities; that is where domain knowledge can guide us.

Altering the Network

Often the model requires improvement more drastic than tweaking CPT values, and we must alter the network itself: adding or redefining variables or their relationships.

One quick improvement to the model might involve adding more medical conditions that affect pricing. Currently we consider three medical conditions, and adding more is straightforward; the one issue is that each additional 'seriousness' variable multiplies the number of conditioning levels needed to specify $M$.
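The growth is easy to quantify: with binary 'seriousness' variables, a model with $k$ conditions needs $2 \times 2^k$ entries in the CPT for $M$ - our three-condition model already needs 16.

```r
k       <- 3:6        # number of medical conditions
rows    <- 2 ^ k      # conditioning combinations (binary seriousness variables)
entries <- 2 * rows   # M has two levels: Medical / NoMedical

data.frame(conditions = k, cpt.rows = rows, cpt.entries = entries)
# 3 conditions gives 16 entries; 6 conditions already needs 128
```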

There is a potential problem with adding conditions. Going back to the physical analogy for a moment, we are adding extra channels into the $M$ variable, biasing the probability of a medical being necessary upwards. I am not sure how best to fix this problem, but it is an issue I would pay attention to.

Ultimately though, we may need to move beyond a Bayesian network, as they work best with discrete-valued variables.[6] Many medical readings are continuous in nature - weight, height, blood pressure, glucose levels and so on - and models should reflect this as much as possible.

To be clear, it is always possible to discretise the continuous variables in a network. This can be effective, but it loses information; using the continuous variables directly is better where possible.
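As an illustration, continuous BMI readings could be bucketed into the Normal / Overweight / Obese levels used by $DB$ with cut(); the readings and cut-points here are illustrative:

```r
bmi <- c(21.4, 27.8, 31.2, 24.9, 36.0)   # hypothetical continuous readings

# discretise into the three levels used by the DB node
cut(bmi
   ,breaks = c(0, 25, 30, Inf)
   ,labels = c("Normal", "Overweight", "Obese"))
# Normal Overweight Obese Normal Obese
```

This is exactly where the information loss occurs: a BMI of 25.1 and one of 29.9 land in the same bucket.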

DAGs may also prove too limiting, as many biometric readings are likely to be interdependent, which is more complicated to capture with a Bayesian network. We could add hidden variables to model these mutual dependencies via conditional dependence, but this often involves questionable assumptions. Using cyclical networks here may prove fruitful.

In short, the current model is really just the start: progress could be made along multiple approaches, all worthy of exploration at this point.

Conclusions and Summary

In this series of articles we looked at Bayesian networks and how they might be used in the area of fraud detection.

Despite obvious limitations, the approach seems sound and deals with major issues such as severe class imbalance and missing data in a natural way.

The 11-variable model we discussed, though simple, provides adequate scope for exploration and introspection, and is an interesting example of the approach.

I created the model on pen and paper in perhaps an hour, after a few aborted attempts, and we went through its implementation in R: nothing too complex. Despite this, it produced a number of surprising results and allowed us to learn some non-intuitive facts about the model.

I am happy with the outcomes produced, and I think Bayesian networks are an excellent way to start learning about probabilistic graphical models in general.

While there is no shortage of further topics, we will leave it here. A proper treatment of any of them would take at least half a post in itself, and they may well be the subject of future posts as I explore the area further.

As always, if you have any comments, queries, corrections or criticisms, please get in touch with us, we would love to hear what you have to say.

  1. The 0.10 probability of declaring as a smoker is possibly too high, but this goes back to the importance of precise definitions for variables. I consider dishonesty here as a personality trait that creates a tendency to under-declare health conditions, rather than an absolute. My thinking is that even dishonest people can tell the truth or get forms wrong. This is definitely where subject matter expertise is hugely useful to direct modifications.

  2. In this article I use marginal distributions for illustration rather than the joint or conditional distributions. Whilst less direct, I think marginal distributions are easier to understand, as each is just a discrete univariate distribution of probabilities and avoids the necessity of doing arithmetic. I would rather execute more code and have the answer stare right at me than open the possibility of misinterpreting the output. This stuff is subtle enough as it is.

  3. It is also possible to use this data to derive the structure of the Bayesian network itself. This is known as structural learning or model selection. Code to do this requires additional packages and comes with quite a health warning - silly outputs are all too common. We will not discuss this further.

  4. The bootstrap, and resampling techniques in general, is a fascinating approach that I will probably cover in a blogpost in the near future. It is a hugely useful tool in almost all areas of statistical modelling.

  5. As always, the code used in writing this series is available on BitBucket. Get in touch with us if you would like access.

  6. I may be wrong about this limitation with discrete variables. Please correct me if this is wrong.

Mick Cooney

Mick is highly experienced in probabilistic programming, high performance computing and financial modelling for derivatives analysis and volatility trading.