*Real data is often unavailable for creating demos, for learning, and especially for publishing. Here we describe methods for generating realistic artificial data that comes with far fewer constraints.*

Data protection and data privacy are crucial concerns in the modern world, rightly requiring that enterprises owning and processing data about customers and the public adhere to legal and also ethical constraints. Many enterprises find it complex and time-consuming to fine-tune their data storage and processing protocols, and so play it safe by enforcing very conservative rules about when and where the data can be accessed.^{1}

A lot of our work at Applied AI is focussed on the insurance industry, where customer data is especially sensitive - often including financial, medical and demographic information. The constraints surrounding such data are of course important, but they have the side-effect of making it tricky to create realistic demos and present new analytical concepts, since the models and analyses need data.

All is not lost. This obstacle is more a Le Petomane Thruway^{2} than the Himalayas since with modern computing it is easy to generate fake data.

Generating fake data (that is, synthesising realistic but information-free data) is something we have discussed before, and it is a very good way of becoming familiar with new analysis techniques: you can check whether your methods are capable of recovering the parameters you used to create the data. This assumes your data generation is the reverse of the modelling process, which is not always the case, and we will discuss this further later in this post.

# Creating a Synthetic Book of Life Insurance Policies

Our aim is to create a book of fake policy data that at least superficially resembles the policies owned by an average, medium-sized life insurance company. We could do it the simple way: figure out some basic features that are common to policies such as gender, smoking status, date-of-birth and policy start date, but we also want to attach some geospatial data to our policies in a meaningful way.

To make this even more realistic, let's pick a country - Ireland - and tailor the synthetic data to the socioeconomic geography there. This means we build conditional dependencies into the data generation, which is more involved than just randomly creating data, but is tractable.

We will use the outputs of some of our prior work creating socioeconomic clusters of Irish society based on data from the 2011 census - data provided by the Irish Central Statistics Office (CSO). The census asks a wide range of questions, per household, about the backgrounds, educations, jobs and daily lives of the occupants. After collection, processing and basic analysis, the CSO makes this data publicly available, although the household information is aggregated geographically into 18,000 'small areas', each consisting of approximately 400 neighbouring households.

Our prior work uses unsupervised machine learning to automatically cluster these small areas into larger groups by similarity in the census answers. Thus, by knowing a customer's location, we can assign them to a group and make a reasoned estimate of their socioeconomic profile. Aggregating further, we can group customers by policy type and start making statements about the socioeconomic profiles of policy types. We can then start generating data for each policy.

## Policy Types

Most insurance companies offer a wide variety of products to their customers, but we will focus on four main types: protection policies, pension policies, savings policies and alternative retirement funds. In all four cases there is a lot of variation in products within a type, but we ignore those distinctions in this project.

- **Protection:** The standard life protection policy. The policy holder pays a fixed number of regular premia and, in return, the holder's dependants receive an agreed amount, the *Sum Insured*, if the policy holder dies within the lifetime of the policy
- **Pension:** Pension products are investment products aimed at saving for retirement. Generally treated beneficially by the tax authorities, they are otherwise similar to Savings policies
- **Savings:** Savings policies are investment products used by people expecting to earn a return on their investment. Usually less restricted in terms of how the money is invested, they are otherwise very similar to Pension policies
- **Alternative Retirement Funds:** Purchased by retired individuals as an alternative to buying an annuity, ARFs are funds of capital from which the owner draws down a certain amount each year while the rest remains invested, allowing for further capital growth. These products are predominantly owned by older, more affluent clients

## Initial Policy Generation

As we discussed, we will generate this data in sequential steps:

- We decide the overall balance of cluster types across the book, and then proportionately allocate a cluster id to each policy
- We then randomly allocate each policy to a 'small area' according to a matching cluster type. If desired, we can use the geospatial data supplied by the census to randomly choose a lat/long coordinate for the policy, but this slows down the data generation considerably, and may be omitted
- We then assign socioeconomic variables to each policy according to its cluster id
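The first of these steps can be sketched in a few lines of base R. The cluster labels, proportions and policy count here are made up for illustration, not the values used in the real routine:

```r
### Hypothetical overall balance of six cluster types across the book
cluster_props <- c(c0 = 0.30, c1 = 0.25, c2 = 0.15,
                   c3 = 0.10, c4 = 0.10, c5 = 0.10)

### Proportionately allocate a cluster id to each policy
n_policy   <- 1000
cluster_id <- sample(names(cluster_props), n_policy,
                     prob = cluster_props, replace = TRUE)

### The empirical proportions should be close to cluster_props
round(prop.table(table(cluster_id)), 2)
```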

The random allocation is weighted by population size: each small area has population statistics, so the sampling is weighted by that size. This is something of a simplification - sampling by household count would be a better mapping, since policies are generally sold by household.

It is possible to calculate these values from the census data but for simplicity we will sample by population counts.

Each cluster has its own distribution of policy types: the proportion of each policy type in the book, conditional on the policy being owned by a household within the given cluster. This distribution is unique to each cluster, and we sample the policy type for each policy from it.

```
> cluster_product_mapping_dt[cluster_level == 'n6',
                             .(cluster_id, protection, pension, savings, arf)]
   cluster_id protection pension savings  arf
1:         c0       0.59    0.25    0.15 0.01
2:         c1       0.69    0.20    0.10 0.01
3:         c2       0.59    0.10    0.30 0.01
4:         c3       0.79    0.10    0.10 0.01
5:         c4       0.20    0.40    0.30 0.10
6:         c5       0.70    0.10    0.15 0.05
```
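To illustrate the conditional draw, here is a minimal base-R sketch using two rows of the mapping shown above. The matrix layout and the function name `sample_prod_type` are illustrative, not the routine's actual code:

```r
### Hypothetical stand-in for the cluster-product mapping, as a base-R
### matrix (rows are clusters, columns are policy types)
prod_props <- rbind(
    c0 = c(protection = 0.59, pension = 0.25, savings = 0.15, arf = 0.01),
    c4 = c(protection = 0.20, pension = 0.40, savings = 0.30, arf = 0.10)
)

### Draw n policy types conditional on the policy's cluster id
sample_prod_type <- function(cl, n) {
    sample(colnames(prod_props), n, prob = prod_props[cl, ], replace = TRUE)
}

prod_type <- sample_prod_type("c4", 1000)
```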

## Generate the Policy Data

Once we have the policy type, we can then use the selection to help generate other data for the policy, such as:

- policy start date
- policy premium aka 'annual premium equivalent' (APE)
- policy is single-coverage or joint-life
- customer date of birth (based on a sampled age of the holder at the start of the policy)
- customer gender
- customer smoking status
- ... and so on.

```
> product_data_dt
    prod_type age_min age_max age_mean age_sd prem_sp_prop prem_multiplier prem_min
1:    savings      18      90       40     10          0.2               5      600
2: protection      16      70       40     10          0.0               1      120
3:    pension      18      65       40     10          0.3               5      600
4:        arf      50      90       70     10          1.0              50     1000
```

# The Data Generation Routine

I have wrapped most of this logic into a single function that allows an arbitrary number of policies to be created. This code is currently stored in a BitBucket repository; if you are interested in some of the finer details of this process, we can make access available upon request.

I will not detail the full routine `create_policy_data()`, but there are a few interesting snippets from it worth discussing.^{3}

There is nothing complex in the code about any of the random data generated: the age of the policy holder is generated via a normal distribution using parameters set by the policy type, and the policy premium is drawn from a Gamma distribution.^{4}

The age of the policy is drawn from a Poisson distribution with a given mean age (10 years seems like a reasonable default). Having set the year, we then randomly choose a day within that year to get a spread of dates.

First, having allocated a cluster to each policy in proportion to the weights we have set, we want to sample a small area within that cluster, weighted by population size. To do that, we use the `total2011` counts as our weighting values. We could just use the totals as weights directly in `sample()`. That should work, but there are often thousands of small areas in each cluster, so that gave me a bad feeling.

Instead, I took a cumulative sum of the distribution, made a random draw from the uniform distribution $(0, N_{tot})$ and then found the index of the small area corresponding to the index of the cumulative sum vector that first exceeds the random draw. The output of small areas should then be approximately distributed according to the population size.

Once we have the small area, we then look up the proportions for that cluster type and draw the product type for this policy.

```
#### Then we determine the small area of the policy holder
sample_cluster_data <- function(iter_id) {
    ### First we sample from the appropriate small areas to get some data
    N      <- length(cluster_id[cluster_id == iter_id]);
    use_dt <- smallarea_dt[cluster_id %in% iter_id];

    ### Draw uniformly on 1..N_tot; for each draw, the sampled small area
    ### is the first index at which the cumulative sum reaches the draw
    pop_cumsum <- cumsum(use_dt$total2011);
    sample_val <- sample(1:pop_cumsum[nrow(use_dt)], N, replace = TRUE);
    idx        <- sapply(1:N, function(iter) min(which(sample_val[iter] <= pop_cumsum)));
    sa_id      <- use_dt[idx]$sa_id;

    ### Now we use the product data to calculate the policy type
    use_dt <- cluster_prod_dt[cluster_id %in% iter_id];

    prod_type_values <- use_dt$variable;
    prod_type_prop   <- use_dt$value;
    prod_type        <- sample(prod_type_values, N, prob = prod_type_prop, replace = TRUE);

    cluster_data_dt <- data.table(cluster_id = iter_id,
                                  sa_id      = sa_id,
                                  prod_type  = prod_type,
                                  idx        = idx);

    return(cluster_data_dt);
}
```
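As an aside, base R's `findInterval()` performs this cumulative-sum lookup in a single vectorised call, avoiding a per-draw loop entirely: for integer draws, `findInterval(draw - 1, cumsum) + 1` gives the first index whose cumulative sum reaches the draw. A minimal sketch with made-up weights:

```r
### Made-up population weights for five small areas
weights    <- c(120, 480, 300, 50, 850)
pop_cumsum <- cumsum(weights)

### Uniform draws on 1..N_tot, then the first cumsum index reaching each draw
sample_val <- sample.int(pop_cumsum[length(pop_cumsum)], 10000, replace = TRUE)
idx        <- findInterval(sample_val - 1, pop_cumsum) + 1

### Empirical frequencies should track the weights
round(prop.table(table(idx)), 2)
```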

To create some individual data for the policy, we look up the aggregated data we have set for each product type. We then use these parameters as inputs to various distributions for data we will need. For example, the age of the policy holder at the start of a policy is set by product type, and we use a normal distribution for these ages. This is largely a question of choice, but it seems intuitive that ages of policy holders for a given type would be normally distributed.

In our experience, policy premium data is often heavy-tailed, so, as I mentioned before, I chose to use a Gamma distribution. Using this data in practice may well lead me to amend or tweak the way I do this random generation.

```
### We now create some policy data
sample_product_data <- function(iter_type) {
    use_dt <- product_data_dt[prod_type == iter_type];
    N      <- dim(policy_dt[prod_type == iter_type])[1];

    policy_start_age <- round(rnorm(N, use_dt$age_mean, use_dt$age_sd), 0);
    policy_start_age <- pmax(use_dt$age_min, policy_start_age);
    policy_start_age <- pmin(use_dt$age_max, policy_start_age);

    prem_ape <- rgamma(N, shape = premape_shape, scale = premape_scale) * use_dt$prem_multiplier;
    prem_ape <- round(prem_ape, 2);
    prem_ape <- pmax(use_dt$prem_min, prem_ape);

    rpsp_prop <- use_dt[, c(1 - prem_sp_prop, prem_sp_prop)];
    prem_type <- sample(c("RP", "SP"), N, prob = rpsp_prop, replace = TRUE);

    prod_data_dt <- data.table(prod_type       = iter_type,
                               prem_type       = prem_type,
                               policy_startage = policy_start_age,
                               prem_ape        = prem_ape);

    return(prod_data_dt);
}
```

Having determined the age of the policy (using a Poisson distribution with a given mean) and hence a random start date, we can then determine the date of birth of the policy holder by subtracting the holder's starting age from `policy_startdate`. This seems a little convoluted, but it makes the input data much more intuitive: we separate the age at which the policy holder bought the policy from the holder's age at the time of the data snapshot. The age at policy start is inevitably correlated with product type.

The date arithmetic looks like:

```
### Create some datetime values
policy_dt[, policy_age := rpois(.N, policyage_mean)];
day_modifier <- sample(1:365, N, replace = TRUE) - 1;
policy_dt[, policy_startdate := data_date - (365 * policy_age)];
policy_dt[, policy_startdate := policy_startdate - day_modifier];
day_modifier <- sample(1:365, N, replace = TRUE) - 1;
policy_dt[, dob_life1 := policy_startdate - (365 * policy_startage)];
policy_dt[, dob_life1 := dob_life1 - day_modifier];
```

Finally, we may wish to convert some of the character vectors to factors to help reduce memory usage. If memory is not a big problem we can skip this.
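As a rough illustration of the saving, assuming a long character column with only a handful of distinct values (as with `prod_type`):

```r
### One million rows, four distinct values, as character and as factor
prod_type_chr <- sample(c("protection", "pension", "savings", "arf"),
                        1e6, replace = TRUE)
prod_type_fct <- as.factor(prod_type_chr)

### The factor stores one integer per row plus the four level strings,
### so it should come in smaller than the character version
object.size(prod_type_chr)
object.size(prod_type_fct)
```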

At the end of this we have an example book of policies that we can use as inputs to some visualisation apps we have been working on.

Sample output of the routine is shown below:

```
> policy_dt
policy_id countyname edname nuts3name sa_id cluster_id prod_type prem_type prem_ape policy_startdate
1: C010000082 Dublin City North Dock C Dublin A268109008 c0 protection RP 1331.92 2007-04-25
2: C010000152 Dún Laoghaire-Rathdown Dundrum-Balally Dublin A267078011/02 c0 pension SP 4570.55 2004-10-03
3: C010000308 Wexford County Enniscorthy Rural South-East (IE) A247045009 c0 protection RP 602.66 2003-12-17
4: C010000312 Meath County Julianstown Mid-East A167038017 c5 protection RP 346.96 2005-07-27
5: C010000509 Monaghan County Monaghan Rural Border A177058021 c2 protection RP 120.00 2006-12-14
---
999996: C099999668 Dublin City Terenure C Dublin A268146002 c4 arf SP 19798.48 1997-04-24
999997: C099999751 Clare County Ballyglass Mid-West A037009015 c5 protection RP 371.07 2007-06-18
999998: C099999915 Clare County Rath Mid-West A037136001 c2 protection RP 240.00 2003-02-16
999999: C099999942 Longford County Meathas Truim Midland A137029001 c2 protection RP 1469.74 2001-11-11
1000000: C099999958 Meath County Donaghmore Mid-East A167025032 c5 protection RP 2325.90 2009-05-20
policy_enddate policy_duration dob_life1 gender_life1 smoker_life1 isjointlife islifeonly mortgage_status lng lat
1: 2027-04-25 240 1986-01-28 M S FALSE TRUE MORTFULL -6.24876 53.3505
2: 2096-05-17 NA 1976-05-17 M S NA NA NA -6.21507 53.2763
3: 2023-12-17 240 1958-03-05 M S TRUE TRUE MORTDECR -6.59218 52.5122
4: 2025-07-27 240 1978-09-23 M N FALSE TRUE MORTDECR -6.25455 53.7000
5: 2026-12-14 240 1960-09-01 F N FALSE FALSE TERM -6.95886 54.2591
---
999996: 2058-02-03 NA 1938-02-03 F N NA NA NA -6.29949 53.3103
999997: 2027-06-18 240 1972-07-20 F N FALSE FALSE MORTDECR -8.59987 52.6863
999998: 2018-02-16 180 1962-10-05 M N FALSE TRUE MORTDECR -9.09513 52.9568
999999: 2021-11-11 240 1970-08-05 F N FALSE TRUE MORTFULL -7.60612 53.7061
1000000: 2029-05-20 240 1975-05-27 M S FALSE TRUE MORTDECR -6.38120 53.5130
```

With a little more work, we can also bolt on additional information should we require it, such as policy lapse data. This could then feed the survival models we use to model policy lapse, and various other pieces of work we have done.

# Data Validation and Verification

An important but often-overlooked aspect of this process is checking that the data generated makes sense. If we end up with a 'fake' dataset that does not look realistic, it is of no use to anyone.

The diagnostics and other checks we choose to run are somewhat arbitrary but, much like the modelling process, simple, high-level visualisations are sufficient for the moment. Once we are happy with the data, we can start using it for whatever purposes we need, and then check that the outputs of those uses also make sense. An iterative approach is probably best.^{5}

I will not go into a lot of detail here, but will perform a few diagnostics and leave the rest to the imagination of the reader. There are no hard and fast rules; I would suggest first checking the columns that you are particularly interested in.
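A few cheap structural checks are a sensible starting point. This sketch uses a tiny hand-made stand-in for `policy_dt` (the values are invented), but the assertions themselves carry over to the real book:

```r
### Toy stand-in for a few columns of policy_dt (hypothetical values)
policy_dt <- data.frame(
    prem_ape         = c(1331.92, 4570.55, 602.66),
    gender_life1     = c("M", "M", "F"),
    smoker_life1     = c("S", "S", "N"),
    policy_startdate = as.Date(c("2007-04-25", "2004-10-03", "2003-12-17"))
)

### Cheap structural checks before any plotting
stopifnot(all(policy_dt$prem_ape > 0))
stopifnot(all(policy_dt$gender_life1 %in% c("M", "F")))
stopifnot(all(policy_dt$smoker_life1 %in% c("S", "N")))
stopifnot(!anyNA(policy_dt$policy_startdate))
```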

### Column: Policy Startdate (`policy_startdate`)

We first look at the years of `policy_startdate`, creating a histogram and checking that what we see is as expected.

```
qplot(format(policy_startdate, "%Y"), data = policy_dt, geom = 'histogram', xlab = 'Policy Start Year', ylab = 'Policy Count')
```

One thing does stand out to me: I am not sure my choice of using a Poisson process for the policy age is the right one.

As we discussed, for simplicity we are ignoring whether or not each policy is in-force, and so this choice seems flawed. Were we looking only at in-force policies, a bell curve might be plausible; absent that, it is probably better to use some kind of uniform distribution with the mean age somewhere around the middle of the range. That way, we can assume that new business is added in a roughly consistent way.
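A sketch of that alternative, assuming a hypothetical maximum policy age of 20 years so the mean lands near the middle of the range:

```r
### Uniform policy age in whole years over 0..19; mean should be near 9.5
N          <- 1e5
policy_age <- sample(0:19, N, replace = TRUE)

### New business then arrives at a roughly constant rate per year
round(prop.table(table(policy_age)), 3)
```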

For further levels of detail, it may be desirable to have different distributions for different periods of time, allowing for macroeconomic changes in the business environment likely to affect the levels of new business.

We will also check the distribution of months in the column:

```
qplot(format(policy_startdate, "%m"), data = policy_dt, geom = 'histogram', xlab = 'Policy Start Month', ylab = 'Policy Count')
```

The distribution of months seems uniform, so that is as expected.

### Column: Policy Premium APE (`prem_ape`)

Annual Premium Equivalent (APE) is a standard measure of premium size in a policy. For regular premium (RP) policies it is the total annual premium paid, and is 10% of the amount paid for a single premium (SP) policy:

```
qplot(prem_type, prem_ape, data = policy_dt, geom = 'boxplot') + facet_grid(prod_type ~ ., scales = 'free') + coord_flip() + scale_y_continuous(label = dollar);
```

The long tail is plain to see in this plot, and the faceting on the plots suggests that we have managed to capture the different mixtures of premia across the different products. It is worth remembering the naive method used to generate these fake premia values: real data is likely to have subtleties not (yet) included here.

# Summary

So, using 150 lines or so of R code, we can produce a book of policies of arbitrary size. This data is in a standard format, looks pretty reasonable, and is at a point where we can start using it in our other work.

Whilst relatively naive and blunt in terms of the data generation, the hierarchical nature of our generation process means we already have some depth in our data, and it lends itself well to extension for more sophisticated work. To be truly confident in the output we will need to use it, and iterate our processes based on what those uses reveal.

1. In many cases, we find businesses have neither the technical capability nor the legal framework to allow data to leave their network in any form: as a verbatim copy, in aggregated form, or even fully anonymised. This conservative approach is certainly safe in the near term but, looking into the future, it can be self-defeating. Competitors more willing to navigate the technical and legal frameworks are better placed to make use of scalable cloud-based infrastructure (e.g. Amazon Web Services or Microsoft Azure) and to out-compete via a massively increased capability for advanced data analysis. ↩
2. The main post image above is also taken from Blazing Saddles and its fake town of Rockridge, which helpfully fools the bad guys. ↩
3. As much as possible I try to avoid making copies of large data.table structures in memory in R, so I often take advantage of *lexical scoping*. By embedding function definitions inside other function definitions, you can use scoping and closures to avoid making copies of your data. The two functions **sample_cluster_data()** and **sample_product_data()** are defined inside the definition of **create_policy_data()**, so they can access data objects created during the flow control of **create_policy_data()** without the overhead of passing those objects as parameters. ↩
4. A lognormal distribution would also work, but I figured a Gamma distribution is better at capturing the longer tail. ↩
5. Isn't it always? ↩