Generating Fake Data for Fun and Profit

Real data is often unavailable for creating demos, for learning, and especially for publishing. Here we describe methods for generating realistic artificial data that carries far fewer constraints.

Data protection and data privacy concerns are crucial in the modern world, rightly requiring that enterprises owning and processing data about customers and the public adhere to legal and also ethical constraints. Many enterprises find it complex and time consuming to fine-tune their data storage and processing protocols, and so play it safe by enforcing very conservative rules about when and where the data can be accessed.1

A lot of our work at Applied AI is focussed on the insurance industry where customer data is especially sensitive - often including financial, medical and demographic information. The constraints surrounding such data are of course important, but have the side-effect of making it tricky to create realistic demos and present new analytical concepts since the models & analyses need data.

All is not lost. This obstacle is more a Le Petomane Thruway2 than the Himalayas since with modern computing it is easy to generate fake data.

Generating fake data (aka synthesising realistic but information-free data) is something we have discussed before. It is also a very good way of becoming familiar with new analysis techniques, since you can check whether your methods recover the parameters you used to create the data. This assumes your data generation is the reverse of the modelling process, which is not always the case; we will discuss this further later in this post.
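
As a minimal sketch of that parameter-recovery idea (not part of the routine described below), we can draw ages with known parameters and check that a standard fit recovers them:

    ### Minimal sketch of parameter recovery: generate data with known
    ### parameters and check that a maximum-likelihood fit recovers them.
    library(MASS);

    set.seed(42);

    fake_age <- rnorm(10000, mean = 40, sd = 10);

    ### fitdistr() returns the MLE estimates of the mean and sd
    age_fit <- fitdistr(fake_age, densfun = "normal");

    age_fit$estimate;    # should be close to mean = 40, sd = 10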


Creating a Synthetic Book of Life Insurance Policies

Our aim is to create a book of fake policy data that at least superficially resembles the policies owned by an average, medium-sized life insurance company. We could do it the simple way and generate only some basic features common to policies, such as gender, smoking status, date of birth and policy start date, but we also want to attach some geospatial data to our policies in a meaningful way.

To make this even more realistic, let's pick a country - Ireland - and tailor the synthetic data to the socioeconomic geography there. This means we build conditional dependencies into the data generation, which is more involved than just randomly creating data, but is tractable.

We will use the outputs of some of our prior work creating socioeconomic clusters of Irish society based on data from the 2011 census - data provided by the Irish Central Statistics Office (CSO). The census asks a wide range of questions, per household, about the backgrounds, educations, jobs and daily lives of the occupants. After collection, processing and basic analysis, the CSO makes this data publicly available, albeit the household information is aggregated geographically into 18,000 'small areas'. A 'small area' consists of approx. 400 neighbouring households. Our prior work uses unsupervised machine learning to automatically cluster these small areas into larger groups by similarity in the census answers. Thus by knowing a customer's location, we can assign them to a group and make a reasoned estimate of their socioeconomic profile. Aggregating further, we can group customers by policy type and start making statements about the socioeconomic profiles of policy types. We can then start generating data for that policy.
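
As a small sketch of how that lookup works (the smallarea_dt table with columns sa_id, cluster_id and total2011 is the one used in the code later in this post; the customer ids here are purely illustrative):

    ### Sketch: given a customer's small area, look up their socioeconomic cluster
    library(data.table);

    customer_dt <- data.table(customer_id = c("CU0001", "CU0002"),
                              sa_id       = c("A268109008", "A167038017"));

    customer_dt <- merge(customer_dt,
                         smallarea_dt[, .(sa_id, cluster_id)],
                         by = "sa_id", all.x = TRUE);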

Policy Types

Most insurance companies offer a wide variety of different products to their customers, but we will focus on four main types: protection policies, pension policies, savings policies and alternative retirement funds. In all four cases, there is a lot of variation in products within a type, but we ignore those distinctions in this project.

  1. Protection: The standard life protection policy. The policy holder pays a fixed number of regular premia and in return, the holder's dependants receive an agreed amount, the Sum Insured, if the policy holder dies within the lifetime of the policy
  2. Pension: Pension products are investment products aimed at saving for retirement. Generally treated beneficially by the tax authorities, they are otherwise similar to Savings policies
  3. Savings: Saving policies are investment products used by people expecting to earn a return on their investment. Usually less restricted in terms of choice of how the money is invested, they are otherwise very similar to Pension policies
  4. Alternative Retirement Funds: Purchased by retired individuals as an alternative to buying an annuity, ARFs are funds of capital where the owner draws down a certain amount of the fund each year while the rest of the money remains invested, allowing for further capital growth. These products are predominantly owned by older, more affluent clients

Initial Policy Generation

As we discussed, we will generate this data in sequential steps:

  1. We decide the overall balance of cluster types across the book, and then proportionately allocate a cluster id to each policy
  2. We then randomly allocate each policy to a 'small area' according to a matching cluster type. If desired, we can use the geospatial data supplied by the census to randomly choose a lat/long coordinate for the policy, but this slows down the data generation considerably, and may be omitted
  3. We then also assign a socioeconomic profile to each policy according to its cluster id.

The random allocation is weighted by population size; each small area has population statistics, so the sampling is weighted accordingly. This is something of a simplification: sampling by household count would be a better mapping, since policies are generally sold by household.

It is possible to calculate these values from the census data but for simplicity we will sample by population counts.
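
A minimal sketch of step 1 might look like the following. The cluster proportions here are illustrative rather than the ones used in the actual routine; in the real code the cluster_id vector presumably lives inside create_policy_data() and is read by the helper functions shown later via lexical scoping:

    ### Sketch of step 1: allocate a cluster id to each policy in proportion
    ### to an overall cluster mix for the book (illustrative proportions only)
    n_policy <- 1e6;

    cluster_prop <- c(c0 = 0.25, c1 = 0.20, c2 = 0.15,
                      c3 = 0.15, c4 = 0.10, c5 = 0.15);

    cluster_id <- sample(names(cluster_prop), n_policy,
                         prob = cluster_prop, replace = TRUE);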

Each cluster has its own distribution of policy types: the proportions of policy types in the book conditional on the policy being owned by a household within the given cluster. We sample the policy type for each policy from this conditional distribution.

 > cluster_product_mapping_dt[cluster_level == 'n6',
       .(cluster_id, protection, pension, savings, arf)]
    cluster_id protection pension savings    arf
 1:         c0       0.59    0.25    0.15   0.01
 2:         c1       0.69    0.20    0.10   0.01
 3:         c2       0.59    0.10    0.30   0.01
 4:         c3       0.79    0.10    0.10   0.01
 5:         c4       0.20    0.40    0.30   0.10
 6:         c5       0.70    0.10    0.15   0.05
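
As a quick sanity check on this mapping (a small snippet using the table shown above), the product proportions within each cluster should sum to one:

    ### Each cluster's product proportions should sum to 1
    cluster_product_mapping_dt[cluster_level == 'n6',
        .(cluster_id, prop_sum = protection + pension + savings + arf)];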

Generate the Policy Data

Once we have the policy type, we can then use the selection to help generate other data for the policy, such as:

  • policy start date
  • policy premium aka 'annual premium equivalent' (APE)
  • policy is single-coverage or joint-life
  • customer date of birth (based on a sampled age of the holder at the start of the policy)
  • customer gender
  • customer smoking status
  • ... and so on.
 > product_data_dt
     prod_type age_min age_max age_mean age_sd prem_sp_prop prem_multiplier prem_min
 1:    savings      18      90       40     10          0.2               5      600
 2: protection      16      70       40     10          0.0               1      120
 3:    pension      18      65       40     10          0.3               5      600
 4:        arf      50      90       70     10          1.0              50     1000

The Data Generation Routine

I have wrapped most of this logic into a single function that allows an arbitrary number of policies to be created. The code is currently stored in a BitBucket repository; if you are interested in the finer details of this process, we can make access available upon request.

I will not detail the full routine create_policy_data() but there are a few interesting snippets from it worth discussing.3
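
One structural point worth seeing in miniature is the scoping pattern described in footnote 3: helper functions defined inside the main routine can read objects created in the enclosing function without having them passed as parameters. The following toy example (not the actual routine) illustrates the idea:

    ### Toy illustration of the lexical-scoping pattern from footnote 3
    create_policy_data_sketch <- function(n_policy) {
        cluster_id <- sample(c("c0", "c1"), n_policy, replace = TRUE);

        sample_cluster_data <- function(iter_id) {
            ### cluster_id is found via lexical scoping, not via an argument
            length(cluster_id[cluster_id == iter_id]);
        }

        sapply(c("c0", "c1"), sample_cluster_data);
    }

    create_policy_data_sketch(1000);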

There is nothing complex about any of the random data generated in the code: the age of the policy holder is drawn from a normal distribution with parameters set by the policy type, and the policy premium is drawn from a gamma distribution4.

The age of the policy is drawn from a Poisson distribution with a given mean age (10 years seems like a reasonable default). Having set the year, we then randomly choose a day within that year to get a spread of dates.

First, having allocated policies to clusters in proportion to the mix we have set, we want to sample a small area within each cluster, weighted by population size. To do that, we use the total2011 counts as our weighting values. We could just use the totals as the prob weights directly in sample(); that should work, but there are often thousands of small areas in each cluster, so I was wary of doing it that way.

Instead, I took the cumulative sum of the population counts, made a uniform random draw on $(0, N_{tot}]$ (where $N_{tot}$ is the total population in the cluster) and then chose the small area corresponding to the first element of the cumulative-sum vector that reaches the draw. The resulting small areas should then be distributed approximately in proportion to population size.

Once we have the small area, we then look up the proportions for that cluster type and draw the product type for this policy.

    #### Then we determine the small area of the policy holder
    sample_cluster_data <- function(iter_id) {
        ### First we sample from the appropriate small areas to get some data
        N <- length(cluster_id[cluster_id == iter_id]);
        use_dt <- smallarea_dt[cluster_id %in% iter_id];

        pop_cumsum <- cumsum(use_dt$total2011);

        sample_val <- sample(1:pop_cumsum[dim(use_dt)[1]], N, replace = TRUE);

        ### Find the first cumulative-sum entry that reaches the random draw
        idx <- sapply(1:N, function(iter) { min(which(sample_val[iter] <= pop_cumsum)) });

        sa_id <- use_dt[idx]$sa_id;


        ### Now we use the product data to calculate the policy type
        use_dt <- cluster_prod_dt[cluster_id %in% iter_id];
        prod_type_values <- use_dt$variable;
        prod_type_prop   <- use_dt$value;

        prod_type <- sample(prod_type_values, N, prob = prod_type_prop, replace = TRUE);

        cluster_data_dt <- data.table(cluster_id = iter_id, sa_id = sa_id, prod_type = prod_type, idx = idx);

        return(cluster_data_dt);
    }
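
As a quick check that the weighting behaves as intended (a small snippet, not part of the original routine, to be run where cluster_id, smallarea_dt and cluster_prod_dt are in scope), we can compare the sampled small-area frequencies against the population weights for one cluster:

    ### Sampled small-area frequencies should track the population weights
    sampled_dt <- sample_cluster_data("c0");

    freq_dt <- sampled_dt[, .(sampled_prop = .N / nrow(sampled_dt)), by = sa_id];

    weight_dt <- smallarea_dt[cluster_id == "c0",
                              .(sa_id, pop_prop = total2011 / sum(total2011))];

    compare_dt <- merge(freq_dt, weight_dt, by = "sa_id");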

To create some individual data for the policy, we look up the aggregated data we have set for each product type. We then use these parameters as inputs to various distributions for data we will need. For example, the age of the policy holder at the start of a policy is set by product type, and we use a normal distribution for these ages. This is largely a question of choice, but it seems intuitive that ages of policy holders for a given type would be normally distributed.

For the policy premium, it is our experience that this data is often heavy-tailed, so as I mentioned before, I chose to use a Gamma distribution. It is possible that using this data will lead me to amend or tweak the way I do this random generation.

    ### We now create some policy data
    sample_product_data <- function(iter_type) {
        use_dt <- product_data_dt[prod_type == iter_type];

        N <- dim(policy_dt[prod_type == iter_type])[1];

        policy_start_age <- round(rnorm(N, use_dt$age_mean, use_dt$age_sd), 0);
        policy_start_age <- pmax(use_dt$age_min, policy_start_age);
        policy_start_age <- pmin(use_dt$age_max, policy_start_age);

        prem_ape <- rgamma(N, shape = premape_shape, scale = premape_scale) * use_dt$prem_multiplier;
        prem_ape <- round(prem_ape, 2);

        prem_ape <- pmax(use_dt$prem_min, prem_ape);

        rpsp_prop <- use_dt[, c(1 - prem_sp_prop, prem_sp_prop)];
        prem_type <- sample(c("RP", "SP"), N, prob = rpsp_prop, replace = TRUE);

        prod_data_dt <- data.table(prod_type       = iter_type
                                 , prem_type       = prem_type
                                 , policy_startage = policy_start_age
                                 , prem_ape        = prem_ape
                                   );
        return(prod_data_dt);
    }

Having determined the age of the policy holder at the start of the policy (drawn from the normal distribution set by the product type), along with the policy age (drawn from the Poisson distribution) and hence a random start date for the policy, we can then determine the date of birth of the policy holder by subtracting the age from policy_startdate. This seems a little convoluted, but it makes the input data much more intuitive: it separates the age at which the policy holder bought/started the policy from the age at the time of the data snapshot. The age at policy start is inevitably correlated with product type.

The date arithmetic looks like:

    ### Create some datetime values
    policy_dt[, policy_age := rpois(.N, policyage_mean)];

    day_modifier <- sample(1:365, N, replace = TRUE) - 1;
    policy_dt[, policy_startdate := data_date - (365 * policy_age)];
    policy_dt[, policy_startdate := policy_startdate - day_modifier];


    day_modifier <- sample(1:365, N, replace = TRUE) - 1;
    policy_dt[, dob_life1 := policy_startdate - (365 * policy_startage)];
    policy_dt[, dob_life1 := dob_life1 - day_modifier];

Finally, we may wish to convert some of the character vectors to factors to help reduce memory usage. If memory is not a problem, we can skip this step.
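
A minimal sketch of that conversion, using column names taken from the output shown below:

    ### Convert selected character columns to factors to reduce memory usage
    factor_cols <- c("prod_type", "prem_type", "gender_life1",
                     "smoker_life1", "mortgage_status");

    policy_dt[, (factor_cols) := lapply(.SD, as.factor), .SDcols = factor_cols];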

At the end of this we have an example book of policies that we can use as inputs to some visualisation apps we have been working on.

Sample output of the routine is shown below:

 > policy_dt
           policy_id             countyname            edname       nuts3name         sa_id cluster_id  prod_type prem_type prem_ape policy_startdate
       1: C010000082            Dublin City      North Dock C          Dublin    A268109008      c0 protection        RP  1331.92       2007-04-25
       2: C010000152 Dún Laoghaire-Rathdown   Dundrum-Balally          Dublin A267078011/02      c0    pension        SP  4570.55       2004-10-03
       3: C010000308         Wexford County Enniscorthy Rural South-East (IE)    A247045009      c0 protection        RP   602.66       2003-12-17
       4: C010000312           Meath County       Julianstown        Mid-East    A167038017      c5 protection        RP   346.96       2005-07-27
       5: C010000509        Monaghan County    Monaghan Rural          Border    A177058021      c2 protection        RP   120.00       2006-12-14
      ---
  999996: C099999668            Dublin City        Terenure C          Dublin    A268146002      c4        arf        SP 19798.48       1997-04-24
  999997: C099999751           Clare County        Ballyglass        Mid-West    A037009015      c5 protection        RP   371.07       2007-06-18
  999998: C099999915           Clare County              Rath        Mid-West    A037136001      c2 protection        RP   240.00       2003-02-16
  999999: C099999942        Longford County     Meathas Truim         Midland    A137029001      c2 protection        RP  1469.74       2001-11-11
 1000000: C099999958           Meath County        Donaghmore        Mid-East    A167025032      c5 protection        RP  2325.90       2009-05-20
          policy_enddate policy_duration  dob_life1 gender_life1 smoker_life1 isjointlife islifeonly mortgage_status      lng     lat
       1:     2027-04-25             240 1986-01-28            M            S       FALSE       TRUE        MORTFULL -6.24876 53.3505
       2:     2096-05-17              NA 1976-05-17            M            S          NA         NA              NA -6.21507 53.2763
       3:     2023-12-17             240 1958-03-05            M            S        TRUE       TRUE        MORTDECR -6.59218 52.5122
       4:     2025-07-27             240 1978-09-23            M            N       FALSE       TRUE        MORTDECR -6.25455 53.7000
       5:     2026-12-14             240 1960-09-01            F            N       FALSE      FALSE            TERM -6.95886 54.2591
      ---
  999996:     2058-02-03              NA 1938-02-03            F            N          NA         NA              NA -6.29949 53.3103
  999997:     2027-06-18             240 1972-07-20            F            N       FALSE      FALSE        MORTDECR -8.59987 52.6863
  999998:     2018-02-16             180 1962-10-05            M            N       FALSE       TRUE        MORTDECR -9.09513 52.9568
  999999:     2021-11-11             240 1970-08-05            F            N       FALSE       TRUE        MORTFULL -7.60612 53.7061
 1000000:     2029-05-20             240 1975-05-27            M            S       FALSE       TRUE        MORTDECR -6.38120 53.5130

With a little more work, we can also bolt on additional information should we require it, such as policy lapse data. That data could then feed the survival models we use to model policy lapse, and other pieces of work we have done.
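
As a very rough sketch of what bolting on lapse data might look like (the constant annual lapse rate here is purely illustrative, and the column names are new rather than part of the original routine):

    ### Sketch: add a simple lapse flag and lapse date using an illustrative
    ### constant annual lapse rate; real lapse behaviour is more structured
    annual_lapse_rate <- 0.05;

    policy_dt[, years_to_lapse := rgeom(.N, prob = annual_lapse_rate) + 1];
    policy_dt[, has_lapsed     := years_to_lapse < policy_age];
    policy_dt[has_lapsed == TRUE,
              lapse_date := policy_startdate + (365 * years_to_lapse)];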


Data Validation and Verification

An important but often-overlooked aspect of this process is checking that the data generated makes sense. If we end up with a 'fake' dataset that does not look realistic, it is of no use to anyone.

The diagnostics and other checks we choose to run are somewhat arbitrary, but much like the modelling process, simple, high-level visualisations are sufficient for the moment. Once we are happy with the data we can start using it for whatever purposes we need, and then check that those outputs also make sense. An iterative approach is probably best.5

I will not go into a lot of detail here, but will perform a few diagnostics and leave the rest to the imagination of the reader. There are no hard and fast rules; I would suggest first checking the columns you are particularly interested in.

Column: Policy Startdate (policy_startdate)

We first look at the years for the policy_startdate, creating a histogram and checking that what we see is as expected.

qplot(format(policy_startdate, "%Y"), data = policy_dt, geom = 'histogram', xlab = 'Policy Start Year', ylab = 'Policy Count')  

[Figure: policy_year_plot.png - histogram of policy start years]

One thing does stand out to me: I am not sure my choice of using a Poisson process for the policy age is the right one.

As discussed, for simplicity we are ignoring whether or not the policy is still in force, and so this choice seems flawed. Were we only looking at in-force policies we might expect to see a bell curve, but absent that, it is probably better to use some kind of uniform distribution with the mean age somewhere around the middle of the range. That way, we can assume that new business is added at a roughly consistent rate.
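
If we did make that change, it could be as small as swapping the draw, for example (a sketch using the same policyage_mean as in the original code):

    ### Possible alternative: draw the policy age from a uniform distribution
    ### with the mean in the middle of the range
    policy_dt[, policy_age := sample(0:(2 * policyage_mean), .N, replace = TRUE)];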

For further levels of detail, it may be desirable to have different distributions for different periods of time, allowing for macroeconomic changes in the business environment likely to affect the levels of new business.

We will also check the distribution of months in the column:

qplot(format(policy_startdate, "%m"), data = policy_dt, geom = 'histogram', xlab = 'Policy Start Month', ylab = 'Policy Count')  

[Figure: policy_month_plot.png - histogram of policy start months]

The distribution of months seems uniform, so that is as expected.

Column: Policy Premium APE (prem_ape)

Annual Premium Equivalent (APE) is a standard measure of premium size in a policy. For regular premium (RP) policies it is the total annual premium paid, and is 10% of the amount paid for a single premium (SP) policy:

qplot(prem_type, prem_ape, data = policy_dt, geom = 'boxplot') + facet_grid(prod_type ~ ., scales = 'free') + coord_flip() + scale_y_continuous(label = dollar);  

[Figure: premape_boxplot.png - boxplots of APE by premium type, faceted by product type]

The long tail is plain to see in this plot, and the faceting on the plots suggests that we have managed to capture the different mixtures of premia across the different products. It is worth remembering the naive method used to generate these fake premia values: real data is likely to have subtleties not (yet) included here.
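
As a small aside on the APE definition above, the implied raw premium can be recovered directly from the generated columns (a brief sketch; premium_implied is a new, illustrative column name):

    ### Implied raw premium from the APE: regular premiums equal the APE,
    ### while a single premium is ten times its APE
    policy_dt[, premium_implied := ifelse(prem_type == "SP", prem_ape / 0.10, prem_ape)];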


Summary

So, using 150 lines or so of R code, we can produce a book of policies of arbitrary size. The data is in a standard format, looks reasonably realistic, and is at a point where we can start using it in our other work.

Whilst relatively naive and blunt in terms of the data generation, the hierarchical nature of our generation process means we already have some depth in our data, and it lends itself well to extension for more sophisticated work. To be truly confident in the output we will need to use it, and iterate our process based on the outputs of those uses.



  1. In many cases, we find businesses have neither the technical capability nor the legal framework to allow data to leave their network in any form: as a verbatim copy, in aggregated form, or even fully anonymised. This conservative approach is certainly safe in the near term, but looking into the future, it can be self-defeating. Competitors more willing to navigate the technical and legal frameworks are better placed to make use of scalable cloud-based infrastructure (e.g. Amazon Web Services or Microsoft Azure) and out-compete via a massively increased capability for advanced data analysis.

  2. The main post image above is also taken from Blazing Saddles and their fake town of Rockridge, which helpfully fools the bad guys.

  3. As much as possible I try to avoid making copies of large data.table structures in memory in R, so to facilitate that I often take advantage of lexical scoping in R. By embedding function definitions inside other function definitions, you can use scoping and closures to avoid making copies of your data. The two function definitions sample_cluster_data() and sample_product_data() are defined inside the definition of create_policy_data(). This allows us to access data objects created during the flow control of create_policy_data() from the other two functions without requiring the overhead of passing the objects as parameters.

  4. A lognormal distribution would also work, but I figured a Gamma distribution is better at capturing the longer tail.

  5. Isn't it always?

Mick Cooney

Mick is highly experienced in probabilistic programming, high performance computing and financial modelling for derivatives analysis and volatility trading.