Practical data science projects often include an aspect of anonymisation to carefully remove sensitive information prior to analysis; here we demonstrate several complimentary techniques and principles.

Hardly a week goes by without news reports of a security breach at a large organisation and the stealing of sensitive data, recently including Sony Entertainment, Lloyds Bank, and repeated breaches at Talk Talk. The security breaches are often not very sophisticated, and can reveal an amateurish level of security practice within the victim organisations.

In the insurance world, we saw a recent hack into US health insurer Excellus in which attackers gained access to ~10M customer records containing credit card numbers, social security numbers and worryingly, medical histories. Such cyber attacks have been estimated to cost the US health system $6B annually. Having said that, it's important that we distinguish such data security from our topic here of data anonymisation. • Data security generally preserves information content but for a limited audience only; it involves encryption, reduced attack surfaces, operational practices, limits on data-gathering etc. • Data anonymisation generally removes information content, such that the data can be seen safely by a wider audience; it includes removal, obfuscation and aggregation. Common to both data security and anonymisation is the underlying principle that the information present in data should be available only to the appropriate level of usage by the intended audience. A large proportion of the information stored about our daily personal and business activities is actually quite sensitive, including: personally identifying personal attributes commercially sensitive name gender client & partner companies home & work address ethnicity projects & products telephone number age sales & commercial rates email addresses marital status customer details social media handles education & profession employee details Clearly such information can be used to learn a lot about the behaviour of people and companies, and there are legal and ethical aspects to carefully consider before using any of it, in any form. On the technical side, we expect to deal with data security and data anonymisation on a daily basis: • usually our clients have well-prepared processes and procedures to ensure data security, but • we often find during the planning and data engineering stages of the SPEACS process that we need to anonymise data extracts in order to create analyses & models that use only the appropriate level of information. Lets discuss a few approaches: Anonymising Structured Data Assuming that the original data is contained within a well-managed database, the process of extracting data for analysis and anonymising particular fields is technically quite straightforward, but inevitably affects the downstream analysis. Feature removal, obfuscation and reduced precision (preserve records) We seek to remove, obfuscate or reduce the precision of specific fields in such a way that the original information is not part of the copied dataset. For example, imagine we want to make use of customer information in this (fictional) life policy database: table: customers uniqid | first | last | dob | gndr | smkr | mrtlstat | postcode ---------------------------------------------------------------------- aa0001 | bob | smith | 19810123 | M | N | single | EC1V 3EE af0002 | mark | jones | 19460802 | M | S | widowed | N8 7TA cx0003 | anne | clark | 19581105 | F | N | married | SW1W 9SR ...  At a minimum, good practice will remove direct identifiers - the names in first and last, and the table key uniqid. We should replace uniqid in a deterministic way on this primary key and wherever it appears as a foreign key such that we can continue to join tables. It's also good practice to reduce the precision of the postcode and dob (date of birth) fields to the minimum useful for analysis; perhaps removing the latter part of the postcode, and the day and month from the dob. This helps to prevent directed searches, and may be a simple way of making analyses more generalisable. The extracted, record-level anonymised data could look like: table: extract_customers newuniqid | yob | gndr | smkr | mrtlstat | postcode ------------------------------------------------------ zah1034c | 1981 | M | N | single | EC1V jgo3332d | 1946 | M | S | widowed | N8 nnw9298x | 1958 | F | N | married | SW1W ...  In fact, for continuous features like age, salary, sum insured, premium etc, individuals will lie somewhere on a distribution of values and if they happen to be outliers then this too can be a method of identification. For example a life policy may have a sum insured of 10MM GBP, which is likely towards the upper tail of the distribution. We could 'hide' this, and other such outliers by: • constraining the distribution to a maximum such that any values larger than, say, 5MM GBP are limited down to 5MM GBP. The downside is we lose the true distribution which may be important for modelling. • quantising the distribution into discrete steps, thus the 10MM GBP value may get placed into a categorical band between say, 5MM to 10MM. This again loses fidelity. Record aggregation (preserve attributes) Some information is sufficiently sensitive to warrant aggregating several records together in order to prevent so-called 'triangulation', whereby it's possible to identify an individual based on seemingly-unrelated attributes.1 Taking the example above, the record for Mark Jones could still potentially be identified if we know that he is a widower, born in 1946 and smokes. This combination of attributes may also be rare within the dataset, making a search for his record easier to narrow down. We previously mentioned a project involving our use of the 2011 Irish Census. The Census naturally collects some very sensitive socioeconomic data per household including ethnicity, education, employment, health etc. When the Irish Central Statistics Office (CSO) release data back to the public it is strictly in aggregated form only, providing counts summed over several hundred households and preventing triangulation. Carefully consider the analyses you wish to perform, the questions you seek to answer, and consider whether record aggregation is feasible. In our Census study, the aggregation actually helped the analysis by reducing the volume of data and smoothing out the variability of values within. The UK Data Archive also has a good guide on structured data anonymisation. Anonymising Unstructured Data (Pattern Matching) Freetext is everywhere and with today's sophisticated text mining & modelling algorithms it's very tempting to try to use text analysis to gain competitive advantages in everything from public brand engagement, to customer profiling, to recruitment, to improving corporate communications and much more. However, by it's nature, freetext can potentially contain any manner of sensitive information which we often want to remain secure and excluded from analysis. This part of the blogpost was inspired by some of our prior project work on communications analysis & customer insight. It was written in collaboration with Jonny Edwards - one of our associate data scientists - who has authored an open-source Python package called Snow which implements some of the following techniques for text anonymisation. Regular Expressions Words are just patterns of characters and therefore just patterns of bytes, and computers are great at matching bytes. The established tool for such text-based pattern matching is regular expressions (regex). The regex approach works very well when we seek to anonymise an item that obeys a conventional or rules-based structure e.g: • email address • phone number • postcode • customer ID • medical ID For example, if we wanted to remove all email addresses from a lowercased ascii document, we would need to: 1. create a generic pattern for an email address, and 2. test each word in the document against that pattern, and filter if there is a match The full email address specification is rather complex and to build a real email regex is actually quite tricky, but lets try to build a very basic pattern here to illustrate. Assume the basic email address pattern is localname@domainname.tld where: • there's always a single @ sign in the middle • the localname can be alphanumerics plus a few symbols • the domainname must start & end with an alphanumeric, may contain hyphens, and it must be less than 63 characters long • the tld (the Top Level Domain) is separated by a fullstop and similarly must start & end with alphanumerics, may contain hyphens and must be fewer than 63 characters long. The regex mini-language has a reputation for being difficult to write, but if you follow some of the many available guides you can't go too far wrong. There's even a handful of online tools like RegExr and regex101 which seem especially handy for one-off testing outside of your codebase. Accordingly, this very basic (and undoubtedly insufficient) email address regex would contain: • localname: [a-z0-9!#$%&'*+-/=?^_{}|~]+
• @: @
• domainname: [a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?
• tld: (\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)?

Python has a great implementation of regex, similar to that of Perl, and so our basic email replacer code could look like:

import re

txt = 'this is a fragment of an email sent from alice@mydomain.com
to bob@yourdomain.org, which we want to anonymise'

# spec the regex patterns for each part
localname = "[a-z0-9!#$%&'*+-/=?^_{}|~]+" domainname = "[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?" tld = "(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)?" # compile the regex pattern rx_email = re.compile(localname + '@' + domainname + tld) # apply the regex to the text, replacing matches with '---' rx_email.sub('---', txt) > 'this is a fragment of an email sent from --- to ---, which we want to anonymise'  Next, lets consider how to efficiently anonymise text that doesn't necessarily obey a fixed pattern. Anonymising Unstructured Data (Specific Filtering) This is something more akin to a brute force approach: we create a 'blacklist' of words to be removed, and test each word in the document, removing it if it appears in the blacklist. Such words to exclude via a blacklist might be: • names of persons • names of places and addresses • names of companies and projects • names of websites This approach relies on well-maintained blacklists, and they can grow quite large. Alternatively, we could create a 'whitelist' of permissible-only words, excluding all words from the document which don't appear in the list. For example, our whitelist could be comprised of the N most common non-name words in the English language and could be much smaller. Hash Tables As the lists of words grows bigger, the computation required for searching becomes unworkably slow: in an unsorted list the search time scales linearly with size O(n). Even if we pre-sort the list, which itself takes some time, O(n log n), a binary search on a sorted list can be expected to take O(log n) time.2 A standard solution is to create a hash table, the most basic version of which is where the filterlist words are hashed (mapped to a constant size) and stored in memory at a location according to the hash value. Now when we test for membership of the set, all we need to do is hash the query word and look up the resulting memory location in the set. This takes constant O(1) time, i.e. the lookup time is independent of the size of the set. A downside of this approach is that the hash tables can start to get large. They store all the original information, plus an overhead to avoid hash collisions. We can demonstrate this easily in Python, using the words dataset that's present in most Unix systems: # NOTE: using Python 3.4 on Mac OSX 10.10.5 # NOTE: using pympler https://github.com/pympler/pympler from pympler import asizeof # import Unix words list of common English-language words with open('/usr/share/dict/words','r') as f: wrds = ['{}'.format(x.strip().lower()) for x in f.readlines()] # how many words in the list? print(len(wrds)) > 235886 # quick look at some of the words in list print(wrds[50::10000]) > ['abandoner', 'antidiastase', 'beerocracy', 'capriciously' , 'commissively', 'deformability', 'el', 'fingal' , 'grimalkin', 'hypomanic', 'jillflirt', 'magnesic' , 'mousetrap', 'offense', 'passeriform', 'polychromate' , 'punctuate', 'ringite', 'shellflower', 'stola' , 'tent', 'tyrannoid', 'unoratorial', 'waggishly'] # how much memory does the list consume in bytes? asizeof.asizeof(wrds) > 16797448  Our list of words consumes$16.8 MB$, which is not huge, though already quite unusable for putting into the code for, say, a webpage. What about a hash table? # how much memory does the set (hash table) consume in bytes? asizeof.asizeof(set(wrds)) > 22982240  As expected, the set (hash table) consumes more memory, in fact at$23.0MB$it has an additional$6.2MB = 37\%$more due to overhead. For a whitelist, this large memory footprint might not become too much of a problem, but for a blacklist it might grow beyond a reasonable size. Perhaps a hash table isn't enough? Lets change tack for a minute and imagine that we run a URL shortening service like Bitly or TinyURL. Feed these services with a normal URL and they issue a new, shortened URL which directs to their webservers, e.g. this link http://bit.ly/215phZe makes a web request to Bitly which then redirects to our main website at http://applied.ai. The benefit to the user is a shorter, easier to type URL (although not perhaps in the case of our website!) and the benefit to the URL shortening service is they get to directly measure the traffic to websites. According to the fail page copyrighted 2014, Bitly links are apparently 4 - 6 case-sensitive alphanumeric characters (though clearly my 2015-generated link above is 7 chars long). If our hypothetical system used the same initial scheme and allowed 6 case-sensitive alphanumerics, the potential search space for an arbitrary short URL is$(10 + (26 * 2))^6 = 57e9$items. Bitly purports to have shortened ~25 billion$(25e9)$links to date, so if our system was say, half as successful, that means$1-\frac{12e9}{57e9} \approx 79\%$of the search space would be empty. If a user types a short URL which is technically valid but has actually never been issued, this could cause a wasteful lookup in the database, costing time and energy. Much better to run the requested short URL though a filter on the webserver beforehand to test if it has been issued, and only then make the expensive call to the database to return the full normal length URL. However, a hash table with 57 billion entries is likely to have a rather large memory footprint: ascii is 1 byte per char, so$1 * 7 * 57e9 \approx 399 GB$before overhead. This is way too large to load into live memory on a webserver. Bloom Filters The URL-shortening example above is totally contrived and I've no idea if it's actually used by those companies, however, the principle is in use by several companies and products to speed up database and cache lookups. How? The trick is to use a Bloom filter, rather than a conventional hash table. A Bloom filter is a space-efficient probabilistic data structure, similar in principle to a hash table but far more compact. We supply the filter with strings and then test new strings against it. If a test string has already been added to the filter, then we are guaranteed to get a match. If a test string has never been seen before, then we may get a match - a false positive - with a tuneable probability,$\epsilon$. We can tune the Bloom filter using the formula: $$\epsilon \approx (1 - e ^{\frac{-kn}{m}})^k$$ where: n is the number of items to be stored in the filter m is the length of the bitarray k is the number of hashes each item is put through  As$n$increases, so does the probability of a false positive, since the filter becomes full and more items test positive. As$m$increases, there is more room in the filter to store hashes, so$\epsilon$reduces. There is a conventional setting for$k = \frac{m}{n} ln(2)$Numerically, we can estimate the storage space required per key as: $$1.44 log_{2}(\frac{1}{\epsilon})$$ Thus we could create a Bloom filter for our URL-shortening example above, tuned with a false positive rate$\epsilon$of 1%. This would theoretically require$1.44 log_{2}(1/0.01) \approx 9.6 bits \approx 1.2 bytes$per item, a total of$1.2 * 57e9 \approx 68 GB$of memory: only 17% of the estimated space required for an equivalent hash table. Possibly this would be more usable in day to day system operations. How do Bloom filters work? In more detail, a Bloom filter accepts an item to be stored, hashes it a certain number of times in different ways, and records the resulting hashes to an empty bitarray (full of zeros) of fixed size. The next item is also put through the same hashing and the hashes are added to the same bitarray. If the hashes partially collide with the existing bitarray, this is okay and the bitarray simply starts to become full of ones. For example, lets imagine we're really bad at shopping and need to use a Bloom filter to determine if we should buy an item. Our shopping list is for apples, bananas and carrots: import numpy as np def really_bad_demo_hasher(s): """ Terrible implementation of a hash function to produce a bitarray from utf-8 string. For explanation purposes only. """ s_ords = sum([ord(t) for t in s]) # convert string to integer s_mod = s_ords % 10 # take mod 10 s_arr = np.zeros(10) # create empty bitarray s_arr[s_mod] = 1 # set bitarray according to string return s_arr # fill Bloom filter with a short shopping list of fruit & veg bf = np.zeros(10) for item in ['apples','bananas','carrots']: item_bitarr = really_bad_demo_hasher(item) print('{} {}'.format(item_bitarr,item)) bf += item_bitarr print('\n{} {}'.format(bf,'total')) > [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] apples [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] bananas [ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] carrots [ 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.] total  Here you can see each item in the list of fruits and vegetables gets hashed once$k=1$and added to the Bloom filter bf, which starts to fill with ones. Now we'll walk around the supermarket, testing the items we find. If the Bloom filter responds with a match, then we should buy the item: def test_bloomfilter(bf,s): """ Test if string s in Bloomfilter bf """ if min( bf - really_bad_demo_hasher(s)) <0: return 'Nope, no {} today'.format(s) else: return 'Yes, {} are on the list, buy them!'.format(s) for item in ['oranges','bananas','eggplants']: print(test_bloomfilter(bf,item)) > Nope, no oranges today Yes, bananas are on the list, buy them! Yes, eggplants are on the list, buy them!  Okay, we didn't want oranges and we did want bananas, good so far. However we seem to be about to buy eggplants, which definitely wasn't on the list! This false positive is a direct result of the way that Bloom filters work, in this case the hash for 'eggplants' looks the same as that for 'apples', which is on the list: for item in ['apples','eggplants']: item_bitarr = really_bad_demo_hasher(item) print('{} {}'.format(item_bitarr,item)) > [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] apples [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] eggplants  The simple solution to this is to increase the size of the bitarray$m$. A more realistic example Lets say we have a large document and are concerned that it might contain the names of people who should otherwise be anonymous. We want to remove these names. I'll use a big list of first and last names which I partially created using freely available lists of names from the 1990 USA Census. Ask if you'd like access. My list contains 6046 first names, and 156360 last names, totalling 162,406 items. What memory footprint can we expect to need for a conventional Bloom filter if we require a low false positive rate$\epsilon = 0.1\%$? $$nbitsperitem \approx 1.44 * log_{2}(\frac{1}{0.001}) \approx 14.35bits \approx 1.79 bytes$$ $$1.79 * 162406 \approx 291kB$$ Lets check what we get when using the pybloom package: from pybloom import BloomFilter bf = BloomFilter(capacity=len(names), error_rate=0.001) print('Filter consumes approx {:,.0f} bytes'.format(bf.num_bits/8)) > Filter consumes approx 291,876 bytes # how big is the hash table (bytes)? asizeof.asizeof(set(names)) > 13673544  The Python pybloom implementation appears to be pretty close to the theoretical size, and is a mere fraction of the size of the hash table:$\frac{292kB}{13674kB} \approx 2.14\%$To complete the comparison, let's take a look at the runtime required to parse through some real text: import requests # get real-world text document "The Adventures of Sherlock Holmes" sherlock = requests.get('https://www.gutenberg.org/cache/epub/1661/pg1661.txt').text # initialise Bloom filter bf = BloomFilter(capacity=len(names), error_rate=0.001) print('Filter setup with {:.1e} bits, capacity for {} items'.format( bf.num_bits, bf.capacity)) > Filter setup with 2.3e+06 bits, capacity for 162406 items # populate bloomfilter and hash table _ = [bf.add(w) for w in names] setnames = set(names) # time duration for applying hash table to Sherlock text %timeit ['REM' if t.lower() in setnames else t for t in sherlock.split()] > 10 loops, best of 3: 34.8 ms per loop # time duration for applying bloomfilter on Sherlock text %timeit ['REM' if t.lower() in bf else t for t in sherlock.split()] > 1 loops, best of 3: 380 ms per loop  Here we see that the Bloom filter takes 380ms to filter the text - approx 11x slower than the hash table - likely due to the number of hash operations$k$required to test each item in the Bloom filter. Clearly there's a trade-off to make between memory footprint and execution time. For distributed data processing in particular, it may well be that Bloom filters offer a great advantage for text anonymisation. Summary There's a lot of detail in data anonymisation. • If the data is structured then we can apply certain techniques to either preserve records or preserve features, while making it hard to identify individuals • If the data is unstructured, we can apply pattern matching and filtering to remove sensitive information. All throughout, there's tradeoffs to make which affect the downstream analysis and the audience that may see such analyses. Key to all is the principle that we only ever store, access and use the appropriate level of information for the task at hand. 1. To make a very concrete example, the Electronic Freedom Foundation discovered in 2010 that approx. 8 in 10 internet browsers are uniquely identifiable amongst all other browsers in the world. Each browser has a 'fingerprint' based on the disparate combination of details given to a webserver when it requests a page. Though no single attribute is especially informative - for instance, which operating system you're using, or your language, or the screen size - they provide a lot of information when used in combination. I just tested my browser and it has a fingerprint that is apparently unique amongst the 6M tested so far. The fingerprint apparently conveys approx. 23 bits of identifying information, which technically is sufficient to distinguish between my browser and up to$2^{23} = 8.4M\$ others.

2. Such 'Big-O' notation is a standard way to indicate the running time of an algorithm, with respect to the number of items processed, n.