Geocoding Postal Addresses for Improved Statistical Analyses

Converting postal addresses into geospatial lat/lon coordinates - aka geocoding - is cheaper and more accessible than you might imagine, and enables powerful statistical analyses.

Almost every consumer and product-based business will have a database of home and/or business postal addresses for customers, suppliers, employees etc., and may even use addresses within its software products. Addresses are important for humans, but are a poorly designed feature of our communities and shared histories. Because they are freeform, inconsistent, and often flat-out wrong, addresses are hard to use in statistical modelling beyond e.g. simple parsing to extract various regional labels. Even then, such work is fraught with headaches due to typos, incorrect values and weird structuring.

There are actually many geocoding systems for addresses depending on usage, including political administration taxonomies, airports, telephony, geodesy and postal systems. We are currently most interested in converting postal addresses into latitude and longitude coordinates - a process known as 'geocoding'.

Geocoded locations are amenable to data analysis

A lat/lon coordinate can be placed on a grid, which allows all sorts of spatial analysis. To take a few simple examples from business lines in insurance and reinsurance, geocoded addresses are useful for:

  • Property Flood Insurance: If you know the precise location of the properties insured, you could also get topographical data and combine both to determine the likelihood of flooding in the event of bad weather or burst riverbanks.[1]

  • Healthcare Claims: In critical health situations, time to treatment is one of the most important determinants of a successful outcome. Suppose you are insuring or reinsuring professional indemnification claims at various healthcare institutions. Knowing the distance of each institution to another, along with a list of services at each location could be very important in managing that risk and potentially improving response policies for those institutions and individuals you insure.

  • Marketing New Products: Location-based marketing is common in the retail, telco and pharma industries: directing marketing efforts towards the physical locations where your target customer (according to socio-economic indicators) tends to reside. We recently created such a service for the Republic of Ireland (ROI) based on the 2011 Census and related data, and will likely talk about this in a future blogpost.

[Image: Applied AI Area Knowledge App]
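As a small illustration of the kind of spatial calculation these use cases rely on, here is a minimal sketch of the haversine formula for the great-circle distance between two lat/lon points. The function name and the coordinates in the usage line are illustrative assumptions, and the formula treats the Earth as a sphere, so it is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in km between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only (roughly Dublin and Cork city centres).
print(round(haversine_km(53.3498, -6.2603, 51.8985, -8.4756), 1), "km")
```

For road travel times, as in the healthcare example, straight-line distance is only a first approximation; a routing engine would be needed for anything more precise.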


Let's run through an example using high quality open-source geocoding software

Geocoding software has been around for a long time, and several commercial offerings exist for online & offline, singular & batch processing. Google has a free geocoding API you can easily try in this online demo, although it's limited to 5 requests/sec and 2,500 total requests per 24hr period: enough for a trial but too limited for real usage.

OpenStreetMap also has a geocoding service, and for bulk amounts they offer a free, open-source system called Nominatim that is the subject of this article.

Getting Started

Nominatim is available at http://wiki.openstreetmap.org/wiki/Nominatim, with thorough documentation covering installation, system requirements, API references etc.

The documentation lists somewhat onerous hard disk and memory requirements, but since I was only geocoding addresses in Ireland, I installed a version on a local virtual machine with 2GB of RAM and about 80GB of storage space. I even managed to squeeze an installation onto a free tier machine on Amazon Web Services (AWS).[2]

Nominatim runs off a platform of PHP and PostgreSQL with the PostGIS extension, and I used Apache as the webserver (instructions for using nginx are also provided). I am not going to go into the installation details here; they are all on the website and are pretty easy to follow. As always, be ready for a few false starts and do not be afraid to start again if you need to.

As an aside, my focus in this work was entirely on addresses located within ROI, where postal addresses do not have postcodes - making geocoding more difficult. The instructions for Nominatim do mention additional data that can be added to the system to improve the outputs, such as postcodes in Great Britain and TIGER address data for the US.

Forward Geocoding

For the purposes of this article I will assume that Nominatim is available at the root url http://localhost/nominatim but this is easily configured to use other domain names and can be accessed over the internet if you wish.

To geocode an address, known as forward geocoding, you simply request it as follows:

http://localhost/nominatim/search.php?q=[mypostaladdress]&format=json

... and Nominatim will return a JSON-formatted response with various details of the address you passed (if found), the most important of which is the latitude and longitude. If there are multiple results, Nominatim returns each result as a separate element in a JSON array and will probably require further parsing.
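A minimal Python sketch of building and sending such a request, using only the standard library, might look like this. The function names are my own, and the base URL is the assumed local installation from above; adjust both to suit your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL: the local installation described above.
BASE_URL = "http://localhost/nominatim"


def build_search_url(address, base_url=BASE_URL):
    """Return the forward-geocoding URL for a free-form postal address."""
    # urlencode escapes spaces, commas and non-ASCII characters safely.
    query = urlencode({"q": address, "format": "json"})
    return f"{base_url}/search.php?{query}"


def geocode(address, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON (a list of results)."""
    with urlopen(build_search_url(address, base_url), timeout=timeout) as resp:
        return json.load(resp)


# Building the URL needs no running server; geocode() does.
print(build_search_url("135 pilkington avenue, birmingham"))
```

Encoding the address with `urlencode` rather than pasting it into the URL by hand avoids problems with commas, accented characters and ampersands in real address data.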

It is also worth noting here that there does not appear to be a standard form of output from geocoding systems. Google's Geocoder API returns quite a different JSON structure, so if you plan to use multiple geocoding systems, you'll need different handlers for the output.

If the attempt was not successful, meaning the geocoder failed to match the address, Nominatim returns an empty JSON array []. To quote an example given on the Nominatim site, try the postal address:

http://nominatim.openstreetmap.org/search?q=135+pilkington+avenue,+birmingham&format=json

The formatted[3] return string looks like:

{
    "place_id":"73723099",
    "licence":"Data © OpenStreetMap contributors, ODbL 1.0. http:\/\/www.openstreetmap.org\/copyright",
    "osm_type":"way",
    "osm_id":"90394480",
    "boundingbox":[
      "52.5487473",
      "52.5488481",
      "-1.8165129",
      "-1.8163463"
    ],
    "lat":"52.5487921",
    "lon":"-1.8164307339635",
    "display_name":"135, Pilkington Avenue, Castle Vale, Maney, Birmingham, West Midlands, England, B72 1LH, United Kingdom",
    "class":"building",
    "type":"yes",
    "importance":0.301
}

As you can see from the JSON, pulling out the latitude (lat) and longitude (lon) is just a matter of parsing the JSON object.
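As a rough sketch of that parsing step in Python: the function name is hypothetical, and taking the first element of a multi-result array is an assumption that you may want to replace with your own selection rule.

```python
import json


def extract_coordinates(response_text):
    """Return (lat, lon) as floats from a search response, or None."""
    results = json.loads(response_text)
    # Accept a single object as well as the usual JSON array of results.
    if isinstance(results, dict):
        results = [results]
    if not results:
        return None  # empty array: the geocoder found no match
    best = results[0]  # assumption: simply take the first result
    # Coordinates arrive as strings, so convert them explicitly.
    return float(best["lat"]), float(best["lon"])


sample = '[{"lat": "52.5487921", "lon": "-1.8164307339635"}]'
print(extract_coordinates(sample))
print(extract_coordinates("[]"))
```

Returning `None` for an empty array keeps a failed match distinguishable from a real coordinate in later analysis.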

Reverse Geocoding

Nominatim also provides the ability to reverse-geocode, converting lat/lon coordinates to a postal address. This can be very useful, for example, to convert GPS coordinates and areas into human-readable addresses and/or landmarks.

Again, using examples provided in the Nominatim documentation, we perform the following request:

http://nominatim.openstreetmap.org/reverse?format=json&lat=52.5487429714954&lon=-1.81602098644987

and get the result:

{
    "place_id":"73626440",
    "licence":"Data © OpenStreetMap contributors, ODbL 1.0. http:\/\/www.openstreetmap.org\/copyright",
    "osm_type":"way",
    "osm_id":"90394420",
    "lat":"52.54877605",
    "lon":"-1.81627023283164",
    "display_name":"137, Pilkington Avenue, Castle Vale, Maney, Birmingham, West Midlands, England, B72 1LH, United Kingdom",
    "address":{
        "house_number":"137",
        "road":"Pilkington Avenue",
        "suburb":"Castle Vale",
        "hamlet":"Maney",
        "city":"Birmingham",
        "state_district":"West Midlands",
        "state":"England",
        "postcode":"B72 1LH",
        "country":"United Kingdom",
        "country_code":"gb"
    }
}

We would then parse this response to obtain the postal address.
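A minimal sketch of that parsing in Python follows. The choice and order of address fields is my own assumption; the keys present in the address object vary from place to place, so missing keys are simply skipped.

```python
import json

# Assumed selection of fields, in the order they should be printed.
ADDRESS_KEYS = ("house_number", "road", "suburb", "city", "postcode", "country")


def format_address(response_text, keys=ADDRESS_KEYS):
    """Build a one-line postal address from a reverse-geocoding response."""
    data = json.loads(response_text)
    # A response without an "address" object yields an empty string.
    address = data.get("address", {})
    parts = [address[key] for key in keys if key in address]
    return ", ".join(parts)


sample = json.dumps({
    "address": {
        "house_number": "137",
        "road": "Pilkington Avenue",
        "city": "Birmingham",
        "postcode": "B72 1LH",
        "country": "United Kingdom",
        "country_code": "gb",
    }
})
print(format_address(sample))
```

If you only need a display string, the display_name field in the response may be enough; the structured address object is more useful when you want individual components as separate columns.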


Geocoding the Property Price Register

After installation and several hours of processing - even for the relatively small Irish data set - I had a geocoder that responded to queries and returned good results. All I needed now was a set of addresses from somewhere to geocode...

As mentioned before, one of the big issues with handling address data is that it is often a mess, and I wanted a set of addresses that were not totally pathological so I could properly assess how good Nominatim is, and what sort of coverage we could expect from a given set of addresses.

Another potential concern is the likelihood of false results, an issue I experienced with Google's geocoder, which gave the wrong result from time to time: it matched words from different parts of the postal address, resulting in coordinates on the wrong side of Dublin City. Dublin addresses appeared to be more prone to this, presumably because they contain more words; words in estate names were matched with street names, for example.

Since these false positives become quite a pain to catch after the fact, I would prefer Nominatim to try, but not too hard. At least a blank answer is a known negative result, much less likely to interfere with subsequent analysis. Thankfully, this is what seemed to be the case, but more on that later.

So, to get a reasonably sized list of addresses, I turned to the Irish government's online register of all property transactions since 2010, which is available to the public. To access the data, visit

https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/page/ppr-home-en

There is a section that asks you to verify the T&Cs and then provides a menu to download the data - one dataset for each year.

I used the Property Price Register as most addresses are likely to be reasonably well formatted since they are part of a legal transfer of deed. Ambiguity is not good in this situation. Also, property prices are never far from the mind, so doing something with this might be of interest for future work.

For now though, the download gives us one CSV file per year, each encoded using cp1252. We can load all this data into a single data.table (if using R) or pandas DataFrame (if using Python) and then geocode each address in the table. We can then save the resulting JSON as an additional column in the table for future use.

I will not show any code for this, since it's very simple to do. Instead, I will simply point you towards the direction I went and offer any help I can give if you have issues.
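That said, for readers who want a starting point, a minimal sketch of the loading step could look like the following. This is not the code used for this article: it uses Python's standard csv module rather than pandas to stay dependency-free, and the directory name and file pattern are hypothetical.

```python
import csv
import glob


def load_ppr_rows(pattern):
    """Read every CSV file matching `pattern` into one list of dicts."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        # The files are cp1252-encoded; newline="" is what csv expects.
        with open(path, encoding="cp1252", newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Hypothetical location of the downloaded yearly files.
rows = load_ppr_rows("ppr_data/*.csv")
print(len(rows), "rows loaded")
```

Each row is a dict keyed by the CSV header, so the geocoder's JSON response can be stored under a new key on the same row.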

One tricky part is that it is often worth trying multiple requests for each address by using a few different combinations of parts and iterating till a non-blank response is obtained, or all attempts fail. Again, it is important not to try too hard - better a blank than something wrong or totally vague and useless.

Just to show the logic, suppose we are trying to geocode the following (utterly fictitious) address: 75 Tennis Street, Blanchardstown, Co.Dublin

One method might involve trying to geocode the following in order:

75 Tennis Street, Blanchardstown, Co.Dublin
Tennis Street, Blanchardstown, Co.Dublin
Blanchardstown, Co.Dublin

The above is already problematic: Blanchardstown is a big area, and while it will probably return a value, it may not be what you are looking for. You will likely need additional logic to ensure each query keeps at least three pieces of the address, where pieces are separated by a ',' character.
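The fallback logic just described can be sketched roughly as below. This is an illustration rather than the code used for this article: the function names are hypothetical, the house-number pattern is a simplistic assumption, and `geocode` stands for any callable that returns a (possibly empty) list of results for a query string.

```python
import re


def candidate_queries(address, min_parts=3):
    """List progressively vaguer queries, keeping at least min_parts pieces."""
    parts = [piece.strip() for piece in address.split(",") if piece.strip()]
    if not parts:
        return []
    candidates = [", ".join(parts)]
    # Second attempt: same pieces with a leading house number removed.
    street = re.sub(r"^\d+\w*\s+", "", parts[0])
    if street != parts[0]:
        candidates.append(", ".join([street] + parts[1:]))
    # Further attempts: drop leading pieces, never going below min_parts.
    for start in range(1, len(parts) - min_parts + 1):
        candidates.append(", ".join(parts[start:]))
    return candidates


def first_match(address, geocode, min_parts=3):
    """Return (query, results) for the first non-blank response, else None."""
    for query in candidate_queries(address, min_parts):
        results = geocode(query)
        if results:
            return query, results
    return None


for query in candidate_queries("75 Tennis Street, Blanchardstown, Co.Dublin"):
    print(query)
```

With the default of three pieces, the fictitious address above never falls back to the bare "Blanchardstown, Co.Dublin" query; passing `min_parts=2` would allow it.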

As you can probably already see, doing analytical work with postal addresses comes packaged with weird gotchas at no extra charge, even when well formatted!

UPDATE 2015-07-13: After we published this, I realised I never discussed how successful we were with the addresses. The PPR dataset had roughly 150,000 addresses when I worked with it, and we had a hit rate of about 80%, so not bad!

Some of those were rural addresses, and a few might be too broad an area queried, but I was extremely happy with the final results overall. Nominatim is definitely worth exploring as an option if geocoding is something you are looking to get into. Doing this for the UK or the US could possibly be better.


Next Steps

Once we have geocoded all those addresses, what do we do with them?

Geospatial modelling has recently become extremely accessible, with plenty of open-source tools available. R and Python have quite a few libraries and resources for that type of work, and the PostGIS extension to PostgreSQL is incredibly powerful. You can now do some extremely interesting things by just plugging some different bits together.

One particular application I intend to discuss at some point is how to use the geocoded addresses we just obtained, store them in PostgreSQL and use PostGIS to label each address with a customer segmentation label developed from a clustering technique used on socio-economic data.

For now, it is less exciting, but one very simple application is to add a heatmap of all the properties to the map of Ireland. The basic idea is that each property is a source of 'heat' and the colours then show how 'warm' each part of the country is. Not much more than a toy example, yes, but it does give a small taste of the interesting modelling and visualisation possibilities geocoding gives you.
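As a crude stand-in for the idea behind a heatmap, the sketch below counts properties per grid cell; a mapping library would then colour each cell by its count. The cell size and the sample coordinates are illustrative assumptions, and a proper heatmap would normally smooth the points rather than simply bin them.

```python
import math
from collections import Counter


def heat_grid(points, cell_size=0.1):
    """Count (lat, lon) points per square grid cell of cell_size degrees."""
    counts = Counter()
    for lat, lon in points:
        # floor() keeps negative longitudes in the correct cell.
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts


# Illustrative coordinates only, not taken from the PPR data.
points = [(53.34, -6.26), (53.35, -6.27), (51.93, -8.47)]
for cell, count in heat_grid(points).most_common():
    print(cell, count)
```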

[Image: PPR Heatmap]


Quick note about Eircode

This article was written and ready for publishing when we discovered that the Irish Eircode system is finally launching. This system provides a unique postcode to every property in ROI and has proven quite controversial in the design and implementation; the codes precisely identify a property, but they are non-sequential and neighbouring properties do not have neighbouring codes, preventing easy grouping and sequencing of addresses. I do not have much to say about this yet, as I am not an expert, but my initial research has left me dubious of the benefits.

I am sure this is a topic we will return to soon.


  1. Precisely evaluated flood damage protection is especially pertinent given climate change and the tendency for new estates to be built on old floodplains.

  2. It took a bit of work, especially with the memory. Contact me directly if interested.

  3. Formatting JSON to human-readable text is useful and fairly trivial; many IDEs contain formatters and there are even several online services like http://jsonformatter.curiousconcept.com/ for one-off formatting.

Mick Cooney

Mick is highly experienced in probabilistic programming, high performance computing and financial modelling for derivatives analysis and volatility trading.