How to automate your way out of Excel hell & other ONS data wrangling stories (business demography edition)

code
ons
firms
Author

Dan Olner

Published

November 5, 2024

I’ve been analysing the latest ONS business demography data (that ONS pull from the IDBR). It contains a tonne of great data on business births, deaths, numbers, ‘high growth’ firms, survival rates, down to local authority level (though sadly sector breakdowns only at national level).

Getting data out of Excel documents can be a bit extremely horrible [noting, to be clear, that Excel docs like this are super useful for many people, but just nasty for those of us who want to get the data into something like R or Python, so…]. In this case, what we’ve got is this:

So if we want consistent data across all time periods, we’ve got a couple of things to do:

  1. Get the data out of each set of four sheets into one;
  2. Harmonise the geographies so datapoints are consistent.

Luckily, the LA changes have all been to combine into larger units over time (usually unitary authorities) - so all earlier LAs fit neatly into later boundaries. Phew. This means values from earlier LAs can be summed to the larger/later ones - backcasting 2022 boundaries through all previous data.

Some anonymous angel/angels made this Wikipedia page clearly laying out when and what local authorities combined into larger unitary ones in recent years. Using that, we can piece together the changes to get to this function that does the harmonising. It groups previous LAs - that only needs to backcast 2021/2022 names once, no faffing around with each separate sheet - and then summarises counts for those new groups, for previous years’ data.

Prior to that, though, we need to pull the sheets into R. There are a lot of sheets - doing this manually would be baaad…

(Image stolen from here).