I’ve been analysing the latest ONS business demography data (that ONS pull from the IDBR). It contains a tonne of great data on business births, deaths, numbers, ‘high growth’ firms, survival rates, down to local authority level (though sadly sector breakdowns only at national level).
- My working report from that is here - hoping to add more
- Prep code is here
- Quarto code for the report is here
Getting data out of Excel documents can be a bit extremely horrible [noting, to be clear, that Excel docs like this are super useful for many people, but just nasty for those of us who want to get the data into something like R or Python, so…]. In this case, what we’ve got is this:
- For each type of data (firm births, deaths, active count, high growth count etc) there are four sheets covering different time periods, with two spanning two years and two with a single year. Why? That’s unclear until you check the geographies - the local authorities (LAs) used don’t match across sheets. Why? Because the boundaries changed, so there’s a different sheet each year they’ve changed.
So if we want consistent data across all time periods, we’ve got a couple of things to do:
- Get the data out of each set of four sheets into one;
- Harmonise the geographies so datapoints are consistent.
Luckily, the LA changes have all been to combine into larger units over time (usually unitary authorities) - so all earlier LAs fit neatly into later boundaries. Phew. This means values from earlier LAs can be summed to the larger/later ones - backcasting 2022 boundaries through all previous data.
Some anonymous angel/angels made this Wikipedia page clearly laying out when and what local authorities combined into larger unitary ones in recent years. Using that, we can piece together the changes to get to this function that does the harmonising. It groups previous LAs - that only needs to backcast 2021/2022 names once, no faffing around with each separate sheet - and then summarises counts for those new groups, for previous years’ data.
Prior to that, though, we need to pull the sheets into R. There are a lot of sheets - doing this manually would be baaad…
- The readxl package to the rescue! Part of the tidyverse, it can be used to automate extracting data from any sheet and set of cells in an Excel document. I do that in the function here, specifically for pulling out the correct cells from the ONS demography Excel. That’s used in the code here.
(Image stolen from here).