Dan Olner's Data Dispatch

Getting started with using R and RStudio (in the cloud or on your own computer)

Dan Olner — Thu, 15 May 2025 23:00:00 GMT

Using R and RStudio: (1) posit.cloud in the browser, or (2) running on your own computer

I’m running an “R + regional economic data” taster session in June. It won’t be necessary to use R during the session to follow along - but if you want to have a go at running the code I’ll be talking through and don’t yet have R/RStudio, here’s how to get quickly set up, either online through a browser, or with R and RStudio installed on your own computer¹.

To do this, you’ll need to do one of the following:

Use RStudio in your browser with a posit.cloud account. The free version is limited (very small memory, for instance) but it’s a very quick and easy option to have a play with R and will be fine for the taster session. The next section below talks through setting up in posit.cloud.
Install R and RStudio on your own computer. If you have a machine where you’re able to install your own software then go here to download/install the right version of R for your operating system, and here to download/install RStudio (again, pick the correct one for your OS). (Though see bullet point 2 below if using a work machine.)

The next chunk will talk through setting up a posit.cloud project.
The chunk after that talks through getting started with an RStudio project, and will be nearly the same whether you’re using RStudio online or on your machine (with just one tweak, explained in the breakout box).

Any questions/issues, let me know at d dot olner at sheffield dot ac dot uk or message me on LinkedIn and I’ll try to answer.

If using RStudio online: set up a posit.cloud account and create your RStudio project

Here are the steps to get up and running through a browser using posit.cloud.

Create an account at posit.cloud using the ‘sign up’ box, and then log in. That’ll take you to your online workspace.
Click the new project button, top right.
Select ‘new project from template’ (as in the pic below) and then “Data Analysis in R with the tidyverse” (if not already selected). This template comes pre-installed with the tidyverse package, which we’ll be using. Once it’s selected, click OK down at the bottom. This will open your RStudio project where we’ll do the coding.

Make a new R script and add a library

By now you should either be in RStudio online through posit.cloud or, if you’re using RStudio installed on your own computer, have that open. From here…

Create a new R script by going to ‘file > new file > new R script’ (or using the CTRL+SHIFT+N shortcut). A new script will appear, currently just called ‘Untitled1’ until it’s saved for the first time.
At this point, it should look something like this:

Let’s stop for a moment and look at the separate windows in RStudio.

Bottom left is the console. Commands run here. You can test it by clicking in the console and trying a random command or two like those below (press enter to run a command in the console).

(Note that all code blocks in this post have a little ‘copy to clipboard’ icon in the top right when you hover, if you want to just copy the code for pasting into RStudio).

2+2

[1] 4

Or e.g.

sqrt(49)

[1] 7

Bottom right of the RStudio window has various tabs, including local files (all kept inside your project folder so everything is self contained) and a list of available packages².

NOTE: IF USING RSTUDIO ON YOUR OWN COMPUTER…

We will be using the tidyverse package/library in the session. If you’re using posit.cloud, this package comes pre-installed in the template we selected. However, if you’re using RStudio on your own machine, you will need to install the tidyverse package yourself before we load it as a library.

To do this, just run the following code in the console (the same place we just did our ‘2+2’ test, in the bottom left panel in RStudio.)

install.packages('tidyverse')

You should get a confirmatory message once the package has installed successfully (though it may take a minute or two).
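If you’re not sure whether the install worked (or whether the package was already there), here’s a minimal sketch of a check you can run in the console. The helper name `is_installed` is just made up for this example; `requireNamespace()` is a base R function. It doesn’t install anything, it only tells you what to do next.

```r
# Returns TRUE if a package can be found in your R library, FALSE otherwise.
# (is_installed is a hypothetical helper name, not part of R or the tidyverse.)
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("tidyverse")) {
  message("tidyverse is available - ready to load with library(tidyverse)")
} else {
  message("tidyverse not found - run install.packages('tidyverse') in the console")
}
```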

Now, whether in posit.cloud or on your own computer, you should have the tidyverse package available.

It now needs to be loaded as a library before we can use it:

Put the following text at the top of the newly opened R script in the top left panel.

library(tidyverse)

When you’ve put that in, the script title will go red, showing it can now be saved (it should look something like the image below).

Save the script either with the CTRL + S shortcut, or file > save. Give it whatever name you like, but note that it saves into your self-contained project folder.

Running code in an R script / loading the tidyverse library

All code will run in the console - what we do with scripts is just send our written code to the console. We do this in a couple of ways:

1. In your R script, if no code text is highlighted/selected, RStudio will run the code line by line (or chunk by chunk - we’ll cover that in the taster session).
2. If a block of text is highlighted, the whole block will run. So e.g. if you select-all in a script and then run it, the entire script will run.

Let’s do #1: Run the code line by line.

To test this, we’ll load the tidyverse library with the code we just pasted in (which is just one line of code at the moment!) Put the cursor at the end of the library(tidyverse) line in the script (if it’s not there already), either with the mouse or keyboard. (Keyboard navigation is mostly the same as any other text editor like Word, but here’s a full shortcut list if useful.)
Once there, either use the ‘run’ button, top right of the script window, or (much easier if you’re doing this repeatedly) press CTRL+ENTER to run it.

You should see the code get sent to the console, and a message like the one below confirming that R is ‘Attaching core tidyverse packages’. The tidyverse library is now loaded.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()   masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
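A quick note on those ‘Conflicts’ lines, since they can look alarming: they just mean two packages have a function with the same name, and the most recently loaded one wins when you type the bare name. The other version is still there if you ask for it with `package::function()`. A small sketch using only base R (so it runs whether or not the tidyverse is loaded); the numbers are arbitrary:

```r
# stats::filter() is the base R version that dplyr::filter() masks.
# Here it calculates a centred 3-point moving average.
x <- c(1, 2, 3, 4, 5)

moving_avg <- stats::filter(x, rep(1 / 3, 3))

# The ends are NA because there is no full 3-point window there
as.numeric(moving_avg)
```

The same `::` pattern works the other way round, e.g. `dplyr::filter()`, if you ever want to be explicit about which one you mean.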

That’s it for now! That’ll be enough to be set up for the session. Any questions/issues, let me know at d dot olner at sheffield dot ac dot uk or message me on LinkedIn.

Footnotes

¹ Installing R can be tricky on work machines if your organisation isn’t familiar with it. R needs to access the internet to install libraries, for instance, and this can sometimes hit firewall issues. If you end up having this problem, I suggest trying the online posit.cloud route for now.
² You can treat the terms ‘package’ and ‘library’ as interchangeable in R, but if you want to know the reason: if packages are like books, libraries are where the books are stored - we use the same name as the package to load a library. One of many examples of R being unnecessarily confusing with its language!

Data Stewards’ Network meeting on workflows

Dan Olner — Thu, 27 Mar 2025 00:00:00 GMT

I presented at a Sheffield University Data Stewards’ Network event, talking about the pipelines I’ve been building for ONS economic data and Companies House data (slides online here). As well as evangelising about the wonders of Quarto + R for ease-of-pipeline-making (e.g. downloading / extracting all ONS data, harmonising/combining then auto-updating webpages to serve it), I talked about how open data and tools can help support analytic capacity growth in regional/local government, helping us move a bit closer to a shared sense of ground truth.

Also presenting were the excellent folks from UoS’ Urban Flows Observatory, talking about all the fun they’ve had getting an entire citywide sensor network up and running and making that accessible through their portal.

The slides have some linked interactive pages and reports - a quick list of those here:

Great Britain interactive hexmap of Companies House data I’ve been working to make accessible as well as open (currently ‘open but opaque’). Each 5km-across-hex shows the modal sector (most common there by job count in most recent accounts) and only showing hexes with a min of 50 employees. Hover over the map for a pop up of the sector there - trying to just use the key here is tricky, far too many categories / bad map! Patterns to look out for: the manufacturing doughnut drawing a circle from Sheffield through Birmingham and Manchester; the Southern sci-tech areas, also quite heavily present in Manchester. Also note where there are not a min of 50 employees - an interesting picture of the economic landscape. This was a first test to demonstrate how rich this dataset is if one can access the whole national picture (not just whatever count limits private versions of this data impose e.g. FAME). A lot of work yet to do though…
Higher resolution hexmap just for Yorkshire - 1km-across hex, with minimum ten employees per hex.
South Yorkshire’s four local authorities job percent change since last accounts, compared to ‘core cities’, using the last year’s Companies House data (hover for place name).
Companies House South Yorkshire shiny dashboard 1st draft. Click on firms for more details, view sector and change to ‘percent change from last year’.
Quarto online report example looking at business demography in South Yorkshire (note the nice hover-for-plot feature built in, surprise to me when it compiled!)
Intro to using linked ONS output and jobs data in R (with link to rest of pipeline)
Early draft shiny dashboard for location quotient plots and SICSOC comparison plots (that show relative job skill levels for the chosen ITL2 zone) - see the tabs.
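For anyone wondering what ‘modal sector per hex’ means in practice, here’s a rough sketch in base R. It is not the code behind the maps: the firms, hex IDs and employee numbers are all invented, and I’m assuming the minimum-employee threshold is applied to the hex total.

```r
# Made-up firm-level data: which hex each firm sits in, its sector, its employees
firms <- data.frame(
  hex_id    = c("A", "A", "A", "B", "B", "C"),
  sector    = c("Manufacturing", "Manufacturing", "Retail",
                "Retail", "Sci-tech", "Manufacturing"),
  employees = c(40, 30, 50, 20, 25, 10)
)

# Total jobs per hex/sector combination, and per hex overall
jobs_by_sector <- aggregate(employees ~ hex_id + sector, data = firms, FUN = sum)
hex_totals <- aggregate(employees ~ hex_id, data = firms, FUN = sum)
names(hex_totals)[2] <- "total_employees"

# Modal sector = the sector with the most jobs in each hex
jobs_by_sector <- jobs_by_sector[order(jobs_by_sector$hex_id, -jobs_by_sector$employees), ]
modal_sector <- jobs_by_sector[!duplicated(jobs_by_sector$hex_id), ]

# Keep only hexes meeting the minimum employee count (50 here)
modal_sector <- merge(modal_sector, hex_totals, by = "hex_id")
modal_sector <- modal_sector[modal_sector$total_employees >= 50, ]

modal_sector
```

In this toy data only hex A survives the threshold, with manufacturing as its modal sector.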

I also mentioned the amazing Leeds Research Collaboration Framework and Centre for Cities’ LA Evidential report.

Outputs live list

Dan Olner — Wed, 12 Feb 2025 00:00:00 GMT

What’s this page?

A live list of stuff I’ve produced in my policy fellow role, working with Y-PERN, University of Sheffield Management School and SYMCA.
The majority of this work is being carried out openly on my github repo, mainly in RegionalEconomicTools and ukcompare. I’m writing about it on my tech and non-tech blogs.
I’ll add to this same post as new stuff appears.

Outputs:

RegEconTools website. All of this work comes under the broad heading of “Open regional economic data and tools”; this website is where I’m collating as much of this work as I can. Here’s the website repo.

On the RegEconTools website currently:

Resource: Linked regional GVA and job count data, including more accurate BRES job counts¹.
Explainer / R guide: intro examples for using the regional GVA / job count data (data in the link above).
Explainer / R guide: location quotients and proportion plots for UK regional sectors: processing and mapping (also using the above data).
Explainer / R guide: analysing regional GVA gaps (using the above data). Breaking down by ITL2, ITL3, core cities and mayoral authorities. (Used, for example, to estimate the GVA productivity gap between South Yorkshire and the UK for the Plan for Good Growth.)
Explainer / R guide: comparing different GVA productivity measures using Beatty & Fothergill’s range of metrics.

Other outputs:

South Yorkshire ‘Plan for Good Growth’ sector analysis summary document (PDF) hosted by SYMCA. This highlights some key sector ideas that went into informing the Plan (a small part of a large team’s work, including Metro Dynamics). More detail in the slide decks below. (2024)
Companies House Open project. From the github ‘about’ blurb: “UK Companies House data is already ‘open’ but it’s opaque and a PITA to wrangle. This repo takes care of a bunch of the PITA and makes this amazingly rich dataset more openly accessible.” While it has its weaknesses, it’s an amazing, granular dataset. Outputs from this so far include:
- Companies House open data dashboard for South Yorkshire (work in progress, aiming to make Companies House open-but-opaque data much easier to access).
- “Most common sector per hex” modal maps for GB, giving a quick view of the economic structure of the country. Examples: 1000m hexmap with overlaid ITL2 zones and local authorities, 10+ employees per firm min; same but 5000m and min 50 employees per firm - easier to see broad economic structure and north-south difference (note all the PST bands around London).
- Example of Companies House data use in broader economic analysis: comparing Bradford and Leeds, using CH (alongside ONS data) to dig deeper into most recent economic changes.
Report on business dynamism in South Yorkshire using ONS data that links geographies across time (the original dataset breaks that link by honouring every local authority boundary change; this finds common boundaries to produce a consistent time series). (2024)
Linking the Low Carbon Renewable Energy Economy (LCREE) dataset to GVA: exploring implications for sector analysis, jobs investment and which dials might affect green sector growth. Longer version with more detail; shorter summary version. Both word docs. Underlying quarto doc here; code in the regecontools repo.
Output in support of Local Growth Plan GVA gap analysis (2025), (1) interactive plot: five GVA measures for South Yorkshire expressed as percent of UK average, 2012 to 2022; (2) interactive plot: industry mix ‘if thens’ for South Yorkshire - “if GVA per job were UK average in all sectors, what would SY GVA be?” vs “if industry mix of jobs was same as UK average (holding SY sector GVA the same)…”; (3) interactive plot: Arts Council spend in Core Cities over time, percent of UK average; (4) interactive plot: Completed dwellings per resident for Core Cities over time, per 1000 residents (actual + rank), % of GB average. (Data and code for all these is in the ukcompare repo.)
Leeds childcare accessibility map, in support of work done by Thomas Haines-Doran for WYCA. See here from ONS for more detail about the accessibility values.

Presentations / events

South Yorkshire’s past, present and future: what does the data say?: presentation for the 1st SPERI seminar in a series about South Yorkshire’s political economy and history. Write-up on the Y-PERN blog here. 3.3.25.
“Data action for local growth: what do we want to build?” Presentation to ONS subnational conference, Leeds. All the details and slides in this post. 14.11.24
Y-PERN / SYMCA policy forum 1: GDP and beyond on the Y-PERN blog. 4.6.24
Y-PERN / SYMCA policy forum 2: ‘High-Skilled Growth - The Importance of Universities in Driving Yorkshire’s Economy’ Word doc writeup. 24.10.24
UPEN showcase: Regional Academic Policy Engagement in England (Y-PERN) - working with South Yorkshire Mayoral Combined Authority. Slide deck here. 2023

Techie talks

Sheffield R Users’ Group presentation: “Making Economic Data Accessible”. Slide deck here.
Sheffield Uni Data Stewards Event - presentation on using R to build pipelines for ONS and Companies House data, including links to various maps and plots.
ONS Local Presents: “Using R for regional economic analysis - taster session”. The lovely people at ONS Local gave me a chance to do a whistlestop tour of using R for regional data analysis. Session outline here (including link to getting set up with R either online or on own machine). Course outline here. Online slides here. Both have links to ways to use R online and get the data. A recording of the session is here.

Presentations supporting the SY Plan for Good Growth:

Slide deck for report to SYMCA on initial sector analysis findings highlighting structural change over time. (September 2023).
Slide deck for presentation to SYMCA plus local authorities of Barnsley, Doncaster, Rotherham & Sheffield: sector growth, significance, productivity, GVA vs jobs.
Slide deck for SYMCA / Y-PERN joint meeting on growth and skills including SICSOC analysis showing which job skill level / sector combinations are significantly stronger or weaker between places.

Code for these is mostly in the ukcompare repo, including quarto/rmarkdown slides.

Bits and bobs

Bootstrap estimates of the link between earnings and skills for South Yorkshire using the Annual Survey of Hours and Earnings and Census qualification data. Estimate plot here, code here. Used in the South Yorkshire Skills Strategy to give a central estimate: “If 10% of the population in South Yorkshire with Level 3 earned wages equivalent to those at Level 4 or above in South Yorkshire, total earnings could increase by an average of £200m.” (p.14)
Job count sunburst interactive showing how SIC codes nest in South Yorkshire. Size of slice is number of jobs. Code here.

Footnotes

¹ See the data page for why I think it’s possible to get more accurate job data from BRES directly.

Dan Olner — Wed, 20 Nov 2024 00:00:00 GMT


Open data and code used for ONS subnational data conference

Dan Olner — Sun, 10 Nov 2024 00:00:00 GMT

Off the back of presenting at the ONS subnational data conference, this post collects the open data / code I used in the slides, as well as a few extra bits mentioned in there.

The presentation talks about the huge value and power of ONS data for the UK: how it can help us understand where we’ve come from and where we are now – and so help us work out where we want to go.

There’s a mix here of step by step data walkthroughs and raw code: I want to work on getting more of this into a form that’s as useable as possible, ideally through testing what actually is useful and iterating.

I’ll add in some to-do notes on things that need updating / changing / improving and change this page as those get done.

Data and code used

For the 1971-81 ‘scarring’ work (used in ‘Steel City: Deindustrialisation and Peripheralisation in Sheffield’ with Jay Emery and Gwilym Pryce):
- Harmonised Census data 1971 to 2011: country of birth and employment variables harmonised along with consistent geography. Full explanation in the readme there talks through how to get the data for country of birth (and further down the page there’s a link to the employment data). POSSIBLE TO-DO: HARMONISE WITH 2021 (and 1961???)
- That data is used in this RMarkdown output that produces the plots used (from this repo with general data stitching code). The data for the RMarkdown output, using the harmonised datasets, is processed in this R Script.
For the sector proportion plots, and other code on processing ONS regional GVA data for location quotients, mapping and other bits, see this code and data stepthrough on the regecontools site.
The productivity “GVA vs JOBS percent change” plots and map don’t have a good walkthrough yet - the code (including code to update to latest BRES and regional GVA data) is here in the repo for the first tranche of sector analysis work done for SYMCA, and is fairly readable and self-contained there. That BRES data is automatically extracted using the super-useful NOMISR package in this script and processed in this script (where it’s linked to the LCREE dataset, along with GVA data - work done in this script and then collated for a report in this Quarto doc). TO-DO: MAKE WALKTHROUGH FOR PROD PLOTS
The GVA per hour plot is part of this walkthrough on the regecontools page looking more broadly at some GVA per capita / per hour worked.
The Beatty / Fothergill rank change plot is from this fuller breakdown of their data, with code walkthrough, on the regecontools site.

References from the presentation:

Rice / Venables: “The persistent consequences of adverse shocks: how the 1970s shaped UK regional inequality” here
Sarah Williams, Data Action: Using Data for Public Good.
Martin A. Schwartz, “The Importance of Stupidity in Scientific Research.” Journal of Cell Science 121.
Peter Tennant on Bluesky talking about how we grow and why we need an open mind and to be willing to be wrong.

Bits I didn’t manage to cram in the slides

Analysis of ONS business demography data that links local authorities across the dataset, including business ‘efficiency’ (balance of births and deaths) showing something shifted in more recent years in the south. (I write about automating your way out of an Excel data hole for this project here.)
The incredible Dutch secure data service data used in our Rotterdam project - paper here, supplementary material with a map here. Individual-level data! 100m^2, track over time! Link to other survey data! Secure, trustworthy, easy to use!
Northern Irish Census data - summarised down to 100m^2. Allows you to e.g. see Belfast like this (interactive map).

How to automate your way out of Excel hell & other ONS data wrangling stories (business demography edition)

Dan Olner — Tue, 05 Nov 2024 00:00:00 GMT

I’ve been analysing the latest ONS business demography data (that ONS pull from the IDBR). It contains a tonne of great data on business births, deaths, numbers, ‘high growth’ firms, survival rates, down to local authority level (though sadly sector breakdowns only at national level).

My working report from that is here - hoping to add more
Prep code is here
Quarto code for the report is here

Getting data out of Excel documents can be a bit extremely horrible [noting, to be clear, that Excel docs like this are super useful for many people, but just nasty for those of us who want to get the data into something like R or Python, so…]. In this case, what we’ve got is this –>

For each type of data (firm births, deaths, active count, high growth count etc) there are four sheets covering different time periods, with two spanning two years and two with a single year. Why? That’s unclear until you check the geographies - the local authorities (LAs) used don’t match across sheets. Why? Because the boundaries changed, so there’s a different sheet each year they’ve changed.

So if we want consistent data across all time periods, we’ve got a couple of things to do:

Get the data out of each set of four sheets into one;
Harmonise the geographies so datapoints are consistent.

Luckily, the LA changes have all been to combine into larger units over time (usually unitary authorities) - so all earlier LAs fit neatly into later boundaries. Phew. This means values from earlier LAs can be summed to the larger/later ones - backcasting 2022 boundaries through all previous data.

Some anonymous angel/angels made this Wikipedia page clearly laying out when and what local authorities combined into larger unitary ones in recent years. Using that, we can piece together the changes to get to this function that does the harmonising. It groups previous LAs - that only needs to backcast 2021/2022 names once, no faffing around with each separate sheet - and then summarises counts for those new groups, for previous years’ data.
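To make the backcasting idea concrete, here’s a stripped-down sketch in base R. It isn’t the function linked above: `harmonise_la`, the column names and the birth counts are all invented for illustration. The one real-world fact used is that Bournemouth, Christchurch and Poole merged into a single unitary authority in 2019.

```r
# Lookup: each earlier LA and the later (larger) authority it now sits in.
# LAs that never changed just map to themselves.
lookup <- data.frame(
  old_la = c("Bournemouth", "Christchurch", "Poole", "Sheffield"),
  new_la = c("Bournemouth, Christchurch and Poole",
             "Bournemouth, Christchurch and Poole",
             "Bournemouth, Christchurch and Poole",
             "Sheffield")
)

# Made-up firm birth counts on the OLD boundaries
births_2018 <- data.frame(
  la     = c("Bournemouth", "Christchurch", "Poole", "Sheffield"),
  year   = 2018,
  births = c(100, 30, 70, 250)
)

# Hypothetical helper: relabel old LAs with their later name, then sum counts
harmonise_la <- function(counts, lookup) {
  joined <- merge(counts, lookup, by.x = "la", by.y = "old_la")
  aggregate(births ~ new_la + year, data = joined, FUN = sum)
}

harmonised <- harmonise_la(births_2018, lookup)
harmonised
```

Because the lookup only ever maps earlier names to later ones, the same lookup can be reused for every year’s sheet rather than writing a separate fix for each.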

Prior to that, though, we need to pull the sheets into R. There are a lot of sheets - doing this manually would be baaad…

The readxl package to the rescue! Part of the tidyverse, it can be used to automate extracting data from any sheet and set of cells in an Excel document. I do that in the function here, specifically for pulling out the correct cells from the ONS demography Excel. That’s used in the code here.
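As a flavour of what that automation looks like (this is a generic sketch, not the linked function), the snippet below loops over every sheet in a workbook and pulls the same block of cells from each. It uses the example workbook that ships with readxl rather than the ONS file, and the cell range is arbitrary.

```r
library(readxl)

# Example workbook bundled with readxl (stands in for the ONS Excel file here)
path <- readxl_example("datasets.xlsx")

# All sheet names in the workbook
sheets <- excel_sheets(path)

# Pull the same cell range out of every sheet, giving one data frame per sheet
extracts <- lapply(sheets, function(s) {
  read_excel(path, sheet = s, range = "A1:B4")
})
names(extracts) <- sheets

# Quick look at what came out of each sheet
lapply(extracts, dim)
```

For the real thing, the sheet names and ranges would need to match wherever the tables actually sit in the ONS workbook.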

(Image stolen from here).

What this situation needs is another blog, said no-one ever

Dan Olner — Mon, 04 Nov 2024 00:00:00 GMT

“Another blog! Thank the Gods! Blogging is so now, isn’t it?”

That’s quite enough sarcasm from you. What this lovely website is for:

Using the ace Quarto blogging platform for writing up data / techie / code / mapping stuff in a much more straightforward way than using Jekyll (the previous github frontend, now archived here). RStudio just makes it for you! With some tweaks. Github repo for this blog is here.
A place to explain what I’ve done with R projects - not least explaining them to future me. Future me is very forgetful and needs to have things explained very simply to him.
Get down the techie bits behind work I’m doing to support regional economic data analysis, so I can keep that separate from things like the regional economic tools site (also Quarto).

*Links to my github / linkedin / bluesky / wordpress site (or use links up above).

Here is a picture of a kitten on a unicorn, via here. You’re welcome.

Pub crawl optimiser

Dan Olner — Wed, 07 Dec 2016 00:00:00 GMT

Spatial R for social good!

Well maybe. Sheffield R User Group kindly invited me to wiffle at them about an R topic of my choosing. So I chose two. As well as taking the chance to share my pain in coding the analysis for this windfarms project, I thought I’d bounce up and down about how great R’s spatial stuff is for anyone who hasn’t used it. It’s borderline magical.

So by way of introduction to spatial R, and to honour the R User Group’s venue of choice, I present the Pub Crawl Optimiser.

I’ve covered everything that it does in the code comments, along with links. But just to explain, there were a few things I wanted to get across. (A lot of this is done better and in more depth at my go-to intro to spatial R by Robin Lovelace and James Cheshire.) The following points have matching sections in the pubCrawlOptimiser.R code.

The essentials of spatial datasets: (in ‘subset pubs’) - how to load or make them from points and polygons, how to use one to easily subset the other using R’s existing dataframe syntax. How to set coordinate reference systems and project something to a different one, so everything’s in the same CRS and will happily work together. (The Travel to Work Area shapefile is included in the project data folder.)
Working with JSON and querying services: a couple of examples of loading and processing JSON data using the jsonlite package, including asking google to tell us the time it takes between pubs - accounting for hilliness. This is very important in Sheffield if one wants to move optimally between pubs. Pub data is downloaded separately from OpenStreetMap but we query OSM directly to work out the centroids of pubs supplied as ways.
A little spatial analysis task using the TSP package to find shortest paths between our list of pubs - both for asymmetric matrices with different times depending on direction, and symmetric ones just using distance.
Plotting the results using ggmap to get a live OSM map for Sheffield. Note how easy it is to just drop TSP’s integer output into geom_path’s data to plot the route of the optimal pub crawl.
There’s also a separate script looking at creating a spatial weights matrix to examine spatial dependence. These are easy to create and do very handy jobs with little effort - e.g. if we want to know the average number of real ale pubs per head of population in neighbouring zones, it’s just the weights matrix multiplied by our vector of zone values.
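On that last point about weights matrices, a tiny base R sketch of the ‘neighbour average’ trick. The four zones and their values are invented, and I’m assuming a simple adjacency definition of ‘neighbour’. The step that makes the multiplication an average is row-standardising the matrix so each row sums to one.

```r
# Four zones in a row: 1 - 2 - 3 - 4. A 1 means "these two zones are neighbours".
adjacency <- matrix(c(0, 1, 0, 0,
                      1, 0, 1, 0,
                      0, 1, 0, 1,
                      0, 0, 1, 0),
                    nrow = 4, byrow = TRUE)

# Row-standardise so each zone's neighbour weights sum to 1
weights <- adjacency / rowSums(adjacency)

# Made-up values per zone, e.g. real ale pubs per 10,000 people
pubs_per_head <- c(2, 4, 6, 8)

# Average value in each zone's neighbours: weights matrix times the vector
neighbour_avg <- as.numeric(weights %*% pubs_per_head)
neighbour_avg
```

Zone 2, for example, sits between zones 1 and 3, so its neighbour average is the mean of 2 and 6.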

The very first part of code that’s processing pub data downloaded from OSM - couple of things to note:

Just follow the overpass turbo link via the pub tag wiki page.
I remove the relations line ( relation["amenity"="pub"]({{bbox}}); ) just to keep nodes and ways.
Once you’ve selected an area and downloaded the raw JSON, the R code runs through it to create a dataframe of pubs, keeping only those with names. It also runs through any that are ways (shapes describing the pub building), finds their points and averages them as a rough approximation of those pubs’ point location. I could have selected a smaller subset of data right here, of course, but wanted to show a typical spatial subsetting task.
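The ‘average the points of a way’ step is simple enough to sketch in a few lines of base R. This is an illustration rather than the code in pubCrawlOptimiser.R: the way IDs and coordinates are invented, and it assumes the node coordinates for each way have already been pulled out of the JSON into a data frame.

```r
# Made-up node coordinates for two pubs mapped as ways (building outlines)
way_nodes <- data.frame(
  way_id = c(1, 1, 1, 1, 2, 2, 2),
  lat    = c(53.380, 53.380, 53.382, 53.382, 53.370, 53.372, 53.371),
  lon    = c(-1.470, -1.468, -1.468, -1.470, -1.480, -1.480, -1.477)
)

# Rough point location per way: the mean of its node coordinates
approx_points <- aggregate(cbind(lat, lon) ~ way_id, data = way_nodes, FUN = mean)
approx_points
```

One caveat: a closed way in OSM lists its first node again at the end, so a plain mean counts that corner twice. For something as small as a pub building the difference is negligible.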

A couple of friends have actually suggested attempting the 29-pub crawl in the code (below, starting at the Red Deer and ending at the Bath Hotel). I am not sure that would be wise.

So what would you want to see in an essential introduction to spatial R for anyone new to it?

Migration entropy

Dan Olner — Sat, 16 Jan 2016 00:00:00 GMT

Preamble

One of the parts of my new job (both here and here) is a project examining how migration and a host of other spatial-economic and social things interact. This is awesome news for me: the movement of people (and its interaction with the spatial economy) was both essential to the PhD and a mirror to the stuff in GRIT.

I’ve got a long long way to go with the topic - some of the best social science has been done in this area for a long time, and I’ve got a lot of catching up to do. So this post is just an initial, probably hugely misinformed, maybe plain dumb, ramble - and an excuse to build a little agent-based model in R (not sure I’ll be doing the latter again - back to Java and exporting results to R - but it was fun!)

My initial hook into this post was hearing the idea of white flight (or ‘native flight’ too) in some presentations. The focus was specifically about how immigrants external to the UK might be causing this. With an agent-modelling head on, it feels like you could get something that has its characteristics while actually being little more than random movement plus spatial economics. And, especially, that one has to be very careful to separate out the driving economic forces from the people themselves. That might end up meaning exactly the same thing, but…

To put it another way: I’ve got this, still currently very vague, sense that you could find statistically significant patterns just by arbitrarily labelling one bunch of people as ‘x’ and another ‘y’. Somewhat trivially obviously, you wouldn’t be able to do any quant work if those groups hadn’t been labelled differently - but I want to know if you arbitrarily labelled a random sample, perhaps, how would you tell the effects apart?

A simple thought experiment to illustrate the point. Imagine a variation on Maxwell’s Demon: a box with two halves, joined by a gap that, over time, produces a maximal entropy state, perfectly mixed. Initially all molecules are identical, but the demon has the power to arbitrarily deem 50% of the right-hand box as ‘blue’ and the rest across both boxes ‘red’.

Suddenly, rather than an entirely boring statistical evenness, the red left are being invaded by blues (coming over ere with their entropy-maximising randomness). One could show this by measuring the percentage of red vs blue in the left box as it rapidly dropped (which I do below, with a few additions to this thought experiment). But nothing has changed apart from the labelling - the same particle motion is taking place.

It’s a dumb idea but it gets the point across: there’s a labelling effect that could, in theory, mislead if the underlying process involved is not accounted for. Or alternatively, it’s not misleading at all if that designation of different groups is, in itself, a real feature of social life. Which it obviously is in some ways - but it’s still a tricksy idea. (Compare with Akala talking about the, seemingly entirely arbitrary, difference between ‘immigrants’ and ‘ex-pats’.)
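For the impatient, the bare thought experiment fits in a few lines of base R. This is a deliberately stripped-back sketch, not the fuller model this post builds: the particle count, number of steps and switching probability are arbitrary choices of mine. Every particle follows exactly the same random rule; the only thing distinguishing ‘red’ from ‘blue’ is the label.

```r
set.seed(42)

n_per_box <- 500   # particles starting in each half of the box
n_steps   <- 200   # time steps to simulate
p_switch  <- 0.05  # chance any particle hops to the other half each step

# Where each particle starts
box <- rep(c("left", "right"), each = n_per_box)

# The demon's arbitrary labelling: half of the right-hand box is 'blue',
# everything else (in both halves) is 'red'
label <- rep("red", 2 * n_per_box)
label[sample(which(box == "right"), n_per_box / 2)] <- "blue"

# Track the share of the left box that is 'red' over time
red_share_left <- numeric(n_steps + 1)
red_share_left[1] <- mean(label[box == "left"] == "red")

for (step in seq_len(n_steps)) {
  # Identical random movement for all particles, regardless of label
  switches <- runif(length(box)) < p_switch
  box[switches] <- ifelse(box[switches] == "left", "right", "left")
  red_share_left[step + 1] <- mean(label[box == "left"] == "red")
}

# Starts at 1 (all red) and falls towards roughly 0.75 as the labels mix
range(red_share_left)
```

Plot `red_share_left` and you get a ‘decline of red in the left box’ curve from nothing but mixing.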

Just to re-iterate, none of this is probably relevant to the work that triggered this thought. This is just me working through my intuition. I’m guessing it’s easy enough to distinguish area effects for places with the same overall characteristics/migration-flows but separate out the effect of differing groups. But let’s just carry on with the thought process anyhoo.

Coupla lit bits

There are a couple of facts from my first head-butt of the literature that jump out. First-up are the basic demographic differences involved. Not only do migrants from outside the UK tend to be much younger, but there’s a difference in internal migration rates between ethnic groups too (though obv, best not to conflate ethnicity and external migration!) This is analysed in Finney/Simpson 2008. Their key finding is that, once demographics are controlled for, ethnic groups in the UK -

do not have a significantly different migration rate from the White Briton group when group composition is accounted for. [76]

They also mention, in passing, that:

accommodation that is privately rented is occupied by residents that are almost twice as likely to have migrated in the past year than the average resident.[72]

And vice-versa - home-owners are much less likely to move, a finding that’s consistent across all groups. The Fotheringham et al 2004 paper - a stupendous piece of work - looks just at out-migration rates (between 98 ‘family health service areas’) in England and Wales. They’d found -

a strong positive relationship between out-migration rates and the proportion of nonwhite population in an FHSA…

And, mechanism-wise, they saw two possibilities:

The generally positive relationship could be caused by the white population leaving areas of mixed race or by nonwhite populations having higher migration rates.

Finney/Simpson’s work suggests the latter - but that this is due to demographic differences. Also, among the bzillion dynamics Fotheringham et al analyse, one that jumped out at me was:

Higher out-migration rates are associated with areas of high employment growth, suggesting a high turnover effect operates in such areas. That is, in-migration volumes will be high into such areas because of high employment growth, but recent migrants tend to be highly mobile and out-migration rates are therefore also likely to be high. [1666]

So we’ve got this high churn going on in economically attractive places - which also connects to the housing market, of course. More property-owning pushes an area away from this high-churn. And that could go both ways, couldn’t it? High turnover, from a Putnam perspective, could undermine some social-capital formation and knock on to house prices. Those areas, if generally younger, might also be more urban, less desirable to older groups looking for family homes.

Or prices could be pushed up if people are piling in - but you can see equilibrium pressures at work as out-migration rates increase too.

Model pre-amble

Which segues me nicely into the following silly little model. I’ve got a very long way to go to fully map out the dynamics involved but, here, I just wanted to get started with something very basic. This post is also an attempt to persuade R to do a simple little agent/stochastic model. I’ll wibble a bit at the end about the coding experience…

So this is a sorta-ABM with zero-intelligence agents making the simplest probabilistic moves. It looks like this:

Nine hundred agents initially split evenly between three zones.
All agents have the same 1 in 100 chance of deciding to move on every timestep…
- Though that 1 in 100 chance is weighted slightly by the population of their current zone. If it’s more than an even proportion of the total population, their chance of wanting to move is increased slightly, and vice versa.
Once an agent decides to move, they have a different function to choose where to move.
- For two-thirds of them, they have no preference - they’ll decide to either stay, or move to one of the other two, with equal probability (the population weighting only affects whether they move, not where).
- One third of the agents, however, will have a preference for two of the zones (or a preference against one of them - same thing). This could be slight or large, depending on their preference set.

A few things to note before getting to the code:

The 1/3 agents are arbitrary - it’s 1 in 3 in each zone initially but it doesn’t matter. This is one aspect of the way the problem is thought about that I’d like to be sure I’m thinking straight on - if one (a) marks out a group of people as a specific sub-group and then (b) examines how that sub-group’s flows affect others’, might the result be an artifact of the labelling itself?
I haven’t dug into the original pieces of research in enough depth to make any sweeping statements. So, just to be clear, this piece isn’t in any way meant as a criticism of anyone else’s approach - it’s entirely just me thinking through some of the most trivially basic mechanisms that might be involved.
The population-weighted probability of moving I’m using is a way to push zone numbers back to equilibrium. It could stand in for any kind of pressure to move that agents might come under, from house prices to environment. I’m also aware there are plenty of mechanisms, in reality, that can make larger zones more attractive, not less, e.g. through Krugmanesque increasing returns feedbacks. The assumption here is that all those forces balance to a net-negative response to higher population.
Since I haven’t posted one of these little models for a while, I should point out that, not only do I think this kind of simple model is useful, I believe they can be extremely powerful and criticisms about lack of realism miss the point. See PhD chapter 3. But then I would say that, I suppose.

Right, that’s a lot of wiffle. On to…

The actual code. 1: Set up.

library(plyr)
library(reshape2)
library(ggplot2)
library(zoo)#For running means

‘Store’ just stores each timestep’s data for outputting once the model’s run:

#Store of time series data (to match how table gets converted to dataframe for rbinding)
#Time is iteration
store <- data.frame(zone = as.integer(), 
                    agent_type = as.integer(),
                    number_of_agents = as.integer(),
                    time = as.integer())

Then set per-zone population, each agent’s base probability of moving and the number of timesteps (though note below, I’ve hard-coded stuff that only works with 300 agents per zone… oops).

#population per zone
#(so they can be evenly distributed to zones to start with)
n = 300
#probability of an agent wanting to move on each turn
#1 in 100
p = 0.01
#iterations
ites = 1500

As mentioned, there are two agent types: two-thirds of agents don’t care where they move, if they’ve decided to move. The other third have a preference. These two preferences are set by giving each agent type its own vector for selecting a zone to move to.

This was one easy way of defining how a sub-group can be biased towards two zones: if, as here, their choice vector is 10 / 10 / 1, they only have a 1 in 21 chance of deciding to move to zone 3. (Note the range of other preferences for ‘bias’ in the comments.)

#even probability of choosing any zone (including my own)
even <- c(rep(1,each = 10),rep(2,each = 10),rep(3,each = 10))
#preference for zones one and two
#Slight preference
# bias <- c(rep(1,each = 10),rep(2,each = 10),rep(3,each = 8))
#Stronger preference (weaker pull towards zone three)
# bias <- c(rep(1,each = 10),rep(2,each = 10),rep(3,each = 3))
#Stronger still: even weaker pull towards zone three
bias <- c(rep(1,each = 10),rep(2,each = 10),rep(3,each = 1))
#Won't ever choose 3 (useful for testing assignment works)
# bias <- c(rep(1,each = 10),rep(2,each = 10))
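
A quick way to check what a choice vector implies is to tabulate it and turn the counts into proportions. This is only a sanity-check sketch using the same 10 / 10 / 1 `bias` vector as above; `choice_probs` is a name made up for illustration.

```r
# Same 10 / 10 / 1 choice vector as above
bias <- c(rep(1, each = 10), rep(2, each = 10), rep(3, each = 1))

# Share of the vector taken up by each zone = chance of picking that zone
# (choice_probs is just an illustrative name, not part of the model)
choice_probs <- prop.table(table(bias))
round(choice_probs, 3)
```

Zone 3 comes out at 1/21, a little under 5%, with the rest split evenly between zones 1 and 2.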

Each ‘agent’ is just a row in a dataframe. Each row has an agent’s current zone, whether it’s going to move this turn and a reference to its preference (!). So it’s here that we set 2/3s of agents to ‘don’t care which zone’ (even) and 1/3 to ‘bias’.

A probability column gets added further below that determines their first decision - ‘shall I move this turn?’ Like most agent models, this is a little bit markov-chainy. I think.

#THE AGENTS:
#Keep them all in a single long dataframe
#assign agents to zones initially evenly
#One row per agent
#'moving' is flag: am I moving this turn?
#'prob': flag for which zone probability to use. 
#0 is even; 1 is biased
#Set a third of agents to prefer zones 1 and 2.
#Distribute them evenly between zones to start with
agents <- data.frame(zone = rep(1:3,each = n), 
                     moving = 0, 
                     prob = rep(c(0,0,1),times = n))#codes 2/3 majority

Just to show exactly what that creates: Zones 1 to 3 each have 300 agents in and, within each zone, there are 200 who don’t care where they move to (0) and a hundred (1) that will use the ‘bias’ probability.

table(agents$zone, agents$prob)

Each timestep produces a ‘result’ table that summarises the number and type of agent per zone. Each of these ‘results’ goes into the ‘store’ dataframe for graphing later. But we need an initial ‘result’ to start with, as it’s used to work out how to weight moving probability on the next timestep - but the first timestep needs one too! So this one is just hard-coded to match the agent table above. I should probably work out how not to hard-code this. I’m not going to right now. So!

#WARNING: hard-coding the numbers for this first set of values based on 900 agents in total
#and a 2/3 majority
#So this matches store structure and initial agent state:
result <- data.frame(zone = rep(1:3, times = 2), 
                    agent_type = rep(0:1,each = 3),
                    number_of_agents = c(rep(200,each=3),rep(100,each=3)),
                    time = 0)
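
One possible way round the hard-coding, sketched here rather than tested against the full model: build the starting `result` from the agents themselves, using the same `table` to `as.data.frame` step the loop uses at the end of each timestep. The assumption is that `zone` and `agent_type` arriving as factors rather than integers doesn't matter, since that is also what the loop produces from the first timestep onwards.

```r
n <- 300
agents <- data.frame(zone = rep(1:3, each = n),
                     moving = 0,
                     prob = rep(c(0, 0, 1), times = n))

# Tabulate zone by agent type, as the loop does at the end of each timestep
result <- as.data.frame(table(agents$zone, agents$prob))
colnames(result) <- c('zone', 'agent_type', 'number_of_agents')

# Mark this as the state before the first timestep
result$time <- 0
```

This gives the same six rows as the hard-coded version (200 per zone for type 0, 100 per zone for type 1) and would follow any change to `n` or to the agent split.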

And that’s everything set up. On to ->

2: Running the model…

Here’s the model for-loop itself:

###########
# RUNRUNRUN
for (i in 1:ites) {

  #1. Weight probability of moving by population in each zone
  #Find zone population for this timestep
  zonepop <- aggregate(result$number_of_agents, by=list(result$zone),sum)
  
  #Sensible names for following the logic...
  colnames(zonepop) <- c('zone','population')
  
  #Weight probability of moving by population difference from even
  # zonepop$newprob <- (zonepop$x/(n)) * p
  #raise to power to make a larger effect (but 1 stays 1)
  zonepop$newprob <- ((zonepop$population/(n))^4) * p
  
  #drop any previous newprob column from agents
  agents$newprob <- NULL
  
  #Merge the probability for each zone into the agents
  #zonepop columns one and three is just 'zone' (for matching)
  #and the new probability of moving
  agents <- merge(agents, zonepop[,c(1,3)], by = 'zone')

This first section weights each agent’s probability of moving to the size of the zone they’re in. We know the populations are all even on the first step, so the initial ‘result’ above just uses the base probability, but on future steps it’s higher if more crowded, lower if less.

Note that the probability-calculating line raises the ratio of zone population to the even per-zone share (n) to the power of 4. This makes any deviation from an even population have an increasingly strong effect on agents’ likelihood of deciding to move (or stay, if the population’s lower than even.)
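
To make the size of that effect concrete, here is the same formula applied to three made-up zone populations (270, 300 and 330 are illustrative numbers, not model output):

```r
n <- 300   # even per-zone population
p <- 0.01  # base probability of wanting to move

# Three made-up zone populations that still sum to 900
population <- c(270, 300, 330)

# Same weighting as in the loop: ratio to the even share, raised to the power 4
newprob <- ((population / n)^4) * p
newprob
# roughly 0.0066, 0.0100 and 0.0146

# Extreme case: all 900 agents in one zone
max_prob <- ((900 / n)^4) * p
max_prob
# 0.81
```

So a zone 10% over its even share sees its agents about 46% more likely to want to move, and one 10% under sees about 34% less. The extreme case also shows the weighted value can't go above 1 with these settings, so it stays a valid probability for `rbinom`.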

And then this just returns 1 if I’m deciding to move on this timestep:

  #2. Using weighted prob... Am I moving this turn?
  agents$moving <- rbinom(nrow(agents), 1 , agents$newprob)

You can see this produces roughly a 1 in 100 chance of each agent moving…

table(rbinom(nrow(agents), 1 , agents$newprob))


  0   1 
892   8

But if population were higher in all zones (which it can’t be in the model, but just to illustrate), 10 times the current probability of 0.01 gives about 1 in 10 deciding to move:

table(rbinom(nrow(agents), 1 , rep(10 * p,nrow(agents))))


  0   1 
823  77

table(rbinom(nrow(agents), 1 , rep(10 * p,nrow(agents))))


  0   1 
803  97

table(rbinom(nrow(agents), 1 , rep(10 * p,nrow(agents))))


  0   1 
813  87

And if lower in all zones, agents are more likely to stay put:

table(rbinom(nrow(agents), 1 , rep(0.1 * p,nrow(agents))))


  0   1 
898   2

table(rbinom(nrow(agents), 1 , rep(0.1 * p,nrow(agents))))


  0   1 
899   1

table(rbinom(nrow(agents), 1 , rep(0.1 * p,nrow(agents))))


  0 
900

For those who are moving, in this next step, they decide where to go. The fiddly parts here are the two selections from the ‘even’ and ‘bias’ vectors that tell agents which zone they’re moving to. To explain it as much for my own later sanity as anything, here’s what’s going on. We’re just selecting a random index from each of them. In the case of ‘even’, 1, 2 and 3 have the same chance of being chosen (as can be seen if we whack the number of random selections right up). Whereas ‘bias’ ends up sending only about a tenth as many biased agents to #3 as to either of the other two zones:

table(even[floor(runif(n = 1000000, min=1, max=length(even)+1))])


     1      2      3 
333529 332695 333776

table(bias[floor(runif(n = 1000000, min=1, max=length(bias)+1))])


     1      2      3 
475498 477214  47288

In the zone selection itself, each random selection is the same length as agents of that type. As the comments note, I’m a little amazed this works - I’m not really clear on how that random vector can be created and then, via an ifelse, be assigned to the correct index… oh well, it works!

  #3. If moving, where to?
  agents$zone <- ifelse(agents$moving == 1,#If I decided to move...
                        ifelse(agents$prob==0,#Move based on my preference of zone (or no preference)
                               even[floor(runif(n = nrow(agents[agents$prob==0,]), min=1, max=length(even)+1))],
                               bias[floor(runif(n = nrow(agents[agents$prob==1,]), min=1, max=length(bias)+1))]),
                        agents$zone
  )
  
  #Explanation for the zone selection above, since I'll probably forget.
  #Choose a random position from my (either even or biased/weighted) array of zone choices
  #Passing in a uniform random pick from each of the choice arrays
  #Of the correct length (the nrow subset)
  #It's still some form of witchcraft though - how does R know to distribute
  #the result to the correct index via the ifelse???
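
On the ‘witchcraft’: going by the documented behaviour of `ifelse`, nothing is being routed to a matching index. `ifelse(test, yes, no)` works position by position and recycles `yes` and `no` up to the length of `test` if they are shorter. So the 600- and 300-long vectors of draws get stretched to 900, and each moving agent takes whatever draw sits at its own row number. That is fine here because every element is an interchangeable random draw, though recycling means two agents can occasionally end up sharing one draw. A minimal sketch of a more explicit version, assuming the same `even` and `bias` vectors, makes one draw per agent from each vector so no recycling is needed (`even_pick`, `bias_pick`, `new_zone`, the seed and the inflated 0.5 move chance are illustrative only):

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

even <- rep(1:3, each = 10)
bias <- c(rep(1, 10), rep(2, 10), 3)

agents <- data.frame(zone = rep(1:3, each = 300),
                     moving = 0,
                     prob = rep(c(0, 0, 1), times = 300))

# Inflated move chance, just so the sketch has plenty of movers
agents$moving <- rbinom(nrow(agents), 1, 0.5)

# One independent draw per agent from each choice vector
even_pick <- sample(even, nrow(agents), replace = TRUE)
bias_pick <- sample(bias, nrow(agents), replace = TRUE)

# Row by row: movers take the draw matching their type, everyone else stays put
new_zone <- ifelse(agents$moving == 1,
                   ifelse(agents$prob == 0, even_pick, bias_pick),
                   agents$zone)
```

This wastes some random draws (every agent gets two, most use neither) but every vector passed to `ifelse` is the same length as `agents`, which makes it easier to see what ends up where.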

The result for this timestep is then stuck into the store for output later (also adding in a column to mark the current iteration). Converting a table to a data.frame reshapes it so it’s the right orientation to bind to the ongoing ‘store’ of results:

  #GET THE RESULT OF THIS ITERATION AND STORE IT
  #automatically reshapes, it turns out
  #so zone is in first column, zone pref type in second
  result <- as.data.frame(table(agents$zone,agents$prob))
  
  #Rename those fields to something sensible
  colnames(result) <- c('zone','agent_type','number_of_agents')
  
  #mark what iteration it is
  result$time <- i
  
  #Then add this step to the end of the data store
  store <- rbind(store, result)
  
}#end for

Display results

So that’s the results found. Now to show ’em. First-up, let’s add some extra data for total population per zone on each timestep:

#Find "total population in each zone at each iteration"
#Will be added as an extra bunch of rows to the output dataframe
#To fit the 'long' format ggplot wants
totpop_perzone_timestep <- aggregate(store$number_of_agents, by=list(store$zone, store$time), sum)

colnames(totpop_perzone_timestep) <- c("zone","time","number_of_agents")

#Make a new column for faceting the data.
#This one will be total population per zone
totpop_perzone_timestep$facet <- 'total pop'

#Relabel store's two agent types so each can have its own facet
store$facet <- 'agent-type: no pref'
store$facet[store$agent_type==1] <- 'agent-type: bias'

#Drop old agent_type column
store$agent_type <- NULL

#Stick 'em together in a new store
store2 <- rbind(store,totpop_perzone_timestep)

Now the data’s ready - just one nice little addition by combining ddply and rollmean from the zoo package to give us a running mean. This can help show the trend over time in a simple way. This sort of thing is really satisfying in R, when it works. One line! So ddply is applying the running mean for the number of agents in each zone/facet sub-group:

#Running mean...
smood <- ddply(store2, .(zone,facet), mutate, 
               rollingmean = rollmean(number_of_agents,250,fill = list(NA, NULL, NA)))


#linewidth replaces the size aesthetic for lines (ggplot2 3.4.0 onwards)
output <- ggplot(smood) +
  geom_line(data = smood, aes(x = time, y = number_of_agents, colour = zone), alpha = 0.3, linewidth = 1) +
  geom_line(data = smood, aes(x = time, y = rollingmean, colour = zone)) +
  facet_wrap(~facet)

output

Warning: Removed 747 rows containing missing values or values outside the scale range
(`geom_line()`).

And there’s the basic result, then:

The ‘bias’ agents (left-hand plot) move more to zones 1 and 2, as you’d expect.
The ‘even’ agents in the middle plot, who (all other things being equal) don’t care where they go, end up having a larger number in zone 3 as they respond to the pressure of increasing population. Remember, that pressure is only coming from one-third of the agents.
Overall, zones 1 and 2 end up with higher total populations because of the minority group’s preferences.

I’ve not tested whether/at what point the ‘bias’ agents’ preference function wouldn’t outweigh the population push but, here at least, their large preference for those two zones wins out.

To look at it from another perspective, the following re-jigs the data so we have the proportion of ‘even’ vs ‘bias’ agents in each zone:

#Different output for looking at the proportion change of the two groups in each zone
#Use the original 'store' for this
#Convert wide so that each agent type has its own column
#To make finding proportion  per zone easier
proportions <- dcast(store, zone+time ~ facet, value.var = "number_of_agents")
 
#Proportion of majority as percentage of whole
proportions$percent_majority <- (proportions$`agent-type: no pref`/
                       (proportions$`agent-type: bias`+proportions$`agent-type: no pref`))*100

output <- ggplot(proportions, aes(x = time, y = percent_majority, colour = zone)) +
  geom_line() 

output

Zones 1 and 2 see a lot of ‘even-agent flight’, it seems. Which makes perfect sense given the trivially simple dynamic: it’s nothing more than a response to some equilibrium pressure as one group who prefers a zone (for whatever reason) decides to move there.

All agents, regardless of type, are affected in the same way by population pressure: their decision to move is the same. They only differ in where they prefer to move to. Many of the ‘bias’ group prefer to move somewhere in the same zone or a similar one.

I’m really labouring the point now, I know, but… the point being, assigning causality here is a little murky. Without the ‘bias’ groups’ preference, the ‘evens’ wouldn’t end up dominating zone 3.

This is, in a way, just a slightly different take on the Schelling segregation dynamic, except that it’s not about people’s preferences for any particular type of neighbour, but rather some people’s preferences for particular places, and what the knock-on effects of that could be.

Random coding wibble

So that’s enough ill-informed migration wiffle. On to coding wibble. That was in some ways amazingly easy to set up, and R does some things just beautifully. The tricky part: I went away for a month and then it took me about two hours of staring to figure out how it worked. I fixed that by naming variables sensibly. Phew.

As far as I was able to figure, there was no way to circumvent the main for-loop and consequently it’s pretty slow. “As far as I was able to figure” isn’t very much, so perhaps there’s a way of making this more R-native - though I suspect the kind of non-ergodic timestep processes that drive ABM might not be R’s forte. Though though: the slow part is actually the rbinding and table-making, so there may be another way.
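
One commonly suggested ‘other way’ for the rbinding part, sketched here with dummy per-timestep data and not timed against the actual model: keep each timestep’s result in a pre-allocated list and bind the lot once after the loop, rather than growing `store` on every iteration. The `results` list and the placeholder data frame are assumptions for illustration.

```r
ites <- 5  # tiny number of iterations, just for the sketch

# Pre-allocate one slot per timestep
results <- vector("list", ites)

for (i in seq_len(ites)) {
  # Placeholder for the real per-timestep table of agents per zone
  results[[i]] <- data.frame(zone = 1:3,
                             number_of_agents = c(300, 300, 300),
                             time = i)
}

# Bind everything in one go at the end
store <- do.call(rbind, results)
```

The loop itself stays (each timestep still depends on the previous one), but the repeated copying of an ever-longer data frame goes away.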

On the plus side, I can write it up like this in RMarkdown… though perhaps Python will let me do that too, if I can get round to trying it. And my first thought on coming back to the model after a break: if this were Java, it would perhaps make a lot more sense right now, and run faster. OOP and ABM go together: it’s very easy to see what the agents are. Here, it’s pleasing in its brevity but challenging to keep all the working parts in mind.

UK trade flows

Dan Olner — Wed, 26 Nov 2014 00:00:00 GMT

This is one of the fun things I coded up in the process of developing the last grant I worked on. I’ll explain a bit about it and then share some thoughts on whether it’s any good as a visualisation. There’s a sharper HD version of this video here and a dist.zip file on the github page if you want a play.

Your standard input-output table takes a bunch of economic sectors and, in a matrix, gives the amounts of money flowing between each of them. For the UK, we’ve got ‘combined use’ matrices that include imported inputs moving between sectors, as well as domestic use only, excluding imports. (These two work with different types of prices, though, so they’re not directly comparable.)

This is the boiled-down version of the data I use, from the first data link above: the 2012 combined use matrix. Github gives you a scroll bar at the bottom to view the whole CSV file. The sector names are only in the first column, but they also apply to each column heading along the top. So, for example, the first number column starting with 2822: this is what ‘agriculture, hunting, related services’ spends on other sectors. So the first value is what agriculture spends on itself (it’s in millions of pounds; the matrix diagonal gives the amounts each sector spends on itself.) This is a tip from Anne Owen that’s always helped me: think of each column as a receipt of what that sector has bought. So summing the receipt gives you that sector’s total consumption. Summing each row gives you its total demand - how much others buy from it.
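
The ‘receipt’ reading can be sketched with a toy matrix. The three sectors and every number below are invented for illustration and have nothing to do with the real 2012 table; only the row/column logic carries over.

```r
# Toy 3-sector matrix: rows are the sector being bought from,
# columns are the sector doing the buying (all values made up)
sectors <- c("agriculture", "manufacturing", "services")
io <- matrix(c(10,  5,  2,
               20, 40, 15,
                8, 30, 60),
             nrow = 3, byrow = TRUE,
             dimnames = list(seller = sectors, buyer = sectors))

# Each column is a receipt: summing it gives what that sector bought in total
total_bought <- colSums(io)

# Each row sums to what the other sectors (and itself) bought from that sector
total_sold <- rowSums(io)

# The diagonal is what each sector spends on itself
own_spend <- diag(io)
```

In this toy case agriculture’s receipt adds up to 38 while only 17 is bought from it, and the diagonal (10, 40, 60) is each sector’s spend on itself.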

The visualisation shows what this matrix looks like if you stick it into a force-based layout and make each money flow a moving circle. The live version is interactive, allowing you to explore sector connections.

So: any good as a visualisation? Before I’d produced it, I would have said, mmmm - not really. It’s fun to play with but doesn’t really convey information. It does manage to give a quick overview of the relative size of sectors and how much money moves between them, but you can’t ask it any useful quantitative questions. I’ve since learned a lot more about the internal structure of these IO matrices using R - perhaps that’s something I’ll come back to. I have also coded a ‘random walk centrality’ test (that code is in the source files, though it’s turned off at present) - so it’s certainly possible to use the network structure to do some analysis.

Something unexpected happened with this visualisation, though. It engaged people. Prior to this, I probably wouldn’t have thought that was an important thing but, looking back, having something like this that’s able to draw someone in - that’s turned out to be very useful. One of my colleagues used it in a tutorial and apparently they were really taken by it.

That kind of initial hook can be enough to make someone want to find out more. That’s been a useful lesson for me. If I were drawing up a criteria list for successful visualisations, this one’s made me think of adding ‘engagement value’ or ‘hook power’ or somesuch. This IO viz has plenty of that. I think it manages to give an impression of the economy as a whole that would otherwise be hard to see. (Though there are reasons to distrust the picture it paints: it tells you construction is by far the biggest sector - it wasn’t until 2013, when ONS took three separate construction sectors and combined them.)

But another visualisation criterion should, of course, be ‘does it communicate information effectively?’ This doesn’t manage so well. Perhaps the ideal is to maximise communication / information / hookiness. Perhaps there’s a trade-off there too - making something that might initially make a person go ‘wooooo’ will probably mean, after a few minutes, they’ll realise it’s a bit meaningless.

Even so: prior to this, it would never have occurred to me that hookiness could be useful in itself. For the grant, this viz helped me say: “look, these are the money flows moving in the UK. We want to know where in the UK they move”.

This is also a good example of why I still like Java. There’s a lot of work going on there - it would likely run unusably slowly in javascript. This takes us straight back to the ‘wooo/information’ trade-off though. One might argue the computationally intense stuff it’s doing is useless for conveying information - and including it, insisting on a more powerful codebase, is cutting it off from an easily accessible home on the web.
