Sep 17, 2017 - Learn ggplot and R using English house price and wage data


I recently ran a day course on visualising data with R for the Urban Big Data Centre in Glasgow. I used the rather high-falutin’ phrase ‘principles of visualising data with ggplot’ - why principles? Because (a) once you get the basic principles of ggplot, you can run with the rest and (b) they’re not all that obvious when you’re starting out. So the aim of the day was: get to the point where the ggplot cheatsheet makes some sense. (We also worked with the data wrangling cheatsheet; you can get to both of these via RStudio’s own help menu too.)

As with the last workshop, I’ve written it up so it can be worked through by anyone.

  • There’s a preamble course chunk for anyone totally new to R. It goes through the basics of R, RStudio and dataframes - it’s in the form of an R project. That zip can be downloaded from here. The PDF is in the course booklet folder.

  • The main course folder can be downloaded here. It’s the same folder structure but doesn’t have an already-made RStudio project, so you’ll just have to make a new one - that’s explained in the course booklet (again, in the folder of the same name).

The day used two datasets, pre-wrangled into more friendly form. First, Land Registry house price data for England - the original data can be downloaded in one ~4GB chunk from here. Second, median wage data at local authority level from NOMIS. There was also a bit of Census employment data in there.

One of the main things to get across about principles: data wrangling and visualisation are two sides of the same coin. The tidyverse has done a sublime job of making those two things work smoothly together in R. As Hadley sez: ‘by constraining your options, it simplifies how you can think about common data manipulation tasks’. I went for a tidyverse-first approach and I think it paid off. As long as you’re very careful to be clear what’s actually happening with things like the pipe operator (clue’s in the name so it’s intuitive enough once explained), it’s all good. It’s also much, much less boring than wading through the entirety of base-R first. Bits can be introduced where needed but you don’t need everything.
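(A minimal illustration of that pipe point, for anyone who hasn’t met it: `%>%` just passes the left-hand result in as the first argument of the function on its right, so these two do exactly the same thing.)

```r
library(magrittr)  # the pipe; also loaded with the tidyverse

# The pipe passes its left-hand side as the first argument of the
# function on the right, so these two are equivalent:
piped  <- c(4, 9, 16) %>% sqrt() %>% sum()
nested <- sum(sqrt(c(4, 9, 16)))

piped == nested  # both are 2 + 3 + 4 = 9
```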

At some point I’ll post up the pre-wrangling code: it’s useful for showing how to get Land Registry housing data geocoded using the - also freely available - codepoint open postcode location data, binning house location points into geographical zones etc.

I wasn’t sure how much of the original housing data would fit onto the machines we were using, so only included a subset of data to use (that’s in the zip above). The whole thing for England can be downloaded here as an RDS file. Load into R (assuming you’ve downloaded and put into a data folder in your R project) with the following. It decompresses to about 1.7GB in memory.

sales <- readRDS('data/landRegistryPricePaidEngland_w_LADs_allchar.rds')

The most startling part for me was linking house price data to the NOMIS median wage data - I talk about that a lot more in its own post here. tl;dr: London! Oh my God!

The optional extras at the end include something on prettifying graphs, how to cram as much into one graph as humanly possible. This quote’s been doing the rounds: ‘a designer knows they have achieved perfection not when there is nothing left to add, but when there is nothing left to take away’… RUBBISH! THROW IT ALL IN THERE! EVERY FIGURE IS 250 WORD EQUIVALENT IN PAPERS! Well, no, it’s not rubbish, obviously, but there are times…

There’s also a bit on outputting multiple plots to a folder and lastly, inspired by the amazing R for Data Science book, a section on visualising coefficients and error rates from many regression outputs. Statistically dodgy (spatial/temporal data with no account of autocorrelation) but gets the point across. Because regression tables schmregschmession schmables, I say. See below: it’s just Census employment data versus house prices in some TTWAs. Oh and then a little easy mapping thing to end to illustrate another idea for using dplyr.
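(The many-models pattern there is roughly the following - a sketch using broom on a built-in dataset, with mtcars standing in for the course’s employment-vs-price data:)

```r
library(dplyr)
library(broom)

# Fit one regression per group, then pull out tidy coefficient tables -
# the estimate/std.error columns are what you'd plot with geom_pointrange
models <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

head(models)
```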

Any questions, feel free to ask: d dot olner at sheffield dot ac uk or on that there Twitter.

House prices as a multiple of wages


Pulling out model values into a viz

X axis: percent increase in house prices for a 1 percent increase in employment in wards for those TTWAs. (Note point above: illustration of pulling out model values, not a good model!)


Sep 17, 2017 - English House prices - how have they changed compared to median wages


This post looks at how to plot house prices as a multiple of yearly wages for English local authorities. There should be everything here you need to recreate the plots - as with the one day training course this comes from, you’ll need to stick the data in a data folder in an R project first. (The post on the day course links to the materials if you need guidance on doing that.)

Without adjusting raw house prices, their value over time is not really comparable. Adjusting them just for straight inflation can be misleading, as house prices aren’t actually included in that measure. A better measure is ‘cost of housing as a multiple of wages’. This also allows for a more geographically interesting picture. NOMIS median wage data is available yearly for each local authority and goes back to 1997, so you can build a house price index for each local authority up to 2016. (E.g. ‘this house is worth four times the yearly median wage here.’)
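To make the index concrete (toy numbers, purely illustrative - the NOMIS figures are weekly wages, so they need multiplying up to yearly before dividing):

```r
# Illustrative numbers only: a 500/week median wage
# and a 250,000 mean house price for some local authority
medianwage_weekly <- 500
yearly_wage <- medianwage_weekly * 52   # 26,000 a year
mean_price <- 250000
wage_multiple <- mean_price / yearly_wage

round(wage_multiple, 1)  # houses cost ~9.6 times the yearly median wage here
```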

The graph in the previous post shows this index for the top and bottom five local authorities (from a rank of 2016 wage multiples). But I’d left out the top value - Kensington and Chelsea. It’s so far above the other values, it makes them rather unreadable. (During the course, we did look at Kensington/Chelsea in another section, so it was covered.)

Here’s what it looks like with Kensington and Chelsea included. Note: this uses a pre-selected sub-set of the more well-known local authorities we used on the course. For reproducing it yourself, I’ve included code and data below that has all the local authorities - London, of course, has more of the top values once all are used. Note note: this is using mean house prices per local authority - there’s also code below for using the median, which brings the values closer to those in e.g. this Guardian article. I think using mean house prices might give a better picture of how some very high value properties (of which London has plenty of course) affect affordability - views welcome.

Anyway - data!


Yikes. The average house in Kensington and Chelsea has gone from just under twenty times the median wage to not far off seventy times now. Good God. Part of the reason is shown if you just plot the wage multiple against the median wage itself:


Kensington and Chelsea is out there on the right by itself: actually not that high a median wage but by far the largest wage multiple for housing. So what this is likely showing: as well as having some of the priciest housing, it also has many lower-waged people. It’s a highly mixed area. In some imaginary future (I’m sure this would never happen…) where much of the area’s social housing tenants had moved elsewhere, the wage multiple would drop.

The NOMIS data actually contains every wage percentile - it’s very gappy for the 90th percentile, as they don’t include numbers that aren’t statistically reliable and there aren’t enough people in that category. At any rate - looking at other percentiles might provide a deeper look at what’s going on.

The lowest-value areas (easier to see in the minus-Kensington graph in the other post) have certainly seen the gap increase - something like five times to ten times the median wage over the data period. But there’s also a dynamic that I’ve seen in data at local scales too: after the 07/08 crash, the poorest areas have seen house values decline (even relative to wages, it appears) where the richer areas have continued on up. For some of the richest, the crash looks like little more than a dink in that.

There are some other slightly puzzling areas with very high median wages and low house values. I’m a little suspicious of the data for those (e.g. Copeland in the point plot above) but more digging also required…

Code and data to play with yourself

To end: here’s some R code with links to tidied data for both median wages and English house prices if you want to reproduce this or look at different areas. It’s about the same as the course code but (1) it uses all local authorities so London now has six of the top seven values; (2) I’ve added median house price in too (plotted at the end), which is more in line with the values in the Guardian article linked above. At some point I’ll come back to grabbing and tidying the data / getting the Land Registry data geocoded etc.


#Load packages used below (tidyverse for dplyr/tidyr/ggplot2/readr,
#forcats for fct_reorder, lubridate for year())
library(tidyverse)
library(forcats)
library(lubridate)

#Load median wages at local authority level 97 to 2016
#taken from NOMIS and tidied
#This is in the course folder or download from here then stick in data folder for project
wages <- read_csv('data/medianWages_localAuthority.csv')

#Gather year into a single column
wagesLong <- wages %>% 
  gather(key = year, value = medianwage, `1997`:`2016`)

#If you want a quick look at the wage data...
# ggplot(wagesLong, aes(x = year, y = medianwage)) +
#   geom_boxplot()

#Load all tidied English housing data, not just sub-selection
#Download from here then stick in data folder for project
sales <- readRDS('data/landRegistryPricePaidEngland_w_LADs_allchar.rds')

sales$year <- year(sales$date)

#Summarise house prices at local authority level and per year
saleSummary <- sales %>% 
  group_by(localauthority,year) %>% 
  summarise(meanPrice = mean(price),
            medianPrice = median(price))

#Make year numeric
wagesLong$year <- as.numeric(wagesLong$year)

#Join house price and wage data
#(wage data has 'Area', the house price summary has 'localauthority')
price_n_wage <- inner_join(wagesLong, saleSummary,
                           by = c('year', 'Area' = 'localauthority'))

#Make new variables using mutate - yearly wage and house price as multiple of that
price_n_wage <- price_n_wage %>% 
  mutate(
    medianwageyearly = medianwage * 52, # weekly wage to yearly wage
    wagemultipleFromMean = meanPrice / medianwageyearly, # mean house price as multiple of that yearly wage
    wagemultipleFromMedian = medianPrice / medianwageyearly # median house price as multiple of that yearly wage
  )

#Use 2016 wage multiple as an index for picking out the top and bottom England values
#Handy to look at directly
price_n_wage2016 <- price_n_wage %>% 
  filter(year == 2016) %>% 
  arrange(desc(wagemultipleFromMean)) #rank so slice() below picks top and bottom

#Zones to show in plot (one of at least three ways of doing this)
#Include Kensington and Chelsea, exclude last entry as no wage data 
zoneselection <- price_n_wage2016 %>% slice(c(1:7,309:315))

#Not South Bucks - rest of top 7 are in London
londonzones <- zoneselection$Area[1:6]

#Add flag for London zones
price_n_wage <- price_n_wage %>% 
  mutate(inLondon = ifelse(Area %in% londonzones, 'london','other'))

ggplot(price_n_wage %>% filter(Area %in% zoneselection$Area), 
       aes(x = year, y = wagemultipleFromMean, colour = fct_reorder(Area,-wagemultipleFromMean))) +
  geom_line(aes(linetype = inLondon), size = 1) +
  ylab('wage multiple') +
  labs(colour = 'local authority', linetype = ' ') +
  scale_y_continuous(breaks = seq(from = 0, to = 80, by = 5),
                     labels = c('0','5x','10x','15x','20x','25x','30x','35x','40x','45x','50x','55x','60x','65x','70x','75x','80x')) +
  ggtitle("House prices as multiple of yearly wage\ntop and bottom seven English local authorities\nFrom mean house price") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))


And the same plot but with median house price


Jun 16, 2017 - A script for automating RMarkdown to Jekyll


I’m going to be making a little more of an effort to get some coding up on this blog - not least because I’m always going, ‘now where’s that chunk of really useful code that does x and y?’ and having it here would make it easier for me to find.

So to start with, here’s something a little bit meta. I’ve been using RMarkdown on Windows to create blogposts for Jekyll, and part of that’s been using this little script to Knit the Post into jekyll-ready markdown.

So this post assumes you’re already doing something similar and know how to write in RMarkdown / copy the resulting Knitted posts and figures to your Jekyll site etc. At some point I need to write a full start-to-finish guide to doing this: it’s got a lot easier with some recent changes too. But for now, I just want to look at how to make the process smoother by automating some of the faff.

It might be things have moved on and there’s some much easier way of doing this internal to RStudio - do let me know if so!

That basic KnitPost function creates the md file and also puts any figures in a folder in the local project. You then have to copy these to the correct place in your Jekyll folder - and also edit the header so it’s Jekyll-ready. Once you’ve done all that, you can preview locally using jekyll serve before pushing to the interwebs (as I’m doing here on my site).

Which is great - but if you make an edit in your Rmd, you have to then go through that process all over again. Mucho Fafferoo. I always have to make minor edits after I’ve posted something, as I suspect most people do: you don’t see a lot of errors or typos until you’re previewing and you’re bound to think of something you missed. Having to go through this faff each time? Harrumph.

So I’ve written an update to the above script that moves the files to the correct place once they’ve been created and replaces the header with the correct YML. This means you just need to write the thing, then call the KnitPost function, and can preview changes immediately.

The full function is below - just edit the URL of the Jekyll website and the path to your local Jekyll files.

And rather than load the function each time or package it up, I’ve just stuck mine into the Rprofile.site script so it’s loaded whenever RStudio runs. (I think I got that from here.) You’ll find that script in the Program folder of the R installation, in the ‘etc’ folder. (E.g. the path to mine is ‘C:\Program Files\R\R-3.3.0\etc’). Just copy/paste the function at the end of the script.
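(In other words - a sketch, assuming the standard Rprofile.site startup file; check your own etc folder - either paste the function straight in, or keep it in its own file and source it:)

```r
# Lines added to e.g. C:/Program Files/R/R-3.3.0/etc/Rprofile.site,
# which R sources at startup. The path to KnitPost.R is hypothetical -
# point it at wherever you keep the function.
if (interactive()) {
  source('C:/path/to/KnitPost.R')
}
```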

Then you can just write your RMarkdown script and run:

KnitPost('nameOfTheFile.Rmd','This is my lovely Jekyll article that will go to the right folder along with the figures')

The only little thing to note: it relies on there being three dashes to mark where to replace the header. By default, this is what RStudio creates if you make an Rmd script from its file menu. In theory it should be easy to code it to detect any number over three. I’ll leave that as an exercise for the reader…
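(If you did want to handle that - a hypothetical sketch, not part of the function below - a regular expression matching a run of three or more dashes would do it:)

```r
# Find the line closing a YAML-ish header, allowing '---', '----', etc.
lines <- c("----", "title: test", "----", "body text")
# Search everything after the opening delimiter for the closing one
closing <- which(grepl("^-{3,}\\s*$", lines[-1]))[1] + 1
body <- lines[(closing + 1):length(lines)]
body  # just the post body, header stripped
```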

Right, this has nicely reminded me how this whole Jekyll blog thing works. Now let’s start getting some less meta topics up here. Bye for now!

#Jekyll knitpost function
#Input should be the name of the Rmd file (won't need path if it's in your project)
library(knitr)

KnitPost <- function(input, articleName = NULL) {
  #Adapted from 
  #Change to your site
  base.url <- c("")
  #Move the md file and figures here (to appropriate folders) after the knit process
  #Needs doing afterwards to avoid wrong path.
  myjekyllpath = 'C:/localpathtoyourjekyllsite'
  #Final article name
  #use filename if not included
  if (is.null(articleName)) {
    articleName <- sub(".Rmd$", "", basename(input))
  }
  #so you're not creating the fig folder every time
  ifelse(!dir.exists(file.path('figs')), dir.create(file.path('figs')), FALSE)
  opts_knit$set(base.url = base.url)
  fig.path <- paste0("figs/", sub(".Rmd$", "", basename(input)), "/")
  opts_chunk$set(fig.path = fig.path)
  opts_chunk$set(fig.cap = "center")
  knit(input, envir = parent.frame())
  #Make edits and moves----
  #Make matching folder if it doesn't exist (we may be overwriting files but not the folder)
  targetdir <- paste0(myjekyllpath, '/figs/', sub(".Rmd$", "", basename(input)))
  if (!dir.exists(targetdir)) dir.create(targetdir, recursive = TRUE)
  filestocopy <- list.files(path = fig.path, full.names = TRUE)
  #Move figure outputs to jekyll folder
  file.copy(from = filestocopy, to = targetdir,
            overwrite = TRUE, recursive = FALSE,
            copy.mode = TRUE)
  #Edit .md file to have jekyll-ready header----
  md <- readLines(sub(".Rmd$", ".md", basename(input)))
  #remove first line (the opening dashes)
  md <- md[2:length(md)]
  current <- md[1]
  while (current != '---') {
    #slice off the top
    md <- md[2:length(md)]
    current <- md[1]
  }
  #Then remove that line too once while done
  #(No do-while)
  md <- md[2:length(md)]
  #Add in jekyll header. This kinda structure:
  # ---
  # layout: post
  # title: "testing again"
  # date: 2016-11-25
  # comments: true
  # ---
  theDate <- format(Sys.time(), "%Y-%m-%d")
  newHeader <- c(
    '---',
    'layout: post',
    paste0("title: \"", articleName, "\""),#this does work despite the output putting the escapes back in.
    paste0('date: ', theDate),#see ?date last example
    'comments: true',
    '---'
  )
  md <- c(newHeader, md)
  #Jekyll expects date-prefixed filenames in the _posts folder
  mdFileName <- paste0(myjekyllpath, '/_posts/', theDate, '-',
                       gsub(' ', '_', articleName), '.md')
  #Ready to add to Jekyll folder
  writeLines(md, mdFileName)
}