Jul 3, 2015 - How wiggly is Britain?

Comments

For my first foray into r-markdown (and getting it more-or-less directly into a jekyll blog), let’s have a go at calculating a ‘route factor’ for Great Britain and seeing how it changes for shorter/longer distances. The data comes from Google’s route matrix API. I won’t cover harvesting that data - the source code and explanation for that is here, along with the code below and a copy of the CSV data.

The ‘route factor’ is just the ratio of a route distance between two points over its Euclidean distance (or Great Circle equivalent, which is what I use here via the sp package). So if the distance to drive a route is twice as far as the crow flies, the route factor is 2. In modern parlance it’s known as circuity (note the handy diagram in that link) though I first picked the idea up from this rather wonderful 1979 geography textbook and now can’t stop calling it ‘route factor’.

I’d originally harvested google distance data in order just to show that, if goods were moving between random points in the UK, the distribution would be clearly different from the spatial decay present in the ‘goods lifted’ data (it is). The idea of the route factor itself became useful when I had two different sources for how far freight goods move within the UK - one was over Euclidean distance (trade between the now defunct Government Office Regions), the other a route distance given by hauliers in a survey. The two results almost, but not quite, agreed on the spatial decay of trade. If adjusted for the route factor, would they match? (Pleasingly, they do.)

This data also offers the intriguing possibility of checking how circuity changes depending on the route distance. Intuitively, one would expect that shorter routes would be more wiggly and, for larger journeys, the ability to utilise more major roads would increase the efficiency. We’ll take a look at that here.

I’ll be using sp’s spDistsN1 function to get the great-circle distance between points - in what seems like a clumsy way to me, so it’d be good to hear about better options.

So here we go. First, the libraries and data we’re using…

library(sp)
library(ggplot2)
library(Hmisc) # for binning distances
routes <- read.csv("latestrbindOfMatrixOutputs.csv")

That’s ~5.5 thousand routes between random points in Great Britain. We’ve got the following variables in ‘routes’ that Google’s matrix API provides:

colnames(routes)
##  [1] "X.1"         "X"           "distance"    "time"        "origin"     
##  [6] "destination" "origin_x"    "origin_y"    "dest_x"      "dest_y"

The API is cunning: in routesRandomiser.R, all I do is pick a completely random point within a shapefile for Great Britain and it picks the nearest actual address point - as you can see (using knitr’s nice table formatting) from the ‘origin’ and ‘destination’ fields:

knitr::kable(head(routes[,5:6]))
origin destination
11 George Street, Hintlesham, Ipswich, Suffolk IP8 3NH, UK Cholderton Road, Andover, Hampshire SP11, UK
B3223, Exmoor National Park, Minehead, Somerset TA24, UK Unnamed Road, Lauder, Scottish Borders TD2 6PU, UK
B5027, Uttoxeter, Staffordshire ST14 8SG, UK Unnamed Road, Ystrad Meurig SY25 6ES, UK
New Barn Drove, Warboys, Huntingdon, Cambridgeshire PE28 2UB, UK Unnamed Road, Ripon, North Yorkshire HG4, UK
Gibbet Lane, Market Rasen, Lincolnshire LN8 3SD, UK Unnamed Road, Pitlochry, Perth and Kinross PH9 0PA, UK
Unnamed Road, Charing, Ashford, Kent TN27, UK Wolverstone Hill, Honiton, Devon EX14 3PU, UK

So we know we’re getting actual random network routes. We’ll be using origin/destination lat/long coordinates (origin_x / origin_y and dest_x/dest_y) and, of course, distance - which needs converting to kilometres to match the upcoming sp function:

routes$distance <- routes$distance / 1000

For ease of reading, let’s subset origin and destination coordinates.

origins <- data.matrix(subset(routes, select = c(origin_x, origin_y)))
dests <- data.matrix(subset(routes, select = c(dest_x, dest_y)))

And I’ll be adding the resulting great-circle distances back into ‘routes’, so:

routes$spDist <- 0

Now for the bit doing the work. The spDistsN1 function is actually set up to find a full vector of distances between a single point and a bunch of other points. I couldn’t find a tidier way to persuade it to give me distances between a sequence of pairs. Here’s what I managed: cycle through each of the ‘origins’ and use that as the matrix of points it wants (it just happens to be a single point in this case, but it needs to be passed in as a matrix). The destination can then just be the single point. ‘longlat=TRUE’ returns kilometre great-circle distances.

for (i in 1:nrow(origins)) {
  
  #get a single pair and turn it into a matrix for spDistsN1
  #(it needs transposing too - R matrix/dataframe orientation confuses me; if in doubt, hit it until it works)
  orig <- t(data.matrix(origins[i,]))
  #spDistsN1: the first arg must be a matrix of points, so we supply a single-row matrix.
  #The second arg must be a single point.
  #If the first arg had more rows it would find the distance from each of those to the single point,
  #but we only want each origin-destination pair distance.
  #http://rpackages.ianhowson.com/rforge/sp/man/spDistsN1.html
  routes$spDist[i] <- spDistsN1(orig, dests[i,], longlat=TRUE)
  
}
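(As an aside: the loop above could probably be avoided entirely. Here’s a vectorised great-circle sketch of my own - a plain haversine, not part of the original script, so its output ought to be sanity-checked against sp’s before relying on it.)

```r
# Vectorised haversine distance in km between paired lon/lat points.
# A sketch - an approximation on a spherical Earth of radius 6371 km,
# not a drop-in replacement for sp's own calculation.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# One call over the whole coordinate columns would then replace the loop:
# routes$spDist <- haversine_km(routes$origin_x, routes$origin_y,
#                               routes$dest_x, routes$dest_y)
```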

Right - we have our straight(ish)-line distances to compare to Google’s road network distances. Route factor ahoy! I’m also going to find the route factor’s reciprocal (‘route-factor-over-one’), which will make sense in a moment.

#The ratio means "how much further is the route distance than the Euclidean/great-circle one?"
routes$rf <- routes$distance/routes$spDist
#"How much shorter is Euclidean?"
routes$rfoverone <- routes$spDist/routes$distance

The route factor itself makes a decent graph, but its reciprocal is much more amenable to a histogram if we’re asking “how much shorter is Euclidean distance?” Thusly:

hist(routes$rfoverone, breaks=30)
abline(v=mean(routes$rfoverone),col="red")


So - how does circuity change with the size of journey taken? To find out, let’s make a function that will stick the distances into quantile bins - each containing roughly the same number of routes - so we can easily show the difference that the number of bins makes. cut2 does the binning, providing an index we can use.

rfbins <- function(data, bins) {
  
    data$distbins <- as.numeric(cut2(data$spDist, g=bins))
    
    #vectors of zeroes for the means
    distmean <- rep(0,bins)
    rfmean <- rep(0,bins)
    
    rfhypo <- data.frame(distmean, rfmean)
    
    #Use these bins to get average distance...
    rfhypo$distmean <- tapply(data$spDist, data$distbins, mean)
    #... and average route factor
    rfhypo$rfmean <- tapply(data$rf, data$distbins, mean)
    
    return(rfhypo)
    
}

This function gives us the average route factor in each of a set number of equal-count distance bins, using the average distance in each to give us matching row numbers (the “g” argument to cut2 splits into quantile groups of roughly equal size). Let’s see what ten bins looks like to start with:
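(For anyone without Hmisc to hand, the same quantile-binning idea can be sketched in base R - toy numbers here, not the routes data:)

```r
# Pretend great-circle distances in km - invented for illustration
x <- c(5, 12, 30, 31, 48, 70, 95, 150, 240, 400)

bins <- 2
# Break points at evenly spaced quantiles, so each bin gets roughly
# the same number of observations - the same idea as cut2(x, g = bins)
breaks <- quantile(x, probs = seq(0, 1, length.out = bins + 1))
idx <- as.numeric(cut(x, breaks = breaks, include.lowest = TRUE))

table(idx)  # five values in each bin
```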

rfhypo <- rfbins(routes, 10)

plot(rfhypo, xlab="distance", ylab="mean route factor")
lines(rfhypo, col="green")


Boom! Circuity is higher for shorter distances. For higher bin numbers, though, the drop-off isn’t quite so smooth. I suspect this is more likely to be due to the number of route samples than the underlying pattern.

rfhypo <- rfbins(routes, 30)

plot(rfhypo, xlab="distance", ylab="mean route factor")
lines(rfhypo, col="green")


A few thoughts to end on:

  • In comparison to some of the circuity numbers quoted in that wikibooks entry, this route factor looks rather high. It makes me fret about coordinate systems. I’m pretty sure the original data is lat-long, but… yes, worth a sanity check. The article does mention two points, though:
    • “The measure has also been considered by Wolf (2004) using GPS traces of actual travelers route selections, finding that many actual routes experience much higher circuity than might be expected.”
    • “Levinson and El-Geneidy (2009) show that circuity measured through randomly selected origins and destinations exceeds circuity measured from actual home-work pairs. Workers tend to choose commutes with lower circuity, applying intelligence to their home location decisions compared to their work.”
  • The randomness in the Google matrix API routes will also mean many short distances in non-urban areas, where I’d expect circuity to be much higher (something that should actually be easy enough to test). Though that wouldn’t explain it bottoming out at a rather high ~1.35.

One last graph: the distribution of circuity in each of the bins shows a little more of what’s going on, certainly at shorter distances. Let’s add the bins directly to the original routes data to see:

bins <- 10

routes$distbins <- as.numeric(cut2(routes$spDist, g=bins))

ggplot(routes, aes(factor(distbins), rf)) +
  geom_boxplot()


Whoo - some outliery craziness going on there. A route factor of more than six, you say? Ah, that’d be this. Despite having asked the API for “avoid=ferries”, it appears to be taking some routes across the sea by car. So a closer look at the data is perhaps required. Restricting random routes to England and Wales might also be an idea.
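A quick way to pull those sea-crossing suspects out for inspection might look like this - a sketch on a toy stand-in data frame (the column names match the real one, the numbers don’t), with an arbitrary route-factor cut-off of 2.5 that’s a guess rather than anything principled:

```r
# Toy stand-in for the routes data frame
routes_toy <- data.frame(
  origin      = c("A", "B", "C"),
  destination = c("X", "Y", "Z"),
  rf          = c(1.3, 1.6, 6.2),
  stringsAsFactors = FALSE
)

# Flag anything with an implausibly high route factor for a closer look
suspect <- routes_toy[routes_toy$rf > 2.5, ]
suspect$origin  # the route(s) to inspect for sneaky sea crossings
```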

So there we are, first rmarkdown experiment over. That was quite pleasing if rather time-consuming. The process certainly couldn’t be easier though. (I’ve said that before attempting to transfer to Jekyll. Let’s see how that goes…)

Jekyll note: as I didn’t have one, I nabbed this syntax highlight stylesheet via here.

Nov 26, 2014 - Money flows in the UK

Comments

This is one of the fun things I coded up in the process of developing the last grant I worked on. I’ll explain a bit about it and then share some thoughts on whether it’s any good as a visualisation. There’s a sharper HD version of this video here and a dist.zip file on the github page if you want a play.

Your standard input-output table takes a bunch of economic sectors and, in a matrix, gives the amounts of money flowing between each of them. For the UK, we’ve got ‘combined use’ matrices that include imported inputs moving between sectors, as well as domestic-use-only matrices that exclude imports. (These two work with different types of prices, though, so they’re not directly comparable.)

This is the boiled-down version of the data I use, from the first data link above: the 2012 combined use matrix. Github gives you a scroll bar at the bottom to view the whole CSV file. The sector names are only in the first column, but they also apply to each column heading along the top. So, for example, the first number column starting with 2822: this is what ‘agriculture, hunting, related services’ spends on other sectors. So the first value is what agriculture spends on itself (it’s in millions of pounds; the matrix diagonal gives the amounts each sector spends on itself.) This is a tip from Anne Owen that’s always helped me: think of each column as a receipt of what that sector has bought. So summing the receipt gives you that sector’s total consumption. Summing each row gives you its total demand - how much others buy from it.
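The receipt idea is easy to sketch on a toy matrix - three pretend sectors with invented numbers (the 2822 on the diagonal echoes the agriculture figure mentioned above; the rest are made up):

```r
# Toy 3-sector use matrix, GBP millions. Rows sell, columns buy:
# entry [i, j] is what sector j spends on sector i.
sectors <- c("agriculture", "manufacturing", "services")
use <- matrix(c(2822,  500,  300,
                 400, 9000, 2500,
                 150, 3000, 8000),
              nrow = 3, byrow = TRUE,
              dimnames = list(sectors, sectors))

# Column sums: each sector's 'receipt' - its total spend on inputs
colSums(use)
# Row sums: each sector's total intermediate demand - what others buy from it
rowSums(use)
# The diagonal: what each sector spends on itself
diag(use)
```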

The visualisation shows what this matrix looks like if you stick it into a force-based layout and make each money flow a moving circle. The live version is interactive, allowing you to explore sector connections.

So: any good as a visualisation? Before I’d produced it, I would have said, mmmm - not really. It’s fun to play with but doesn’t really convey information. It does manage to give a quick overview of the relative size of sectors and how much money moves between them, but you can’t ask it any useful quantitative questions. I’ve since learned a lot more about the internal structure of these IO matrices using R - perhaps that’s something I’ll come back to. I have also coded a ‘random walk centrality’ test (that code is in the source files, though it’s turned off at present) - so it’s certainly possible to use the network structure to do some analysis.

Something unexpected happened with this visualisation, though. It engaged people. Prior to this, I probably wouldn’t have thought that was an important thing but, looking back, having something like this that’s able to draw someone in - that’s turned out to be very useful. One of my colleagues used it in a tutorial and apparently they were really taken by it.

That kind of initial hook can be enough to make someone want to find out more. That’s been a useful lesson for me. If I were drawing up a list of criteria for successful visualisations, this one’s made me think of adding ‘engagement value’ or ‘hook power’ or somesuch. This IO viz has plenty of that. I think it manages to give an impression of the economy as a whole that would otherwise be hard to see. (Though there are reasons to distrust the picture it paints: it tells you construction is by far the biggest sector - but it only became so in 2013, when ONS took three separate construction sectors and combined them.)

But another visualisation criterion should, of course, be ‘does it communicate information effectively?’ This doesn’t manage so well. Perhaps the ideal is to maximise communication/information/hookiness. Perhaps there’s a trade-off there too - making something that might initially make a person go ‘wooooo’ will probably mean, after a few minutes, they’ll realise it’s a bit meaningless.

Even so: prior to this, it would never have occurred to me that hookiness could be useful in itself. For the grant, this viz helped me say: “look, these are the money flows moving in the UK. We want to know where in the UK they move”.

This is also a good example of why I still like Java. There’s a lot of work going on there - it would likely run unusably slowly in javascript. This takes us straight back to the ‘wooo/information’ trade-off though. One might argue the computationally intense stuff it’s doing is useless for conveying information - and including it, insisting on a more powerful codebase, is cutting it off from an easily accessible home on the web.

Nov 17, 2014 - *tap tap*... is this thing working?

Comments

Yay! I’ve got myself jekyll’d up and am githubified. I wish I’d got my noggin around git a little earlier - it would have helped the progress of the PhD modelling no end (thesis and models available here; explanation of the PhD’s origins here). I’ll be adding more user-friendly chunks of code and blog posts from the thesis.

As well as discovering how git makes my life easier when coding/writing (I have branches for the code behind every figure, yay!), I’m having a go at using github to bring everything together and make it visible in one place. Whether in or out of academia (currently out!), I’m hoping it’ll be a good way to interact with others on the projects I want to pursue.

Visualisation has been at the heart of a lot of my coding through the PhD and beyond. It was partly through visualising and interacting with my agents in the PhD model that I learned to understand its dynamics, in a way that led directly to more robust mathematical insights. I also discovered that visualisations I found useful as a modeller didn’t necessarily translate into good figures for communicating to others. Quite the opposite, in fact.

I discovered this the hard way: all the thesis figures needed re-doing post-viva. The paper-friendly outputs I now have are both interactive and much better on the printed page. (For a start, I spent a lot of time making decent black-and-white outputs, to avoid the ‘these look good in colour if you look at the online version’ problem). Most importantly, I’ve got a better handle on how to achieve these different visualisation aims.

That whole process led me to ponder the nature of visualisation: here’s a post of mine on the topic, where I wonder about the difference between a box-plot and an x-ray. I think my own original PhD visualisations were closer to x-rays than boxplots: images that I’d learned to interpret through intense interaction and feedback over a long time. Good for my understanding - utterly, hopelessly useless for communicating with others.

It was fascinating to see the same process at work recently in Christopher Nolan’s Interstellar: physicist Kip Thorne gained new insights into the workings of black holes when Nolan’s FX team used his maths to create a visualisation. It produced some light effects that were initially thought to be a coding error. In fact, they’d done such a good job implementing Kip Thorne’s maths, it was showing something he’d not predicted - but emerged naturally from his work. He plans to get two papers off the back of the discovery.

‘Visualisation’ is a bit of a hideous word. The aim is to show - or to provide a path for someone’s mind to dig into the nature of something in a way they otherwise would struggle to. Bill Phillips’ MONIAC is a stupendous, pre-computer example of this. Tim Harford introduces his latest economics populariser with the machine - you can read most of this via Amazon’s ‘look inside’ option. (Harford also tells something of Phillips’ incredible life story; hear more about this, and see the machine in action, in this Cambridge University video, using a MONIAC machine restored by Allan McRobie). In Harford’s telling of the tale, the amused derision of other economists, faced with this pipe-and-water contraption, quickly turned to amazement on seeing the thing function. It’s easy to forget: prior to this machine, the economy as a system had never been ‘visible’. Some, perhaps, had mental models of superior power - but even then, I’d argue there’s nothing quite like the tactile experience of feedback to get under the skin of a model.

The MONIAC machine seems to have become a little more talked about since the crash. Harford uses it to set the stage for macro-economics. Diane Coyle, on the other hand (in GDP: a brief but affectionate history) uses MONIAC as an exemplar of the ‘engineering mindset’ - a physical manifestation of the “illusion of precise control” (p.21). Phillips apparently did later work on the possibility of actually damping economic oscillations - the pursuit of precise control was certainly on the minds of economists wanting to avoid another Great Depression. But this nicely illustrates how models polarise views about our ability to understand society, let alone steer it. (And as I mention here, the ‘engineering mindset’ Coyle identifies is, to some of an Austrian bent, a slippery slope to totalitarianism).

MONIAC’s users were aware of its limitations (as Newlyn said, “hydraulics is no substitute for economics”) - but there’s a question here, one that I haven’t answered for myself satisfactorily, about how people and society are shaped by the models they use. My favourite self-aware modeller, Paul Krugman, is very clear in his views:

Whenever somebody claims to have a deeper understanding of economics (or actually anything) that transcends the insights of simple models, my reaction is that this is self-delusion. Any time you make any kind of causal statement about economics, you are at least implicitly using a model of how the economy works. And when you refuse to be explicit about that model, you almost always end up - whether you know it or not - de facto using models that are much more simplistic than the crossing curves or whatever your intellectual opponents are using. Think, in particular, of all the Austrians declaring that the economy is too complicated for any simple model - and then confidently declaring that the Fed’s monetary expansion would cause runaway inflation.

He goes on to show that didn’t happen. But he’s also well aware that models shape the way one views reality - and, as I always go on about, J.C. Scott is right to say that if the viewer has power, as governments do (or corporations, for that matter) - models shape their view of reality and then reality is shaped by their actions.

Which is all a rather long way from the piffling visualisation code I’ll be sticking up on the github page - but I hope I’ve made the point that how we use and understand models and visualisation is important. It’s only going to get more important too. Perhaps finance houses will soon have traders hooked up to Oculus Rifts doing the full Neuromancer, linking with their AI systems centaur-fashion. Bloomberg are already producing mock-ups of the first basic step - giving them access to a huge array of virtual, leap-motion controlled terminals.

Back to the mundane matter of actually coding: pretty much everything I’ve done so far has been in Java (though I’m getting the hang of R and it’s really rather good!) with the visualisation working through the Processing libraries, building in Netbeans. I need to get out into the messier, noisier world of web development and javascript. Until relatively recently, it was still easy enough to showcase Java through applets on the web - this has now become all but impossible. Java’s still got plenty of life in it, of course - but it’s not much use if you want to present via the web.

That’s deeply annoying. I can show a bunch of screenshots and ask people to download and run a jar. I used to just provide a link. Showcasing Java visualisations, then, is now an official pain in the arse. I still plan to use it - for some of the things I’d like to do, its speed still wins. But for communicating? For some projects, maybe - but not on the web. Hopefully this blog will help me change that. Processing (the Java-based graphics package I use a lot) does have an option for exporting as javascript (here’s an example of one of mine running in the browser) but I’m a long way from feeling as comfortable with web development generally as I do with Java.

Nowt wrong with being outside yer comfort zone, though, eh? So let’s see what happens…