Introduction

In this workshop, we’re going to work through a typical day of data wrangling and visualisation using R. The day is split roughly into two:

  1. We’ll learn all the essential R concepts needed to work with data by walking through a typical script creation scenario: getting our data, processing it, deciding what we want to do with it. Starting with RStudio, this will take us through to producing our first visualisation using ggplot. For this part of the day, I’ll be working through the material with you step by step, carrying out exactly the same tasks.

Here’s a quick overview of exactly what we’ll cover:

  2. The second (probably shorter) part will be more free-form. There are four options to choose from in this half of the day. You’ll get to work through whichever you like, with help on hand (and you’ll get a break from me talking the whole time). If there’s anything else you want to have a go at, let me know and I’ll try to help.

We’re going to try and fit a lot into one day: don’t worry too much if not all of it makes sense, as long as the overall picture does. I’ll try and make sure that everyone knows which bits they really need to have taken in.

Because we’re working through one particular scenario, we’ll only scratch the surface of what R, dplyr and ggplot can do. But it should be a solid foundation for further exploration.

We’ll check in after every chunk of work to make sure everyone’s OK with where we’ve got to. Each part of the day builds on the one before, so it’s important that it makes enough sense that you don’t get lost as we move on. Please do ask if anything is confusing. Any new programming language, even for experienced programmers, is always confusing. I learned Java before R: that didn’t stop R being baffling for a long while.

To maximise the benefit from today, try to find some time soon after to work on your own data. Find a few days for this, if you can. You will consolidate the information from the course much better if you work through problems that are relevant to your own work.

Our data for the day: Land Registry ‘price paid’ data on house prices in England. You each have a USB datastick with a different city to work on, as well as data for London. All your work and outputs will be saved to this USB stick: take it when you go.

All of the data for the workshop is open access: you can download it yourself for free. I’ve provided links and notes at the end of this document for that.

Some random tips before we get started properly:

Right! Let’s get started.


RStudio

A first look at RStudio

RStudio is a self-contained environment for doing everything in R. As we’ll see throughout the day, it does a heap of different things to make programming in R as painless as possible.

First thing - let’s open it and have a look at what’s there. You should see something like the following:

RStudio presents you with these panes:

  • The console pane where we can enter commands and run them immediately.
  • The environment pane: all our data and variables will appear here.
  • A pane with a number of tabs, currently open on files. When we first produce graphics, these will appear here in the plots pane.

Entering commands in R

Before we do anything else, let’s get a feel for how to enter commands in R.

We’ll use the console to do this. The console will execute anything we put there as soon as we press enter.

R will return anything you put in. Try some of these or something different just to get a feel for it, pressing enter after each. Any maths will be evaluated and returned.

80
80+20
80*20
80/20
'This is some text'
sqrt(100)

That last one - sqrt(100) - is a function. Functions are in/out machines: in this case, stick 100 in and get its square root out. Anything you put inside the round brackets after a function’s name is going into that function. You can stick anything in there that will evaluate, so we could also have done:

sqrt(45+55)
## [1] 10
sqrt(sqrt(16))
## [1] 2
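
A few more in/out machines to try, as a quick sketch. abs(), round() and nchar() are all base R functions, but they are my choice of illustration rather than anything we rely on later in the day:

```r
# abs() takes a number in and gives its absolute (non-negative) value out
abs(-16)
## [1] 16

# round() rounds to the nearest whole number by default
round(3.7)
## [1] 4

# nchar() counts the characters in a piece of text (spaces included)
nchar('This is some text')
## [1] 17

# Nesting: abs(-16) is evaluated first, then its result goes into sqrt()
sqrt(abs(-16))
## [1] 4
```

The last line works from the inside out, the same way sqrt(sqrt(16)) did.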

Assigning to variables

Everything, whether it’s a single number, a list of numbers, a piece of text, an entire dataset or a graph, can be assigned to a variable.

Variable names should be a balance between brevity and clarity: you want it to say something that will make sense to you when returning to the code a month later, but it also needs to be typeable. (Although on that point, as we’ll see, RStudio removes a lot of the pain of typing for us.)

Here’s some examples to try in the console.


R’s assignment operator is a ‘less than’ followed by a minus: <-

Typing both of these the whole time can be a pain - so RStudio has a handy shortcut key:

  • ALT + ‘minus’

Have a go at using this shortcut key when assigning these examples:


city <- 'London'

max <- 150
min <- 45
range <- max - min

You’ll see when assigning to variable names in the console, it doesn’t automatically output what’s just been assigned. You will, however, see them appear in the environment pane on the right: your new variables are there, under ‘values’.

You can also see the assignments and results by typing the variable names, as we were doing with simple values before:

city
## [1] "London"
min
## [1] 45
max
## [1] 150
range
## [1] 105

And of course we can then use the variable names in functions:

sqrt(range)
## [1] 10.24695
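
One thing worth trying at this point: a variable holds the value it was given at the moment of assignment, and doesn’t update itself if the variables it was built from change later. A small sketch, reusing the same variable names (the value 200 is just made up for the example):

```r
max <- 150
min <- 45
range <- max - min

# Change max: range still holds the value calculated earlier
max <- 200
range
## [1] 105

# Re-run the calculation to pick up the new value of max
range <- max - min
range
## [1] 155

# Put things back how they were so the rest of the examples match
max <- 150
range <- max - min
```

This is part of why running lines in the right order matters once we move to a script.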

We’ll get on to putting all this into a reusable script in a moment.

Opening today’s project in RStudio

RStudio projects are self-contained folders that keep everything for a particular project together in one place.

You’ve each got a USB data stick - this contains everything for today. Once you’ve opened RStudio, the first thing to do is open the project directly from the USB stick. By doing this, all of your work today will be saved to the stick and you can take it away with you.

Access RStudio’s project loading dialogue in the top right. It should currently just say ‘project (none)’. Click to get a range of project options. You can tell RStudio that an existing folder should become a project - but we’re opening one that’s already there, so choose open project:

Navigate to the data stick, open the folder and double-click the .Rproj file.

You may not immediately see much change - but now the whole workspace will be saved together with the project. As we’ll see below, now we’re in an RStudio project, we don’t have to worry about where the working directory is: RStudio sets it automatically to our project folder. As long as we’re working in that folder, there’s no need to mess around with the full path to a file. (It also allows us to move RStudio projects and give them easily to other people to use.)
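
If you want to check where the working directory currently is, something like the following should do it. getwd() and list.files() are base R functions; the variable names are just ones I’ve picked, and the output will depend on your own machine, so none is shown here:

```r
# Where is R currently working? In an RStudio project this
# should be the project folder
workingFolder <- getwd()
workingFolder

# List the files R can see in that folder, without needing a full path
filesHere <- list.files()
filesHere
```

If the first result isn’t the project folder on your data stick, the project probably hasn’t been opened.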

Note that in the top right you can now see the project name.

Creating a new script and running code in it

We’ll program all of today’s work into a single script file in RStudio. To get a blank script to start with, go to:

  • File / New File / R Script

Note how the menus tell you what shortcut key to use for a lot of actions. In this case we can see Ctrl+Shift+N will also create a new script.

The script opens in a new tab in a pane above the console. Currently it’s just named Untitled1 (on the tab itself, at the top). Click in the script pane to move the cursor there.

Start scripting! Just add a couple of lines, whatever you like, similar to the commands we entered into the console. Note: unlike the console these lines won’t run until we tell them to run. So just add whatever you want to add over a few lines, pressing enter to move to the next one. We’ll run the lines in a moment.

Once you’ve added something, the name (currently Untitled1 still) will turn red: you can now save it.

Save the new script either via the menu (File/save) or CTRL + S. Give it whatever name you like. Note that it will save in the top level of your project folder. RStudio will give it the extension .R.

Now we can run our first few commands. In a script, we have a few ways to choose, depending on what we want to do.


A quick note about the programming philosophy of R. Where some programming languages are all about writing entire programs, compiling them and running them as a whole, start to finish, working with R is much more iterative and experimental.

R can run entire programs - that’s all the libraries are, after all - but it’s also designed for experimenting with and exploring data on the fly. We’ll be doing a lot of this in the workshop today.


OK, so here’s three options for running your script:

  1. To run a single line of code (as we were doing in the console): place your cursor anywhere on the line you want to run, then press CTRL + R or CTRL + ENTER. Both do the same, so use whichever works for you. You’ll see the line echoed in the console along with its output, same as when we ran it directly in the console.
    • There is also the option of using the run button at the top right of the script pane, but that’s generally more faff than using the keyboard.
  2. To run multiple lines of code: highlight more than one line of script, as you would highlight text in any text editor or word processor. Then, as before, use the keyboard ‘run’ commands, either CTRL + R or CTRL + ENTER.
    • You can highlight the text with the mouse, or use the keyboard. If using the mouse, you can also use the mouse wheel to scroll the script while highlighting.
    • If you’re not familiar with keyboard shortcuts for highlighting: hold down shift then use the up and down arrows to highlight a row at a time.
  3. To run the entire script: select everything with CTRL + A (or right-click and select all), then run it in the same way. Personally, I never ever do this, which is why I’ve left it until last. When you save your RStudio workspace in the project, it will save all of the variables and progress you’ve made so far, so there is usually no reason to run an entire script from scratch every time you start work.

If you’re coming back to this later or lose the datastick, or are running through the workshop on your own, the project can also be downloaded from either of these:

  • Clone this Github page:
  • Download and unpack this zip file:

‘But will it make sense in two months’ time…?’ Using comments and sections

Code you write today may not make the slightest bit of sense in the near future. This happens to all programmers. So it’s absolutely essential to take some steps to make life easier for yourself by making sure the code is readable and clear. Leave plenty of space in your code wherever you can for a start.

But the most essential way to make sure it makes sense:

  • Comment all your code clearly
  • Don’t ever say to yourself, ‘oh, this will make sense later’.
  • So comment all your code clearly!

R uses the hash symbol (#) for comments. So for your first few script lines, you can add a comment or two like this (obviously, make your comments match the script you wrote!):

#Using the assignment operator
city <- 'London'

max <- 150
min <- 45

#Finding the range by subtracting min from max
range <- max - min

RStudio also provides a hugely useful comment feature:

  • Adding four dashes to the end of a comment automatically makes it a section. Try it - add four dashes to your first comment.

#Using the assignment operator----

As you add the dash, you’ll see a little down-pointing triangle appear on the left. Also, at the bottom of the scripting window, the same phrase should have appeared. Once we add more sections, you can click in that area to move between them.

  • Shortcut key tip: ALT + SHIFT + J brings up the section menu without having to click on it. (At the moment we’ve only got one section, mind.)
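
As a rough sketch of how a script looks once it has more than one section, here are the same lines as before split in two. The second section name is just one I’ve made up for the example; call yours whatever makes sense:

```r
#Using the assignment operator----
city <- 'London'

max <- 150
min <- 45

#Finding the range----
#Subtract min from max
range <- max - min
range
## [1] 105
```

With both section comments in place, both names should show up in the section menu at the bottom of the script pane.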

Some people also like to make their sections more visibly distinct by doing something like the comment below. It’s a matter of personal taste - whatever helps you keep your code clear.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#Using the assignment operator----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’ll look at a few other keeping-things-readable ideas as we go through the day.


It’s all about the libraries

Commands already built into R (like the sqrt function we used above) are known as the base commands. But all the really awesome stuff in R comes from libraries. If there’s something you think you need to do in R, someone has most probably already built a library to do it.

For today’s workshop, we’ll be (mostly) using a set of libraries developed by one person: Hadley Wickham. Hadley’s designed these libraries with an underlying data philosophy - tidy data - so that as much as possible there is one clean, standardised way of doing things.

We’ll use these libraries to get us from loading and organising data right through to visualising it in the ggplot library.

If a library isn’t already installed, you can install it with the following. (This is the first library we’ll be using, in the next section.) Do this in the console, not in your script, as it only needs running once.

install.packages('readr')

Once the library is installed, you can load it ready for use with the following. This time, put this right at the top of your script, before everything else as you’ll want to load these libraries every time you come back to the script. We’ll add more libraries there shortly.

library(readr)

Notice, just to be awkward, that when installing libraries, the name needs to be in quotes but when loading it, it’s not.

As we’ll see shortly, readr gives us a nicer way of loading CSVs than base R provides.
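
If you’re not sure whether a library is already installed, one way to check before loading it is sketched below. requireNamespace() is a base R function; the readrInstalled variable name is just mine. This version deliberately doesn’t install anything itself, it only tells you what to do:

```r
# requireNamespace() gives TRUE if the library is installed and can be
# loaded, FALSE if not
readrInstalled <- requireNamespace('readr', quietly = TRUE)

if (readrInstalled) {
  # Installed: load it ready for use
  library(readr)
} else {
  # Not installed: leave a reminder rather than installing from a script
  message("readr is not installed: run install.packages('readr') in the console")
}
```

For today the libraries should already be in place, so a plain library(readr) at the top of the script is all you need.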

A quick look at vectors (boring but essential for later)

In R, a vector is just a collection of numbers, characters or other R object types. This will not seem very relevant right now, but it’s really worth paying attention to. Understanding how R uses vectors helps massively with a lot of more complex activities we’ll be looking at later. Equally, not understanding how R uses them can lead to all sorts of confusion. As we go along, we’ll return to the concept of the vector and how R uses it. It’s actually very simple but easy to misunderstand initially.

Here’s how they work. Where before we were assigning single numbers…

bobsAge <- 45

… a vector of numbers just looks like this. When scripting, you indicate it’s a vector with a c, enclose the values in round brackets and separate values with commas.

Make a vector similar to this with six random ages.

everyonesAge <- c(45,35,72,23,11,19)

This should be familiar enough to anyone who’s ever worked with mathematical vectors. And it’s the same structure for any other variable type, like strings.

This time, make a vector of names - pick whatever names you like. Try and split them evenly between male and female.

everyonesName <- c('bob','gill','geoff','gertrude','cecil','zoe')

As before, if you type these variable names, you’ll see the whole vector.


RStudio provides excellent code completion that massively helps with scripting. As an introduction to this: when you start to type either of the two variables we just made, you should see an autocomplete box like this:

You can scroll through these with up and down arrows or use the mouse. Choose one now and press enter - it will be added to the code. Press enter again to run the line.

You can also use CTRL + SPACE to bring up the autocomplete box at other times.


everyonesAge
## [1] 45 35 72 23 11 19
everyonesName
## [1] "bob"      "gill"     "geoff"    "gertrude" "cecil"    "zoe"

You can also access the individual values in a vector by using its index in square brackets after the variable, where 1 is the first entry and - in this case - 6 is the last:

everyonesAge[1]
## [1] 45
everyonesAge[5]
## [1] 11
everyonesName[2]
## [1] "gill"
everyonesName[3]
## [1] "geoff"
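
If you don’t know how long a vector is, base R’s length() function will tell you, and you can use it to reach the last entry. A small sketch using the example ages from above (your own numbers will differ):

```r
everyonesAge <- c(45,35,72,23,11,19)

# How many values are in the vector?
length(everyonesAge)
## [1] 6

# Use that number as the index to get the last entry
everyonesAge[length(everyonesAge)]
## [1] 19

# Asking for an index past the end gives NA (a missing value), not an error
everyonesAge[7]
## [1] NA
```

That last behaviour is worth remembering: R won’t always complain when an index is wrong.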

R also has syntax for giving a range of integers. Type this directly into the console to get all numbers from 2 to 6:

2:6
## [1] 2 3 4 5 6

So we can use this to access values in our vector.

everyonesName[2:6]
## [1] "gill"     "geoff"    "gertrude" "cecil"    "zoe"

Now: an example that begins to show why vectors are so important in R. If we want to access different names in our vector of names, we do the following. Here’s all the men and women separately (note, you’ll have to check your vectors to see what index the male/female members have):

everyonesName[c(1,3,5)]
## [1] "bob"   "geoff" "cecil"
everyonesName[c(2,4,6)]
## [1] "gill"     "gertrude" "zoe"

What happened there? We used another vector of numbers to index our earlier vector. R is built on vectors in this way. As we’ve already seen that vectors can be assigned to variables, we can also do the following:

women <- c(2,4,6)
men <- c(1,3,5)

#Replace the vectors with their variable representation
everyonesName[women]
## [1] "gill"     "gertrude" "zoe"
everyonesName[men]
## [1] "bob"   "geoff" "cecil"

Another way to access the contents of vectors is using R’s boolean values - that is, the values of TRUE and FALSE. If we use a vector of TRUE/FALSE values, we can, for example, mark TRUE for each of the names that is female:

everyonesName[c(FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)]
## [1] "gill"     "gertrude" "zoe"
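
Just as with the numeric indexes, a TRUE/FALSE vector can be assigned to a variable first. The sketch below assumes the example names from above; isFemale is a name I’ve made up, and the ! (‘not’) operator is an extra that flips each TRUE to FALSE and each FALSE to TRUE:

```r
everyonesName <- c('bob','gill','geoff','gertrude','cecil','zoe')

# One TRUE/FALSE value per name, in the same order as the names
isFemale <- c(FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)

everyonesName[isFemale]
## [1] "gill"     "gertrude" "zoe"

# ! flips every value, so this picks out everyone else
everyonesName[!isFemale]
## [1] "bob"   "geoff" "cecil"
```
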

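You can also use T and F as short-hand for TRUE and FALSE in R to save typing, if you want (though unlike TRUE and FALSE, T and F are ordinary variables that can be overwritten, so the full words are the safer choice in scripts):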

everyonesName[c(T,T,F,F,T,F)]

For any vector members where this is TRUE, it returns that value for us. Why should you care about this? Because if we can use TRUE and FALSE to access vector indices, we can ask questions of the vector using logical operators.

For example, if we assume our everyonesAge vector is telling us the age of the people in everyonesName, we can ask, ‘who’s over 30 years old?’:

everyonesName[everyonesAge > 30]
## [1] "bob"   "gill"  "geoff"

What just happened? Nothing more than we just did above - if you put this directly into the console:

everyonesAge > 30
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

That just returns a vector containing TRUE/FALSE values from asking whether the ages in everyonesAge are more than 30. And as before, we can just stick that vector straight into everyonesName to find out who’s over 30.
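
The two steps can also be done separately, which can make it easier to see what’s going on. A small sketch using the example ages and names from above; overThirty is just a variable name I’ve picked:

```r
everyonesAge <- c(45,35,72,23,11,19)
everyonesName <- c('bob','gill','geoff','gertrude','cecil','zoe')

# Step 1: ask the question and keep the TRUE/FALSE answers
overThirty <- everyonesAge > 30

# Step 2: use the answers to index the names...
everyonesName[overThirty]
## [1] "bob"   "gill"  "geoff"

# ...or the ages themselves
everyonesAge[overThirty]
## [1] 45 35 72

# TRUE counts as 1 and FALSE as 0, so sum() tells us how many are over 30
sum(overThirty)
## [1] 3
```
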

This is the full list of logical operators that we can use to ask questions like this. Note especially the double equals: this will allow us to find exact matches for values and text.