The aim of this session is to introduce you to the basics of coding within R. To do this, we will introduce you to the basic syntax, via the use of some simple commands such as how to read in data and plot. We will end with the introduction to methods for ODE solving in R.
In this session, you will
The best thing to do is to read each section and type (or copy and paste) the R commands (in grey boxes) into your own R session to check that you understand the code. All the data to download is in the github repository (links to download any data files are given below when you are supposed to load them).
You can either work in a single R window, or use the user friendly interface RStudio, which allows you to have your saved R code, separate to your executed commands. It also holds the plot, environment and history window. Using RStudio you can keep your model code (in a file called ‘file.R’, where you can replace ‘file’ with any name you like) separate to your executed code in the Console window. In this ‘file.R’ you can store the useful and checked code for future use. For example, you could start a new R file and store in it all the commands below.
Within R code, anything behind a #
is “commented” out and won’t be executed. It is good coding practice to comment your code as you go through.
# This is commented code.
You can use this in your R file to comment what the code is doing in your own words.
The first step when creating a new R file is to specify a working directory. This is the folder on your computer in which you are working. It is likely to be the folder in which the R code is saved (but it doesn’t have to be). Each computer will have a different specified address for the working directory. Upon opening R, you can find out “where you are” using
You need to tell the R session where you want it to work. To do this you use the command to “set the working directory” or setwd()
.
For example, on Gwen’s computer this is:
You can also label this working directory as ‘home’, and have other locations stored for plotting or for getting data. For example
or
The following symbols represent
*
+
-
->
To check if something is equal use ==
, or not equal using !=
.
Below are examples of the main mathematical structures that we will use in R. Copy the below text into your R session and change the numbers so that you can follow what each command is doing.
The basic element is the vector:
You can also construct and manipulate matrices:
m <- matrix(0, 4, 2) # matrix of zeros of size 4 x 2
m # let's look at m
m[1, ] # first row
m[, 1] # first column
m[, 1] <- c(2, 3, 9) # Error - what does it say?
m[, 1] <- c(2, 3) # What happens?
m # as multiple then works
m[, 1] <- c(2, 3, 9, 12) # Assign column 1
m[, 2] <- v # Assign column 2 to be the vector we had earlier
Can you generate a matrix of size 10 x 5, filled with 2s except for the first column that has the sequence 1 to 10?
Simple commands such as ‘sum’ are already encoded in R.
Other types of allocations include:
To find help in R you can use several different commands. For example, if you wanted to know more about the “sum” command you would type:
The first two are equivalent, whilst the last will do a full text search of all R’s help files. This is most useful when you aren’t sure of what the exact function is called in R. In RStudio, there is also the option to search the help tab in the bottom right window. Google is also very useful for finding commands.
Can you use ??
to find out how you would generate binomial random numbers in R?
As R is open source, it is being developed and added to all the time. It is likely that you may need commands that are not in the standard R programme that you have installed.
To get access to these other functions, you need to install ‘packages’. These contain ‘libraries’ of functions, already coded into R, that can perform new or different functions.
There are two ways to do this: 1) In RStudio, go to the “Packages” tab and search for the required package and load it. 2) Type
replacing “packagename” with the name of the package you would like to install. For example, can you install the package deSolve
? We will use this to solve ordinary differential equations later on in this practical.
Once installed you then have to load the package at the start of the R file. This is done using
In order to read in data, the first thing to do is to check that you are pointing R at the right folder by setting the appropriate working directory. To do this make sure you setwd(where_the_data_is)
, to point R to wherever you have decided to store your files.
For this practical, you will first need to download the required data from (in CSV format) from here. Save it in the correct folder and make this your current working directory.
There are many ways to read in data. Here, as we have a ‘.csv’ file we could use either of:
mydata <- read.table("data.csv", header = TRUE, sep = ",") # Comma seperated...
flu_hh <- read.csv("data.csv") # R doesn't like XLS files... likes csv files... (flu in hammersmith hospital)
Here both mydata
and flu_hh
are the same data. Can you check this in R?
The data is read in in the form of a data.frame
. This is a type of data format in R that allows the inclusion of both numbers and strings (letters). There is more information on reading in data here.
There are several useful tools to look at the data we have read in. Copy the below code into your R session and check that you understand all the commands.
flu_hh # All the data - careful if big file!
dim(flu_hh) # how big is it
head(flu_hh) # Top of the data
tail(flu_hh) # Bottom of the data
flu_hh[, 1] # First column
flu_hh$num_inf # or by name (as data.frame). The $ allows you to refer to a specific column.
flu_hh[1, ] # First row
colnames(flu_hh) # Column names
times <- flu_hh[, 1] # Grab the times in the data
times # All the times
times[2] # The second time
How would you go about finding the maximum time in this data? At what time is the number of cases at its maximum? See if you can write code to do this yourself. To do this you may need to use the function which
that searches for the location of matching values. The solution to this task is below.
max_time <- max(times) # Maximum time
# At what time is the number maximum?
max_num <- max(flu_hh[, "num_inf"])
w <- which(flu_hh[, "num_inf"] == max_num) # very useful function which! searches through for matching entries and gives index back
time_max_prop <- flu_hh[w, "time"]
# OR
time_max_prop <- flu_hh[which(flu_hh[, "num_inf"] == max_num), "time"]
To add another column to a data.frame
you simply refer to the name of this new column and assign it to have certain values. For example, to add in a column of percentages:
R has inbuilt useful functions for creating plots. Again, copy and paste the below and check the differences in the output. The command plot
creates a new plot window which can then be added to using lines
or points
. By default plot
will create a plot with points, to alter this use the type="l"
command as shown below.
plot(flu_hh[, "time"], flu_hh[, "num_inf"]) # Points
lines(flu_hh[, "time"], flu_hh[, "num_inf"]) # add lines to the same plot (and vice versa for points)
plot(flu_hh[, "time"], flu_hh[, "num_inf"], type = "l", col = "red", xlab = "Time",
ylab = "Number") # Label axis
You can also alter the size of the axis to focus on certain areas. For example, use the max_prop
from above to focus on the increase in the outbreak only.
plot(flu_hh[, "time"], flu_hh[, "num_inf"], type = "l", col = "red", xlab = "Time",
ylab = "Number", xlim = c(0, time_max_prop))
In order to save plots, again you need to check you are in the correct working directory. Then open a file and insert the plot like this:
jpeg("flu_hh.jpg") # or pdf etc.
plot(flu_hh[, "time"], flu_hh[, "num_inf"], type = "l", col = "red", xlab = "Time",
ylab = "Number") # what do you want to go into the file
dev.off() # magic plot command - if things break do this until error
You can also plot multiple plots together. For example, using the par
command, you can divide the window and then place the plots within it.
pdf("all_plot.pdf", height = 8, width = 8)
par(mfrow = c(2, 2)) # if want to plot several plots on to one window in rows x column formation e.g. here 2 x 2
plot(flu_hh[, 1], flu_hh[, 2]) # Points
plot(flu_hh[, "time"], flu_hh[, "num_inf"]) # Points
lines(flu_hh[, "time"], flu_hh[, "num_inf"]) # add lines to the same plot (and vice versa for points)
plot(flu_hh[, "time"], flu_hh[, "num_inf"], type = "l", col = "red") # Red line
plot(flu_hh[, "time"], flu_hh[, "num_inf"], type = "l", col = "red", xlab = "Time",
ylab = "Number") # Label axis
dev.off() # stops the par command - one window per plot
dev.off() # stops the pdf command
In order to output data, there is a write equivalent to read (again make sure you are in the appropriate working directory):
write.table(flu_hh, "flu_hh.txt", sep = "\t") # Save as tab delimited
write.csv(flu_hh, "flu_hh.csv") # or as csv
A quick note here on “not a numbers”. In R these are NA
or NaN
. This is usually an error or the result of a silly division.
Conditionals are built in functions that allow you to compare and contrast elements in R. For example, the which
function gives you the index of values that match some condition:
which(flu_hh$time > 60) # index of values that match some condition
which(flu_hh > 60) # careful if matrix
flu_hh[19, ] # Problem
which(flu_hh == 50, arr.ind = TRUE) # if want row and column
which(flu_hh > 60, arr.ind = TRUE) # if want row and column
if
statements allow you to compare elements
if (2 > 3) {
print("World gone mad")
} # Use for error checking. Print does exactly that - gives output into console
or check that data had been read in ok:
or check that what you are calling a proportion sums to 1:
flu_hh$prop <- flu_hh$num_inf/sum(flu_hh$num_inf)
if (sum(flu_hh$prop) != 1) {
print("Error: proportion sums to greater than 1")
} # != means 'does not equal'
if
statements can be combined (as in Excel) with else statements. These can be simple:
x <- -1
sqrt(ifelse(x >= 0, x, NA)) # Only take the square root if x is positive
x <- ifelse(mean(flu_hh$prop) > 0.5, max(flu_hh$prop), 0) # only assign a maximum to x if it is bigger than 0.5
or more complex:
if (max(flu_hh$time) < 30) {
mean_flu_hh <- mean(flu_hh$num_inf)
print("< 30")
} else {
w <- which(flu_hh$time < 30) # Which are those < 30 days, most recent data perhaps
mean_flu_hh <- mean(flu_hh[w, "num_inf"]) # mean of only those values
print("> 30")
}
Some useful commands in R to help with checking code such as the if/else statement above are:
INDENT: “Command + I” on OS X, “Ctrl + I” on Windows. Highlight all code and hit enter. Helps find mistakes and lays out code nicely
COMMENT: “Command + C” on OS X, “Ctrl + Shift + C”. As above but makes all code selected commented. Useful if you want to comment out a load of code and run without it temporarily.
For those of you who have coded before, you will know the importance of for
loops. For those who are new to coding, these structures are very useful in allowing you to repeat a set of commands multiple times to different elements.
For example, you can use for
loops to generate the two times table
Or to fill a vector with the first 50 Fibonacci numbers:
nf <- 50 # a number that we can then change easily that is used several times later.
# Tip: easier to put in one place than change multiple
fibo <- matrix(0, 1, nf) # Empty vector to fill
fibo[1] <- 0
fibo[2] <- 1
for (i in 3:nf) {
fibo[i] <- fibo[(i - 2)] + fibo[i - 1]
mm <- i/2 # what will happen to mm over time? be careful not to overwrite things
}
plot(seq(1:nf), fibo, type = "l")
In the above example, we built an empty vector to fill within the for
loop. However, we may not know the end size of the resulting vector and so in this case can instead concatenate:
Functions are useful ways of packaging up pieces of code. If you are going to do a set of commands multiple times then it can be simpler and cleaner to place them in a function that has been thoroughly checked and that you have not altered subsequently.
To build a function you use the syntax
Within the ‘code’ section you need to include the main coding information, which uses the inputs and also ‘returns’ an output.
For example, if Gwen were to build her own function for calculating the mean she might do:
gwen_mean <- function(vector, times) {
sum_vector <- sum(vector)
length_vector <- length(vector)
gwen_mean <- sum_vector/length_vector
gwen_mean_power <- gwen_mean^times
return(list(gwen_mean = gwen_mean, gwen_mean_power = gwen_mean_power))
}
To run the function you need the function name and the correct inputs:
You can store all the functions that you have in one big .R file and load them into your current R file using the command source
.
Can you understand what the following function does?
find_day_many <- function(matrix) {
u <- unique(matrix$ward) # What wards are there?
days <- c() # store in here
for (i in 1:length(u)) {
w <- which(matrix$ward == u[i]) # find which rows of the data are for this ward
r <- which(matrix[w, "number"] > 1) # find which rows for this ward have more than 1 person
days <- c(days, matrix[w[r], "time"]) # when are there 2 or more infected?
}
return(days)
}
Does this give you the output you expected? You will need to use some of the commands for looking at data that were introduced above. The file ward_data.csv
can be found here.
In order to write and solve ODEs, you need to learng the basic ode
syntax in R and load the package that we installed earlier (deSolve
). Remember that to do this you need to type:
The package deSolve
includes a function ode
that solves ODE equations. This function needs certain inputs. Have a look at
and see if you can understand what the required inputs are.
For this part of the practical, we will build and solve our own SIR model. We will compare it to data and try to determine the correct parameters.
Firstly we need to read in the data, which is stored in “data_sir.csv”. As before, download this from here and point R to the correct working directory where you have this data saved. Then you can type:
Can you make a simple plot to check what this SIR outbreak looks like?
As described in the help pages for ode
, this function itself requires several inputs including another function.
The first input it requires are the initial conditions. We have to specify the initial conditions for all of the states. Here we have three states, S, I and R:
Can you alter this initial vector for a population of 100,000 to have initially one infected and the rest susceptible?
The second required input are the times over which the output for the ODE is wanted. Note this is not the same as the timesteps over which the ODE will be solved. These are specified by the method. Here lets assume lets say we want output every year for 40 years.
The third required input is the function itself, which governs the dynamics between the states. Here as we have an SIR model we can translate the differential equations into three simple relationships:
sir <- function(time, state, parameters) {
# {{Curly brackets indicate the beginning and end of functions}} This is the
# meat of the function - what happens to the parameters. Here it needs to be
# a list}}
with(as.list(c(state, parameters)), {
dS = -beta * S * I # The change in S
dI = beta * S * I - gamma * I
dR = gamma * I
return(list(c(dS, dI, dR))) # {{'Return' as in 'spit out' the values that you want}}
})
}
Finally, the ode
function requires the input parameters, which here are only beta and gamma:
In order to run the function we input all of the above and save the output in a vector called out
:
init <- c(S = 1e+05 - 1, I = 1, R = 0)
out <- ode(y = init, func = sir, times = times, parms = parameters)
head(out)
What does the output data look like? Does it make sense? Can you build a simple plot to look at the change in the number Infected over time?
To look at all of the model output we can plot everything on the same graph:
plot(out[, "time"], out[, "S"], type = "l", lwd = 10, col = "green", xlab = "Time",
ylab = "Number")
lines(out[, "time"], out[, "I"], col = "red", lwd = 10)
lines(out[, "time"], out[, "R"], col = "blue", lwd = 10)
legend(25, 40000, c("Susceptibles", "Infecteds", "Recovereds"), lwd = 10, lty = 1,
col = 2:4) # Add legend and specific position
Does this match the data you read before in using read.csv
well? Note, that the data given is the proportions infected over time. So in order to compare you either need to convert the data to numbers or the numbers to data. Try and do this, and plot the data and model output on the same graph. A solution is given below if you get stuck.
What we are doing here is assessing our model’s “fit” to the data. This is a key stage in the development of any model - does it refect reality? The simplest way to do this is by plotting model output and data together and comparing them by eye. Over the next few days, we will introduce several more formal and quantitative methods for assessing a model “fit”.
data$number = data$proportion * 1e+05
plot(out[, "time"], out[, "S"], type = "l", lwd = 5, col = "green", xlab = "Time",
ylab = "Number")
lines(out[, "time"], out[, "I"], col = "red", lwd = 5)
lines(out[, "time"], out[, "R"], col = "blue", lwd = 5)
legend(25, 80000, c("Susceptibles", "Infecteds", "Recovereds"), lwd = 5, lty = 1,
col = 2:4) #
points(data$time, data$number)
Assume you know that the value for beta
is 2/100000. Can you write a function that considers a range of different parameters for gamma
, runs the model with these various values and then plots the outcomes with the data? From this what is the best estimate you can get for gamma
?
Note that this was output generated with a set value for gamma and so you can find a solution - however in the real world epidemic there is unlikely to be such a perfect fit!
The solution to the code is below, but try not to look at this before having a go yourself. This will use all the skills you’ve learnt in R above and will prepare you for the rest of the course.
# Run for many values of gamma
gamma_values <- seq(0.2, 0.8, 0.1) # vector of possible gamma values
store_I <- matrix(0, length(gamma_values) * length(times), 3) ### Empty vector where we will store the output
# Run through and solve with each gamma value Length = how many entries in
# vector What gamma value in this run?
for (i in 1:length(gamma_values)) {
gamma_new <- gamma_values[i]
# New parameter set
parameters <- c(beta = 2/1e+05, gamma = gamma_new)
# Run with this new parameter set
out <- ode(y = init, times = times, func = sir, parms = parameters)
# Store the number of infecteds this produces
stored <- cbind(out[, "time"], out[, "I"], gamma_new) # Bind the columns with time, the infected and the new gamma
store_I[((i - 1) * length(times) + 1):(i * length(times)), ] = stored # Store in the big matrix
}
# Have a look at what you have...(again wouldn't usually include this in the
# saved R code)
head(store_I)
colnames(store_I) <- c("Time", "Infecteds", "gamma_new")
# Plot
plot(store_I[, "Time"], store_I[, "Infecteds"], type = "l", xlab = "Time", ylab = "Number")
points(data$time, data$number) # compare visually to data
You can store your data for the SIR solution at different gamma values using (as above):
In the lecture yesterday, you saw the infection cycles that occur when you use an SIR model with births and deaths. If you have gotten this far, try to build and solve using ode, your own version of this SIR model with births and deaths.
Remember that the set of ordinary differential equations for this is: \[\frac{dS}{dt} = - \beta \frac{I}{N} S + \nu N - \mu S \\ \frac{dI}{dt} = \beta \frac{I}{N} S -\gamma I - \mu I \\ \frac{dR}{dt} = \gamma I - \mu R \].
Initial parameters from the lecture were: \[ \beta = 500, \gamma = 50, \nu = 0.02 \].
Remember that the population size should stay constant, at say 1,000.
If you have time, you can now try to build more complex versions of the SIR model to match the infectious disease you are interested in. For example, you could try and include multiple strains of influenza or extend the model to include some of the complexity of TB. The first stage would be to sketch the model structure on paper. Remember that simplicity is the key and that you will need data to inform any parameter values. Secondly, translate the model structure into differential equations. Thirdly, translate the differential equations into a new function to use in the ode solver function.
Top: Index Next: Code a stochastic model