Chapter 9 Some useful tools in R

Learning goals for this lesson

  • Get to know some neat tools in R that can make coding more elegant - and easier
  • Get introduced to the tidyverse
  • Learn about loops
  • Get to know the apply function family

9.1 An evolving language - and a lifelong learning process

The R universe is a very active space, with lots of improvements being made all the time in various places. Through these improvements, the language has evolved far beyond the relatively basic capabilities of base R. When I started learning R around 2010, I solved most of my problems with base R functions. This often resulted in convoluted code and ugly plots… I’d like to believe this was because the more advanced functions weren’t available yet, but the real reason is that my personal learning curve hadn’t caught up (and still hasn’t caught up) with the true state of the art in R programming.

Over the years, I have gradually come around to adopting some of these more modern tools and more elegant programming styles. Since we’ll be using some of these throughout the remaining chapters, it’s now time for an introduction. For all the tools in this chapter, there are much better and more comprehensive instruction materials elsewhere on the web (I’ll provide pointers), but I’ll try to give you the basics you need in order to follow the materials in this book.

9.2 The tidyverse

Many of the specific tools I want to introduce to you are part of the tidyverse, a set of packages developed by Hadley Wickham and his team. The whole collection is described here. I have only scratched the surface of this, but I encourage you to delve into this treasure chest to look for ways to improve your programming capabilities. Here, I’ll only highlight the functions that are used in this book. A nice feature of the tidyverse is that we only have to load a single package to access all the tidyverse capabilities: library(tidyverse) does the trick.

9.3 The ggplot2 package

We’ve already encountered ggplot2, so I’m just listing this here for completeness. Initially released in 2007 by Hadley Wickham, ggplot2 has become one of the most popular R packages, because it greatly facilitates making attractive figures. You can read up on the history of the package here.

A great introduction to ggplot2 and links to various tutorials etc. can be accessed here.

9.4 The tibble package

A tibble is an advanced version of a data.frame, which includes several improvements. These are described here. The most relevant improvement in my view is that tibbles don’t follow the classic data.frame habit of converting strings to factors at times when you don’t expect it. I’m fairly new to tibbles myself, but I’ll try to use them throughout the remainder of this book.

You can easily create a tibble from a normal data.frame (or a similar structure) by using the as_tibble command.

library(tidyverse)

dat <- data.frame(a=c(1,2,3),b=c(4,5,6))
d <- as_tibble(dat)
d
## # A tibble: 3 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2     5
## 3     3     6

9.5 The magrittr package - pipes

The main thing magrittr adds is a structure to organize workflows that are applied to the same dataset. A data structure such as a tibble can be subjected to one or multiple operations organized in a pipe. The notation for such a pipe is %>%.

For instance, we can calculate the sum of all numbers in the tibble d we created above by the following operation.

d %>% sum()
## [1] 21

Note that we didn’t have to pass the d to the sum command as an input. After a pipe, the following function always assumes that the first input to the function is the product received through the pipe. You can add more commands by adding another pipe after the first one. We’ll get to some more complex - and more useful - examples below.

9.6 The tidyr package

tidyr provides useful functions for organizing your data. I’ll use the KA_weather dataset from chillR to demonstrate how some of these work.

library(chillR)

KAw<-as_tibble(KA_weather[1:10,])
KAw
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1     1   8.2   5.1
##  2  1998     1     2   9.1   5  
##  3  1998     1     3  10.4   3.3
##  4  1998     1     4   8.4   4.5
##  5  1998     1     5   7.7   4.5
##  6  1998     1     6   8.1   4.4
##  7  1998     1     7  12     6.9
##  8  1998     1     8  11.2   8.6
##  9  1998     1     9  13.9   8.5
## 10  1998     1    10  14.5   3.6

9.6.1 pivot_longer

We already encountered the pivot_longer function in the previous lesson. We can use this to transfer data from separate columns (e.g. Tmin and Tmax in this case) into distinct rows. In this example, we’ll have one row containing Tmin and one row for Tmax for each day of the record. We’ll often have to do this, for instance, when we want to use the ggplot2 package for plotting our data. Here’s how this works (with a pipe).

KAwlong <- KAw %>% pivot_longer(cols=Tmax:Tmin)
KAwlong
## # A tibble: 20 × 5
##     Year Month   Day name  value
##    <int> <int> <int> <chr> <dbl>
##  1  1998     1     1 Tmax    8.2
##  2  1998     1     1 Tmin    5.1
##  3  1998     1     2 Tmax    9.1
##  4  1998     1     2 Tmin    5  
##  5  1998     1     3 Tmax   10.4
##  6  1998     1     3 Tmin    3.3
##  7  1998     1     4 Tmax    8.4
##  8  1998     1     4 Tmin    4.5
##  9  1998     1     5 Tmax    7.7
## 10  1998     1     5 Tmin    4.5
## 11  1998     1     6 Tmax    8.1
## 12  1998     1     6 Tmin    4.4
## 13  1998     1     7 Tmax   12  
## 14  1998     1     7 Tmin    6.9
## 15  1998     1     8 Tmax   11.2
## 16  1998     1     8 Tmin    8.6
## 17  1998     1     9 Tmax   13.9
## 18  1998     1     9 Tmin    8.5
## 19  1998     1    10 Tmax   14.5
## 20  1998     1    10 Tmin    3.6

As you can see, we had to specify the columns that we wanted to stack up. Note that pivot_longer fulfills a similar function to the melt function of the reshape2 package, which I used until recently (and in earlier versions of this book). I find pivot_longer more intuitive, so I’ll be using this throughout the remaining chapters.

9.6.2 pivot_wider

We can also do an opposite conversion to the one implemented by pivot_longer by using the pivot_wider command.

KAwwide <- KAwlong %>% pivot_wider(names_from=name) 
KAwwide
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1     1   8.2   5.1
##  2  1998     1     2   9.1   5  
##  3  1998     1     3  10.4   3.3
##  4  1998     1     4   8.4   4.5
##  5  1998     1     5   7.7   4.5
##  6  1998     1     6   8.1   4.4
##  7  1998     1     7  12     6.9
##  8  1998     1     8  11.2   8.6
##  9  1998     1     9  13.9   8.5
## 10  1998     1    10  14.5   3.6

The names_from argument specified the column that contains the new column headers. In this example, the call would also have worked without this argument, but that may not always be the case.

9.6.3 select

With the select function, we can pick out a subset of the columns of a data.frame or tibble.

KAw %>% select(c(Month, Day, Tmax))
## # A tibble: 10 × 3
##    Month   Day  Tmax
##    <int> <int> <dbl>
##  1     1     1   8.2
##  2     1     2   9.1
##  3     1     3  10.4
##  4     1     4   8.4
##  5     1     5   7.7
##  6     1     6   8.1
##  7     1     7  12  
##  8     1     8  11.2
##  9     1     9  13.9
## 10     1    10  14.5

9.6.4 filter

The filter function reduces a data.frame or tibble to just the rows that fulfill certain conditions.

KAw %>% filter(Tmax>10)
## # A tibble: 5 × 5
##    Year Month   Day  Tmax  Tmin
##   <int> <int> <int> <dbl> <dbl>
## 1  1998     1     3  10.4   3.3
## 2  1998     1     7  12     6.9
## 3  1998     1     8  11.2   8.6
## 4  1998     1     9  13.9   8.5
## 5  1998     1    10  14.5   3.6

9.6.5 mutate

The mutate function is a work horse for creating, modifying, and deleting columns from a data.frame or tibble.

Let’s first create new columns, e.g. two columns that contain Tmin and Tmax in Kelvin.

KAw_K <- KAw %>% mutate(Tmax_K = Tmax + 273.15, Tmin_K = Tmin + 273.15)
KAw_K
## # A tibble: 10 × 7
##     Year Month   Day  Tmax  Tmin Tmax_K Tmin_K
##    <int> <int> <int> <dbl> <dbl>  <dbl>  <dbl>
##  1  1998     1     1   8.2   5.1   281.   278.
##  2  1998     1     2   9.1   5     282.   278.
##  3  1998     1     3  10.4   3.3   284.   276.
##  4  1998     1     4   8.4   4.5   282.   278.
##  5  1998     1     5   7.7   4.5   281.   278.
##  6  1998     1     6   8.1   4.4   281.   278.
##  7  1998     1     7  12     6.9   285.   280.
##  8  1998     1     8  11.2   8.6   284.   282.
##  9  1998     1     9  13.9   8.5   287.   282.
## 10  1998     1    10  14.5   3.6   288.   277.

Now we delete these columns again, by setting them to NULL.

KAw_K %>% mutate(Tmin_K = NULL, Tmax_K = NULL)
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1     1   8.2   5.1
##  2  1998     1     2   9.1   5  
##  3  1998     1     3  10.4   3.3
##  4  1998     1     4   8.4   4.5
##  5  1998     1     5   7.7   4.5
##  6  1998     1     6   8.1   4.4
##  7  1998     1     7  12     6.9
##  8  1998     1     8  11.2   8.6
##  9  1998     1     9  13.9   8.5
## 10  1998     1    10  14.5   3.6

Now I’ll replace the original temperature values directly with the Fahrenheit values. The following code modifies these columns accordingly.

KAw %>% mutate(Tmin = Tmin + 273.15, Tmax = Tmax + 273.15)
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1     1  281.  278.
##  2  1998     1     2  282.  278.
##  3  1998     1     3  284.  276.
##  4  1998     1     4  282.  278.
##  5  1998     1     5  281.  278.
##  6  1998     1     6  281.  278.
##  7  1998     1     7  285.  280.
##  8  1998     1     8  284.  282.
##  9  1998     1     9  287.  282.
## 10  1998     1    10  288.  277.

There are many other interesting things you can do with mutate, so please check out the help file for more options.

9.6.6 arrange

arrange is a function to sort data in data.frames or tibbles.

KAw %>% arrange(Tmax, Tmin)
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1     5   7.7   4.5
##  2  1998     1     6   8.1   4.4
##  3  1998     1     1   8.2   5.1
##  4  1998     1     4   8.4   4.5
##  5  1998     1     2   9.1   5  
##  6  1998     1     3  10.4   3.3
##  7  1998     1     8  11.2   8.6
##  8  1998     1     7  12     6.9
##  9  1998     1     9  13.9   8.5
## 10  1998     1    10  14.5   3.6

You can also sort in descending order.

KAw %>% arrange(desc(Tmax), Tmin)
## # A tibble: 10 × 5
##     Year Month   Day  Tmax  Tmin
##    <int> <int> <int> <dbl> <dbl>
##  1  1998     1    10  14.5   3.6
##  2  1998     1     9  13.9   8.5
##  3  1998     1     7  12     6.9
##  4  1998     1     8  11.2   8.6
##  5  1998     1     3  10.4   3.3
##  6  1998     1     2   9.1   5  
##  7  1998     1     4   8.4   4.5
##  8  1998     1     1   8.2   5.1
##  9  1998     1     6   8.1   4.4
## 10  1998     1     5   7.7   4.5

9.7 Loops

In addition to the tidyverse functions, we have to talk about an important code structure that will allow us to get a lot of work done in an efficient manner: loops. A loop allows us to repeat the same operation many times without having to explicitly retype (or copy and paste) the code. More importantly, it allows us to run the same code while introducing certain modifications in every run. You can read detailed explanations on loops here, but I’ll give you the basics in this chapter.

There are two basic types of loops: for loops and while loops. For both of them, we have to provide instructions that regulate the number of runs, as well as instructions on what to do in each of the runs.

9.7.1 For loops

In a for loop, we provide explicit instructions on how many times the code within the loop should be run. This is usually specified by providing a vector or list of elements and instructing R to run the code for each of these elements. This means that the number of times the code is run equals the number of elements in the vector or list. We need a counter (often called i but can also be any other variable name) to keep track of which run we’re in.

for (i in 1:3) print("Hello")
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"

This command ran the code three times, plotting the same output each time. We can make this structure more complex by providing multiple lines of code within winged brackets.

addition <- 1

for (i in 1:3)
{
  addition <- addition + 1
  print(addition)
}
## [1] 2
## [1] 3
## [1] 4

The code in this loop added 1 to the element addition (with an initial value of 1) in each iteration, and it printed the resulting value (note that you may have to explicitly instruct R to print such values, when the operation is embedded within a loop).

We can add more flexibility to the operations by using the index i within the code block.

addition <- 1

for (i in 1:3)
{
  addition <- addition + i
  print(addition)
}
## [1] 2
## [1] 4
## [1] 7

Now we added the respective value of i to the addition element in each of the runs. We can also use i in more creative ways.

names <- c("Paul", "Mary", "John")

for (i in 1:3)
{
  print(paste("Hello", names[i]))
}
## [1] "Hello Paul"
## [1] "Hello Mary"
## [1] "Hello John"

The counter doesn’t have to be numeric, but it can assume many other shapes, e.g. that of a string. We can therefore generate the same output as from the last code block by formulating this as follows:

for (i in c("Paul", "Mary", "John"))
{
  print(paste("Hello", i))
}
## [1] "Hello Paul"
## [1] "Hello Mary"
## [1] "Hello John"

9.7.2 While loops

We can also specify the decision on whether to run a loop with a while statement. The code is then run, until the specified condition is no longer fulfilled. This only makes sense, of course, if the condition can change as a result of what happens inside the loop.

cond <- 5

while (cond>0)
{
  print(cond)
  cond <- cond - 1
}
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

As soon as cond reaches 0, the starting condition is no longer fulfilled, so that the code isn’t run again. Note that while loops can easily cause problems if the condition remains fulfilled regardless of what happens in the code block. Your code will then get hung up and needs to be cancelled manually.

9.8 apply functions

In addition to loops, R has another elegant method for applying certain operations to multiple elements at the same time. Don’t ask me why, but this is often a much faster way of getting things done. Such operations are implemented by the functions from the apply family: apply, lapply and sapply. The two central arguments that need to be provided to these functions are the list of items to apply the operation to, and the operation itself.

9.8.1 sapply

When you just want to apply an operation to a vector of elements, the easiest function to use is sapply. It only needs two arguments: the vector (X), and the function to be applied (FUN). To illustrate this, I’ll create a simple function, func, which just adds 1 to an object.

func <- function(x)
  x + 1

sapply(1:5, func)
## [1] 2 3 4 5 6

As you can see, the output is a vector of numbers that are 1 greater than the input vector. If we apply this function to a list of numbers, the output is a matrix (but the values are the same).

sapply(list(1:5), func)
##      [,1]
## [1,]    2
## [2,]    3
## [3,]    4
## [4,]    5
## [5,]    6

9.8.2 lapply

If we want the output to be a list, we can use the lapply function. It interprets the input element X as a list and returns a list with as many elements as were provided in that list, with each one containing the output of applying FUN to the respective element.

lapply(1:5, func)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6

Note that if the input element X is itself a list, this list is treated as one input element, with FUN applied to the entire list and the result returned as a single list element. It may be easier to look at an example to understand this.

lapply(list(1:5), func)
## [[1]]
## [1] 2 3 4 5 6

9.8.3 apply

The basic apply function is for applying functions to arrays, where we can operate either on the rows (MARGIN = 1) or on the columns (MARGIN = 1) of the array. We probably won’t use this much, so here are just some simple examples of what this function does. Feel free to look through the help file (or google - lots of helpful materials out there) to learn more about this.

mat <- matrix(c(1,1,1,2,2,2,3,3,3),c(3,3))

mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3
apply(mat, MARGIN=1, sum) # adding up all the data in each row
## [1] 6 6 6
apply(mat, MARGIN=2, sum) # adding up all the data in each column
## [1] 3 6 9

Exercises on useful R tools

Please document all results of the following assignments in your learning logbook.

  1. Based on the Winters_hours_gaps dataset, use magrittr pipes and functions of the tidyverse to accomplish the following:
    1. Convert the dataset into a tibble
    1. Select only the top 10 rows of the dataset
    1. Convert the tibble to a long format, with separate rows for Temp_gaps and Temp
    1. Use ggplot2 to plot Temp_gaps and Temp as facets (point or line plot)
    1. Convert the dataset back to the wide format
    1. Select only the following columns: Year, Month, Day and Temp
    1. Sort the dataset by the Temp column, in descending order
  1. For the Winter_hours_gaps dataset, write a for loop to convert all temperatures (Temp column) to degrees Fahrenheit
  2. Execute the same operation with a function from the apply family
  3. Now use the tidyverse function mutate to achieve the same outcome
  4. Voluntary: consider taking a look at the instruction materials on all these functions, which I linked above, as well as at other sources on the internet. There’s a lot more to discover here, with lots of potential for making your coding more elegant and easier - and possibly even more fun!