Pre-lecture materials
Read ahead
Before class, you can prepare by reading the following materials:
Prerequisites
Before starting you must install the additional package:
- purrr - this provides a consistent functional programming interface to work with functions and vectors
You can do this by calling
install.packages("purrr")
or use the “Install Packages…” option from the “Tools” menu in RStudio.
Acknowledgements
Material for this lecture was borrowed and adapted from
Learning objectives
At the end of this lesson you will:
- Be familiar with the concept of functional programming
- Get comfortable with the major functions in purrr, e.g. the map family, reduce
- Write your loops with map functions instead of the for loop
Functional Programming
The characteristics
At its core, functional programming treats functions like any other data structure. Functions with this property are called first-class functions.
In R, this means that you can do many of the things with a function that you can do with a vector: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.
What do you mean?
- Assign a function to a variable
foo <- function(){
  return("This is foo.")
}
class(foo)
[1] "function"
- Store functions in a list
foo_list <- list( 
  fun_1 = function() return("foo_1"),
  fun_2 = function() return("foo_2")
)
str(foo_list)
List of 2
 $ fun_1:function ()  
  ..- attr(*, "srcref")= 'srcref' int [1:8] 2 11 2 36 11 36 2 2
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f925303de48> 
 $ fun_2:function ()  
  ..- attr(*, "srcref")= 'srcref' int [1:8] 3 11 3 36 11 36 3 3
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f925303de48> 
- Pass functions as arguments to other functions
shell <- function(f) f()
shell(foo_list$fun_1)
[1] "foo_1"
shell(foo_list$fun_2)
[1] "foo_2"
- Create functions inside of functions & return them as the result of a function
foo_wrap <- function(){
  foo_2 <- function(){
    return("This is foo_2.")
  }
  return(foo_2)
}
foo_wrap()
function(){
    return("This is foo_2.")
  }
<environment: 0x7f92410bf898>
(foo_wrap())()
[1] "This is foo_2."
The bottom line: you can manipulate functions in the same way as you would a vector or a matrix.
Why is functional programming important?
Functional programming introduces a new style of programming, namely functional style. Broadly speaking, this programming style encourages programmers to write a big function as many smaller isolated functions, where each function addresses one specific task.
As a by-product, the functional style encourages code that is more human-readable and reusable.
"data_set.csv" |> 
  import_data_from_file() |> 
  data_cleaning() |> 
  run_regression() |>
  model_diagnostics() |>
  model_visualization()
"data_set2.csv" |> 
  import_data_from_file() |> 
  data_cleaning() |> 
  run_different_regression() |>
  model_diagnostics() |>
  model_visualization()
R provides some pipe operators to make code readable, e.g. |> from base R and %>% from the package magrittr. These operators work like a pipe, passing the output of the previous function (left-hand side of the operator) as the first argument of the following function (right-hand side of the operator). The pipe operator |> was introduced in R 4.1.0 and requires no additional packages, unlike %>%.
A keyboard shortcut to type a pipe operator in RStudio is shift+cmd+m for Mac or shift+ctrl+m in Windows.
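As a small illustration of how the pipe works, the following sketch shows that a pipe chain is just another way of writing nested function calls. The helpers add_one and double_it are hypothetical names invented for this example, and the native pipe assumes R 4.1.0 or later.

```r
# Hypothetical helper functions, defined only for this illustration
add_one <- function(x) x + 1
double_it <- function(x) x * 2

# Nested calls are read from the inside out
nested <- double_it(add_one(1:3))

# The same computation with the native pipe is read from left to right
piped <- 1:3 |> add_one() |> double_it()

identical(nested, piped)
```

Both versions run the same two steps in the same order. The piped version simply lists them in the order they are executed.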
purrr: the functional programming toolkit
The R package purrr, an important component of the tidyverse, provides an interface to manipulate vectors in the functional style.
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
purrr cheatsheet
It is very difficult, if not impossible, to remember all the functions that a package offers as well as their use cases. Hence, the purrr developers offer a compact cheatsheet with visualizations at https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf. Similar cheatsheets are available for other tidyverse packages.
The most popular function in purrr is map(), which iterates over the supplied data structure and applies a function at each iteration. Besides the map function, purrr also offers a series of useful functions for manipulating lists.
The map family
The map family of functions provides a convenient way to iterate through vectors or lists and apply functions during this iteration. Depending on the dimension of the input and the format of the output, there are many different variants of the basic map function.
How map relates to functional programming
Because their arguments include functions (.f) in addition to data (.x), map functions are a convenient interface for functional programming.
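To see why taking a function as an argument is the key idea, here is a minimal sketch of a map-like function written with base R only. The name my_map is hypothetical, and this is not how purrr implements map. It only illustrates that .f is passed around like any other value.

```r
# A hypothetical, simplified map: apply .f to every element of .x
# and always return a list (not purrr's actual implementation)
my_map <- function(.x, .f, ...) {
  out <- vector("list", length(.x))
  for (i in seq_along(.x)) {
    # wrap in list() so that a NULL result does not drop the element
    out[i] <- list(.f(.x[[i]], ...))
  }
  names(out) <- names(.x)
  out
}

# The function to apply is supplied as an ordinary argument
squares <- my_map(1:3, function(x) x^2)
str(squares)

# Named inputs keep their names in the output
sums <- my_map(list(a = 1:3, b = 4:6), sum)
str(sums)
```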
map as a for loop
library(purrr)
triple <- function(x) x * 3
# for loop
loop_ret <- list()
for(i in 1:3){
  loop_ret[i] <- triple(i)
}
# map implementation
map_eg1 <- map(.x = 1:3, .f = triple)
map_eg2 <- map(.x = 1:3, .f = ~triple(.x))
map_eg3 <- map(.x = 1:3, .f = function(x) triple(x))
identical(loop_ret, map_eg1)
[1] TRUE
identical(loop_ret, map_eg2)
[1] TRUE
identical(loop_ret, map_eg3)
[1] TRUE
map with a data frame
tmp_dat <- data.frame(
  x = 1:5,
  y = 6:10
)
tmp_dat |> 
  map(.f = mean)
$x
[1] 3
$y
[1] 8
# Alternatively
# map(.x = tmp_dat, .f = mean)
data.frame vs list
A data.frame is a special case of a list, where each column is one item of the list. Do not confuse this with each row being an item.
class(tmp_dat)
[1] "data.frame"
typeof(tmp_dat)
[1] "list"
Extra arguments for functions
tmp_dat2 <- as.list(tmp_dat)
tmp_dat2$y[6] <- NA
str(tmp_dat2)
List of 2
 $ x: int [1:5] 1 2 3 4 5
 $ y: int [1:6] 6 7 8 9 10 NA
tmp_dat2 |> map(.f = mean) # No extra arguments
$x
[1] 3
$y
[1] NA
tmp_dat2 |> 
  map(.f = mean, na.rm = TRUE) # With extra arguments
$x
[1] 3
$y
[1] 8
tmp_dat2 |> 
  map(.f = function(x, remove_na) mean(x, na.rm = remove_na),
      remove_na = TRUE)
$x
[1] 3
$y
[1] 8
Stratified analysis with map
We use the mtcars dataset from the package datasets to demonstrate.
library(datasets)
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
unique(mtcars$cyl) # different numbers of cylinders
[1] 6 4 8
We are interested in the average miles per gallon for vehicles with different numbers of cylinders.
# Split the data into one data frame per number of cylinders
str_dat <- mtcars |> split(mtcars$cyl)
length(str_dat)
[1] 3
str(str_dat)
List of 3
 $ 4:'data.frame':  11 obs. of  11 variables:
  ..$ mpg : num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  ..$ cyl : num [1:11] 4 4 4 4 4 4 4 4 4 4 ...
  ..$ disp: num [1:11] 108 146.7 140.8 78.7 75.7 ...
  ..$ hp  : num [1:11] 93 62 95 66 52 65 97 66 91 113 ...
  ..$ drat: num [1:11] 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
  ..$ wt  : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  ..$ qsec: num [1:11] 18.6 20 22.9 19.5 18.5 ...
  ..$ vs  : num [1:11] 1 1 1 1 1 1 1 1 0 1 ...
  ..$ am  : num [1:11] 1 0 0 1 1 1 0 1 1 1 ...
  ..$ gear: num [1:11] 4 4 4 4 4 4 3 4 5 5 ...
  ..$ carb: num [1:11] 1 2 2 1 2 1 1 1 2 2 ...
 $ 6:'data.frame':  7 obs. of  11 variables:
  ..$ mpg : num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
  ..$ cyl : num [1:7] 6 6 6 6 6 6 6
  ..$ disp: num [1:7] 160 160 258 225 168 ...
  ..$ hp  : num [1:7] 110 110 110 105 123 123 175
  ..$ drat: num [1:7] 3.9 3.9 3.08 2.76 3.92 3.92 3.62
  ..$ wt  : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
  ..$ qsec: num [1:7] 16.5 17 19.4 20.2 18.3 ...
  ..$ vs  : num [1:7] 0 0 1 1 1 1 0
  ..$ am  : num [1:7] 1 1 0 0 0 0 1
  ..$ gear: num [1:7] 4 4 3 3 4 4 5
  ..$ carb: num [1:7] 4 4 1 1 4 4 6
 $ 8:'data.frame':  14 obs. of  11 variables:
  ..$ mpg : num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
  ..$ cyl : num [1:14] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ disp: num [1:14] 360 360 276 276 276 ...
  ..$ hp  : num [1:14] 175 245 180 180 180 205 215 230 150 150 ...
  ..$ drat: num [1:14] 3.15 3.21 3.07 3.07 3.07 2.93 3 3.23 2.76 3.15 ...
  ..$ wt  : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
  ..$ qsec: num [1:14] 17 15.8 17.4 17.6 18 ...
  ..$ vs  : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ am  : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ gear: num [1:14] 3 3 3 3 3 3 3 3 3 3 ...
  ..$ carb: num [1:14] 2 4 3 3 3 4 4 4 2 2 ...
str_dat |> 
  map(.f = ~mean(.x$mpg))
$`4`
[1] 26.66364
$`6`
[1] 19.74286
$`8`
[1] 15.1
Matrix as the output
The map family includes functions that organize the output in different data structures; their names follow the pattern map_*. As we’ve seen, the map function returns a list. The following functions return a vector of a specific type, e.g. map_lgl returns a logical vector and map_chr returns a character vector. It is also possible to return the results as a data frame by row binding (map_dfr) or column binding (map_dfc).
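The typed variants named above can be tried on a small example. The sketch below assumes purrr is installed and uses the built-in mtcars data. The object name by_cyl is chosen for this illustration only, and map_int (which returns an integer vector) is included alongside map_lgl and map_chr.

```r
library(purrr)

# One data frame per number of cylinders (4, 6 and 8)
by_cyl <- split(mtcars, mtcars$cyl)

# Logical vector: is the average mpg in each group above 20?
high_mpg <- map_lgl(by_cyl, ~ mean(.x$mpg) > 20)

# Character vector: a short label per group
labels <- map_chr(by_cyl, ~ paste(nrow(.x), "cars"))

# Integer vector: number of rows per group
n_cars <- map_int(by_cyl, nrow)

high_mpg
labels
n_cars
```

Each variant raises an error if the function does not return a single value of the expected type, which makes mistakes visible early.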
str_dat |> 
  map_dbl(.f = ~mean(.x$mpg)) # returns a vector of doubles
       4        6        8 
26.66364 19.74286 15.10000 
str_dat |> 
  map_dfr(.f = ~colMeans(.x)) # return a data frame by row binding
# A tibble: 3 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5 
str_dat |> 
  map_dfc(.f = ~colMeans(.x)) # return a data frame by col binding
# A tibble: 11 × 3
       `4`     `6`     `8`
     <dbl>   <dbl>   <dbl>
 1  26.7    19.7    15.1  
 2   4       6       8    
 3 105.    183.    353.   
 4  82.6   122.    209.   
 5   4.07    3.59    3.23 
 6   2.29    3.12    4.00 
 7  19.1    18.0    16.8  
 8   0.909   0.571   0    
 9   0.727   0.429   0.143
10   4.09    3.86    3.29 
11   1.55    3.43    3.5  
Multiple Input
An operation may require a pair of variables as input. While this is still manageable with map, purrr provides better options, specifically map2 and pmap.
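Before the mtcars examples, a smaller sketch may help to show the calling pattern. It assumes purrr is installed. The vectors x, y and z and the argument names a, b and c are made up for this illustration.

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(10, 20, 30)
z <- c(100, 200, 300)

# map2 walks over two inputs in parallel; .x and .y refer to the paired elements
pair_sums <- map2_dbl(x, y, ~ .x + .y)

# pmap takes a list of inputs; with an unnamed list the elements are
# matched to the function arguments by position
triple_sums <- pmap_dbl(list(x, y, z), function(a, b, c) a + b + c)

pair_sums
triple_sums
```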
map_avg <- map_dbl(.x = mtcars, .f = mean)
map2_avg <- map2_dbl(.x = mtcars,
                     .y = list(weight = 1/nrow(mtcars)),
                     .f = ~sum(.x*.y))
identical(map_avg, map2_avg)
[1] TRUE
pmap_avg <- pmap_dbl(list(x = mtcars,
                          y = list(weight = 1/(2*nrow(mtcars))),
                          z = list(weight2 = 2)),
                     .f = ~sum(..1*..2*..3))
identical(map_avg, pmap_avg)
[1] TRUE
# Use element names in pmap
mtcars$weight <- 1/2
mtcars$weight2 <-  2
pmap_eg2 <- pmap_dbl(mtcars,
                     .f = function(mpg, weight, weight2, ...){
                       mpg * weight * weight2
                     })
identical(pmap_eg2, mtcars$mpg)
[1] TRUE
No output
Some operations do not need any output during the iteration, e.g. saving a dataset. In this case, map still returns a list, here filled with NULL. One can use walk instead. The function walk calls the function in the same way as map, but it returns its input invisibly, so nothing is printed.
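The behaviour of walk can be checked with a minimal sketch, assuming purrr is installed. The call to message stands in for any side effect such as writing a file, and the name res is used only for this illustration.

```r
library(purrr)

# walk runs the function once per element for its side effect only
res <- walk(1:3, ~ message("processing item ", .x))

# Nothing is printed by walk itself, but the input is returned invisibly,
# which allows walk to sit in the middle of a pipe chain
identical(res, 1:3)
```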
tmp_fldr <- tempdir()
map2(.x = str_dat,
     .y = 1:length(str_dat),
     .f = ~saveRDS(.x, 
                   file = paste0(tmp_fldr, "/",.y, ".rds"))
)
$`4`
NULL
$`6`
NULL
$`8`
NULL
# No output
walk2(.x = str_dat,
      .y = (1:length(str_dat)),
      .f = ~saveRDS(.x, 
                    file = paste0(tmp_fldr, "/",.y, ".rds"))
)
Other functions in purrr
reduce and accumulate
purrr also provides reduce, which collapses a list or vector into a single value by repeatedly applying a preferred binary operator or function. Its variant accumulate keeps the intermediate results of this reduction process.
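A small sketch, assuming purrr is installed, shows the difference between the two functions on a short integer vector. The names total and running are chosen for this illustration.

```r
library(purrr)

x <- 1:4

# reduce combines the elements pairwise: ((1 + 2) + 3) + 4
total <- reduce(x, `+`)

# accumulate keeps every intermediate result of the same reduction
running <- accumulate(x, `+`)

total
running

# Base R offers the same idea through Reduce()
Reduce(`+`, x, accumulate = TRUE)
```

The last element of the accumulate result is always the value that reduce returns.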
mtcars$weight <- 1/(2*nrow(mtcars))
mtcars$weight2 <-  2
reduce_eg <- 
  pmap_dbl(mtcars,
           .f = function(mpg, weight, weight2, ...){
             mpg * weight * weight2
           }) |> 
  reduce(`+`)
pmap_dbl(mtcars,
           .f = function(mpg, weight, weight2, ...){
             mpg * weight * weight2
           })|>
  head() |> # Only show the first 6 operations
  accumulate(`+`)
[1] 0.656250 1.312500 2.025000 2.693750 3.278125 3.843750
Working with lists
Let’s move to the purrr cheatsheet at https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf.
Summary
- Introduction to functional programming.
- The R package purrr provides a nice interface to functional programming and list manipulation.
- The function map and its alternatives map_* provide a neat way to iterate over a list or vector with the output in different data structures.
- The functions map2 and pmap allow having more than one list as input.
- The function walk and its alternatives, e.g. walk2, do not print any output.
- The functions reduce and accumulate help to summarize a list with a preferred operator or function.
Post-lecture materials
- What do imap and iwalk do? In this lecture note, can you find the one example that could be rewritten with imap and iwalk? Hint: see the sub-section named No output
- Are there any functions in base R that provide a nice interface for functional programming? Hint: ?with, ?within
- Can you write a section of code to demonstrate the central limit theorem, primarily using the purrr package and/or base R?