`inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)`

# Pre-lecture materials

### Read ahead

### Acknowledgements

Material for this lecture was borrowed and adopted from

# Learning objectives

# Vectorization

Writing `for` and `while` loops is useful and easy to understand, but in R we rarely use them.

As you learn more R, you will realize that **vectorization** is preferred over for-loops since it results in shorter and clearer code.

## Vector arithmetics

### Rescaling a vector

In R, arithmetic operations on **vectors occur element-wise**. For a quick example, suppose we have heights in inches, stored in the vector `inches`, and want to convert them to centimeters.

Notice what happens when we multiply inches by 2.54:

`inches * 2.54`

` [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80`

In the line above, we **multiplied each element** by 2.54.

Similarly, if for each entry we want to compute how many inches taller or shorter than 69 inches (the average height for males), we can subtract it from every entry like this:

`inches - 69`

` [1] 0 -7 -3 1 1 4 -2 4 -2 1`
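To make the contrast with explicit loops concrete, here is a minimal sketch that performs the inches-to-centimeters conversion twice, once with a `for` loop and once with vectorized arithmetic. The names `cm_loop` and `cm_vec` are introduced only for this illustration.

```r
# Heights in inches (the same values used in this section)
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Loop version: convert one element at a time
cm_loop <- numeric(length(inches))
for (i in seq_along(inches)) {
    cm_loop[i] <- inches[i] * 2.54
}

# Vectorized version: a single expression handles every element
cm_vec <- inches * 2.54

# Both approaches give the same values
all.equal(cm_loop, cm_vec)
```

The vectorized line does the same work as the four-line loop, which is the sense in which vectorized code tends to be shorter and clearer.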

### Two vectors

If we have **two vectors of the same length**, and we sum them in R, they will be **added entry by entry** as follows:

```
x <- 1:10
y <- 1:10
x + y
```

` [1] 2 4 6 8 10 12 14 16 18 20`

The same holds for other mathematical operations, such as `-`, `*` and `/`.

```
x <- 1:10
sqrt(x)
```

```
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
```

```
y <- 1:10
x * y
```

` [1] 1 4 9 16 25 36 49 64 81 100`

# Functional loops

While `for` loops are perfectly valid, when you use vectorization in an element-wise fashion, there is no need for `for` loops because we can apply what are called functional loops.

**Functional loops** are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here is a list of them:

- `lapply()`: Loop over a list and evaluate a function on each element
- `sapply()`: Same as `lapply()` but try to simplify the result
- `apply()`: Apply a function over the margins of an array
- `tapply()`: Apply a function over subsets of a vector
- `mapply()`: Multivariate version of `lapply()` (won’t cover)

An auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.

## `lapply()`

The `lapply()` function does the following simple series of operations:

- it loops over a list, iterating over each element in that list
- it applies a *function* to each element of the list (a function that you specify)
- and returns a list (the `l` in `lapply()` is for “list”).

This function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.

The body of the `lapply()` function can be seen here.

`lapply`

```
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<bytecode: 0x14f92f928>
<environment: namespace:base>
```

It is important to remember that `lapply()` always returns a list, regardless of the class of the input.
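As a quick check of that claim, the sketch below calls `lapply()` on two inputs that are not plain lists, an integer vector and a data frame. The object names (`res_vec`, `df`, `res_df`) are made up for this illustration.

```r
# An integer vector is coerced to a list before looping
res_vec <- lapply(1:3, sqrt)
is.list(res_vec)

# A data frame is a list of columns, so lapply() loops over the columns
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
res_df <- lapply(df, mean)
is.list(res_df)
names(res_df)
```

In both cases the result is a list with one element per input element (or per column), even though neither input was created with `list()`.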

**Functions in R can be** used this way and can be **passed back and forth as arguments** just like any other object in R.

When you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are **calling** a function.
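The difference between calling a function and passing it can be sketched in a few lines. The list `x` and the name `my_summary` are hypothetical and serve only this illustration.

```r
x <- list(1:4, 5:10)

# Calling a function: parentheses, evaluated immediately
single <- mean(1:4)

# Passing a function: no parentheses, lapply() does the calling
means <- lapply(x, mean)

# A function is an ordinary object, so it can be stored under another name
my_summary <- median
medians <- lapply(x, my_summary)
```

Writing `lapply(x, mean())` instead would try to call `mean()` with no arguments before `lapply()` ever runs, which is an error.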

You can use `lapply()` to evaluate a function multiple times each with a different argument.

Next is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.

```
x <- 1:4
lapply(x, runif)
```

```
[[1]]
[1] 0.3924746
[[2]]
[1] 0.807656 0.852134
[[3]]
[1] 0.9680554 0.6216622 0.4746080
[[4]]
[1] 0.09363509 0.80682941 0.44572025 0.55164581
```

Functions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.

Here is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.

Here, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.

```
x <- 1:4
lapply(x, runif, min = 0, max = 10)
```

```
[[1]]
[1] 7.339994
[[2]]
[1] 6.159324 4.167184
[[3]]
[1] 1.3182169 6.3869630 0.2614679
[[4]]
[1] 7.640224 1.984159 9.285444 2.845784
```

So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.

The `lapply()` function (and its friends) makes heavy use of *anonymous* functions. Anonymous functions are like members of Project Mayhem: they have no names. These functions are generated “on the fly” as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.
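A minimal sketch of an anonymous function in use: the list `x` of two small matrices is an assumption made for this illustration, and the function passed to `lapply()` extracts the first column of each matrix without ever being given a name.

```r
# Assumed example data: a list holding two small matrices
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

# The function is defined inline, inside the call to lapply()
first_cols <- lapply(x, function(elt) {
    elt[, 1]
})
first_cols
```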

This is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it’s going to be more complicated, it’s probably a better idea to define the function separately.

For example, I could have done the following.

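```
# x is assumed to be a list of matrices, e.g.
# list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2)); it is not defined in this section
f <- function(elt) {
    elt[, 1]   # keep the first column of each matrix
}
lapply(x, f)
```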

```
$a
[1] 1 2
$b
[1] 1 2 3
```

Whether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to `lapply()`, then it is probably simpler to use an anonymous function.

## `sapply()`

The `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:

- If the result is a list where every element is length 1, then a vector is returned
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
- If it can’t figure things out, a list is returned
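The three cases can be sketched with toy inputs. The anonymous functions and the result names (`as_vector`, `as_matrix`, `as_list`) are chosen only to trigger and label each case.

```r
# Every result has length 1: a vector is returned
as_vector <- sapply(1:3, function(i) i^2)

# Every result has the same length (> 1): a matrix, one column per input
as_matrix <- sapply(1:3, function(i) c(i, i^2))

# Results differ in length: no simplification, a list is returned
as_list <- sapply(1:3, seq_len)

class(as_vector)
dim(as_matrix)
class(as_list)
```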

Here’s the result of calling `lapply()`.

```
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
```

```
$a
[1] 2.5
$b
[1] -0.7692304
$c
[1] 1.1845
$d
[1] 5.011145
```

Notice that `lapply()` returns a list (as usual), but that each element of the list has length 1.

Here’s the result of calling `sapply()` on the same list.

`sapply(x, mean) `

```
a b c d
2.5000000 -0.7692304 1.1844997 5.0111453
```

Because the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.

## `split()`

The `split()` function takes a vector or other object and splits it into groups determined by a factor or list of factors.

The arguments to `split()` are

`str(split)`

`function (x, f, drop = FALSE, ...) `

where

- `x` is a vector (or list) or data frame
- `f` is a factor (or coerced to one) or a list of factors
- `drop` indicates whether empty factor levels should be dropped
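The effect of `drop` is easiest to see with a factor that declares a level which never occurs in the data. The vectors below are made up for this sketch.

```r
x <- c(10, 20, 30, 40)

# The factor declares three levels, but "c" never occurs in the data
f <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))

# Default (drop = FALSE): the unused level produces an empty group
kept <- split(x, f)
names(kept)

# drop = TRUE: the unused level is removed from the result
dropped <- split(x, f, drop = TRUE)
names(dropped)
```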

The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as “map-reduce” in other contexts.

Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to “generate levels” in a factor variable.

```
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10) # generate factor levels
f
```

```
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
```

`split(x, f)`

```
$`1`
[1] 0.06440437 1.78480833 -0.94373825 1.94781191 -0.16936618 -0.58442286
[7] 1.23801276 0.02465268 -0.35022800 -0.03086819
$`2`
[1] 0.1623650 0.7931292 0.5370609 0.6692380 0.2197358 0.2657368 0.6490295
[8] 0.2862331 0.8169028 0.9344586
$`3`
[1] 0.13424958 0.31285258 2.39555383 -0.11859862 -0.08085121 -0.17574475
[7] -1.08308465 0.18204113 1.13764707 0.56204495
```

A common idiom is `split` followed by an `lapply`.

`lapply(split(x, f), mean)`

```
$`1`
[1] 0.2981067
$`2`
[1] 0.533389
$`3`
[1] 0.326611
```

### Splitting a Data Frame

```
library(datasets)
head(airquality)
```

```
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
```

We can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.

```
s <- split(airquality, airquality$Month)
str(s)
```

```
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
```

Then we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.

```
lapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```

```
$`5`
Ozone Solar.R Wind
NA NA 11.62258
$`6`
Ozone Solar.R Wind
NA 190.16667 10.26667
$`7`
Ozone Solar.R Wind
NA 216.483871 8.941935
$`8`
Ozone Solar.R Wind
NA NA 8.793548
$`9`
Ozone Solar.R Wind
NA 167.4333 10.1800
```

Using `sapply()` might be better here for a more readable output.

```
sapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```

```
               5         6          7        8        9
Ozone         NA        NA         NA       NA       NA
Solar.R       NA 190.16667 216.483871       NA 167.4333
Wind    11.62258  10.26667   8.941935 8.793548  10.1800
```

Unfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.

```
sapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")],
             na.rm = TRUE)
})
```

```
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000
```

## tapply

`tapply()` is used to apply a function over subsets of a vector. It can be thought of as a combination of `split()` and `sapply()` for vectors only. I’ve been told that the “t” in `tapply()` refers to “table”, but that is unconfirmed.

`str(tapply)`

`function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) `

The arguments to `tapply()` are as follows:

- `X` is a vector
- `INDEX` is a factor or a list of factors (or else they are coerced to factors)
- `FUN` is a function to be applied
- `...` contains other arguments to be passed to `FUN`
- `simplify`, should we simplify the result?
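As a sketch of the basic usage, the code below simulates grouped data in the same way as the `split()` example (the seed is an arbitrary choice for reproducibility) and takes one mean per group. It also checks the description of `tapply()` as a combination of `split()` and `sapply()`.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)

# One mean per group; the result is simplified to a named 1-d array
group_means <- tapply(x, f, mean)
group_means

# The same numbers via split() followed by sapply()
split_means <- sapply(split(x, f), mean)
all.equal(as.vector(group_means), as.vector(split_means))

# simplify = FALSE keeps the result as a list
group_list <- tapply(x, f, mean, simplify = FALSE)
```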

We can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here’s an example of finding the `range()` (min and max) of each sub-group.

`tapply(x, f, range)`

```
$`1`
[1] -1.217068 1.723239
$`2`
[1] 0.0620568 0.8443268
$`3`
[1] -0.1079284 2.8115679
```

## `apply()`

The `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.
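The "average of an array of matrices" case can be sketched as follows. The array `a`, its dimensions, and the seed are assumptions for this illustration; the second argument `c(1, 2)` tells `apply()` to keep the first two dimensions and collapse the third.

```r
set.seed(1)  # arbitrary seed

# Ten 2 x 2 matrices stacked along the third dimension
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))

# Keep dimensions 1 and 2, collapse dimension 3 by averaging
avg_apply <- apply(a, c(1, 2), mean)

# The same element-wise average written as an explicit loop
avg_loop <- matrix(0, 2, 2)
for (k in 1:10) {
    avg_loop <- avg_loop + a[, , k] / 10
}

all.equal(avg_apply, avg_loop)
```

The loop and the `apply()` call compute the same 2 x 2 matrix; the `apply()` version simply states the intent in one line.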

`str(apply)`

`function (X, MARGIN, FUN, ..., simplify = TRUE) `

The arguments to `apply()` are

- `X` is an array
- `MARGIN` is an integer vector indicating which margins should be “retained”.
- `FUN` is a function to be applied
- `...` is for other arguments to be passed to `FUN`

You’ve probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?

The `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain.

So when taking the mean of each column, I specify

`apply(x, 2, mean)`

because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run

`apply(x, 1, sum)`

because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).
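A small worked example may help fix the idea of "retained" margins. The 2 x 3 matrix below is made up for this sketch; note how the length of each result matches the dimension that was preserved.

```r
# A small matrix with 2 rows and 3 columns
x <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 2: keep the columns, collapse the rows
col_means <- apply(x, 2, mean)
col_means    # one value per column: 1.5 3.5 5.5

# MARGIN = 1: keep the rows, collapse the columns
row_sums <- apply(x, 1, sum)
row_sums     # one value per row: 9 12
```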

### Col/Row Sums and Means

The shortcut functions `rowSums()`, `rowMeans()`, `colSums()`, and `colMeans()` are heavily optimized and hence are **much** faster than the equivalent `apply()` calls, but you probably won’t notice unless you’re using a large matrix.

Another nice aspect of these functions is that they are a bit more descriptive. It’s arguably clearer to write `colMeans(x)` in your code than `apply(x, 2, mean)`.
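The correspondence between the base R shortcut functions and `apply()` can be checked directly. The matrix and seed below are arbitrary choices for this sketch.

```r
set.seed(1)  # arbitrary seed
x <- matrix(rnorm(200), 20, 10)

# Each shortcut matches the corresponding apply() call
same_row_sums  <- all.equal(rowSums(x),  apply(x, 1, sum))
same_row_means <- all.equal(rowMeans(x), apply(x, 1, mean))
same_col_sums  <- all.equal(colSums(x),  apply(x, 2, sum))
same_col_means <- all.equal(colMeans(x), apply(x, 2, mean))

c(same_row_sums, same_row_means, same_col_sums, same_col_means)
```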

### Other Ways to Apply

You can do more than take sums and means with the `apply()` function.
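For instance, here is a sketch that computes the 25th and 75th percentiles of each row using `quantile()`. The matrix and seed are assumptions for this illustration; the `probs` argument is handed to `quantile()` through the `...` argument of `apply()`.

```r
set.seed(1)  # arbitrary seed
x <- matrix(rnorm(200), 20, 10)

# quantile() returns two values per row, so the result is a matrix
# with one column per row of x
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))

dim(q)
rownames(q)
```

Because the applied function returns a vector of length 2 for every row, `apply()` arranges the results as a 2 x 20 matrix rather than a plain vector.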

## Vectorizing a Function

Let’s talk about how we can **“vectorize” a function**.

What this means is that we can take a function that typically only accepts single (scalar) arguments and create a new function that can take vector arguments.

This is often needed when you want to plot functions.

There’s even a function in R called `Vectorize()` that **automatically can create a vectorized version of your function**.

So we could create a `vsumsq()` function that is fully vectorized as follows.

```
# sumsq() and x are assumed to be defined already; they are not shown in this section
vsumsq <- Vectorize(sumsq, c("mu", "sigma"))
vsumsq(1:10, 1:10, x)
```

```
[1] 203.24928 122.09428 108.16721 103.66455 101.75042 100.80246 100.28605
[8] 99.98663 99.80583 99.69400
```

Pretty cool, right?
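Since the definitions of `sumsq()` and `x` do not appear in this section, here is a self-contained sketch under stated assumptions: `sumsq()` is taken to be a sum of squared standardized deviations, and `x` is 100 simulated normal values. Both are guesses made for illustration, so the numbers will differ from the output shown above.

```r
# Assumed definition: sum of squared standardized deviations of x
sumsq <- function(mu, sigma, x) {
    sum(((x - mu) / sigma)^2)
}

set.seed(1)        # arbitrary seed
x <- rnorm(100)    # assumed data

# Without vectorization, vector-valued mu and sigma are recycled against x
# and everything collapses into a single number
collapsed <- sumsq(1:10, 1:10, x)
length(collapsed)

# Vectorize() builds a version that returns one value per (mu, sigma) pair
vsumsq <- Vectorize(sumsq, c("mu", "sigma"))
per_pair <- vsumsq(1:10, 1:10, x)
length(per_pair)

# Check against an explicit element-by-element computation
manual <- sapply(1:10, function(i) sumsq(i, i, x))
all.equal(per_pair, manual)
```

Only `mu` and `sigma` are listed in the second argument of `Vectorize()`, so `x` is passed whole to every call instead of being looped over.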

# Summary

- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form.
- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating and returning the results.
- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere.
- The `split()` function can be used to divide an R object into subsets determined by another variable, which can subsequently be looped over using loop functions.

# Post-lecture materials

### Final Questions

Here are some post-lecture questions to help you think about the material discussed.