Functions

module 2 week 4 programming functions

Introduction to writing functions in R.

Stephanie Hicks https://stephaniehicks.com/ (Department of Biostatistics, Johns Hopkins) https://www.jhsph.edu
09-21-2021

Pre-lecture materials

Read ahead

Before class, you can prepare by reading the following materials:

  1. https://r4ds.had.co.nz/functions.html
  2. https://adv-r.hadley.nz/functions.html?#functions
  3. https://swcarpentry.github.io/r-novice-inflammation/02-func-R/

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

At the end of this lesson you will:

Introduction

Writing functions is a core activity of an R programmer. It represents the key step of the transition from a mere “user” to a developer who creates new functionality for R. Functions are often used to encapsulate a sequence of expressions that need to be executed numerous times, perhaps under slightly different conditions. Functions are also often written when code must be shared with others or the public.

Writing a function allows a developer to create an interface to the code that is explicitly specified with a set of arguments (or parameters). This interface provides an abstraction of the code to potential users. This abstraction simplifies the users’ lives because it relieves them from having to know every detail of how the code operates. In addition, the creation of an interface allows the developer to communicate to the user which aspects of the code are important or most relevant.

Functions in R

Functions in R are “first class objects”, which means that they can be treated much like any other R object.

Important facts about R functions:

If you are familiar with common languages like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.
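As a minimal sketch of what “first class” can mean in practice, the code below stores a function in a variable and passes it to another function. The names apply_twice and add_one are made up for illustration and are not part of the lecture; the function() syntax itself is explained in the next section.

```r
## Hypothetical helper: applies the function 'fn' to 'x' two times
apply_twice <- function(fn, x) {
        fn(fn(x))
}

## A function is an object, so it can be passed as an argument
add_one <- function(x) x + 1
result <- apply_twice(add_one, 5)
print(result)

## Functions can also be handed to built-in functions such as sapply()
squares <- sapply(1:3, function(i) i^2)
print(squares)
```

Here add_one is treated like any other value: it is bound to a name and then handed to apply_twice() as data.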

Your First Function

Functions are defined using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class “function”.

Here’s a simple function that takes no arguments and does nothing.

f <- function() {
        ## This is an empty function
}
## Functions have their own class
class(f)  
[1] "function"
## Execute this function
f()       
NULL

Not very interesting, but it is a start. The next thing we can do is create a function that actually has a non-trivial function body.

f <- function() {
        # this is the function body
        cat("Hello, world!\n") 
}
f()
Hello, world!

The last aspect of a basic function is the function arguments. These are the options that you expose to the user and that the user may explicitly set. For this basic function, we can add an argument that determines how many times “Hello, world!” is printed to the console.

f <- function(num) {
        for(i in seq_len(num)) {
                cat("Hello, world!\n")
        }
}
f(3)
Hello, world!
Hello, world!
Hello, world!

Obviously, we could have just cut-and-pasted the cat("Hello, world!\n") code three times to achieve the same effect, but then we wouldn’t be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see “Hello, world!”.

Pro tip: if you find yourself doing a lot of cutting and pasting, that’s usually a good sign that you might need to write a function.

Finally, the function above doesn’t return anything. It just prints “Hello, world!” to the console num number of times and then exits. But often it is useful if a function returns something that perhaps can be fed into another section of code.

This next function returns the total number of characters printed to the console.

f <- function(num) {
        hello <- "Hello, world!\n"
        for(i in seq_len(num)) {
                cat(hello)
        }
        chars <- nchar(hello) * num
        chars
}
meaningoflife <- f(3)
Hello, world!
Hello, world!
Hello, world!
print(meaningoflife)
[1] 42

In the above function, we did not have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the chars variable is the last expression that is evaluated in this function, that becomes the return value of the function.

Note that there is a return() function that can be used to explicitly return a value from a function, but it is rarely used in R (we will discuss it a bit later in this lesson).
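As a rough sketch of one situation where return() may be convenient, the function below exits early for invalid input and otherwise relies on the last evaluated expression. The name safe_sqrt is hypothetical and chosen only for this example.

```r
## Hypothetical example: 'safe_sqrt' is not part of the lecture's code
safe_sqrt <- function(x) {
        if (x < 0) {
                ## Exit early with an explicit return value
                return(NA_real_)
        }
        ## Otherwise the last evaluated expression is the return value
        sqrt(x)
}
safe_sqrt(-4)
safe_sqrt(9)
```

The first call leaves the function through return(), while the second reaches the end of the body and returns the value of sqrt(x).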

Finally, in the above function, the user must specify the value of the argument num. If it is not specified by the user, R will throw an error.

f()
Error in f(): argument "num" is missing, with no default

We can modify this behavior by setting a default value for the argument num. Any function argument can have a default value, if you wish to specify it. Sometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. This relieves the user from having to specify the value of that argument every single time the function is called.

Here, for example, we could set the default value for num to be 1, so that if the function is called without the num argument being explicitly specified, then it will print “Hello, world!” to the console once.

f <- function(num = 1) {
        hello <- "Hello, world!\n"
        for(i in seq_len(num)) {
                cat(hello)
        }
        chars <- nchar(hello) * num
        chars
}
f()    ## Use default value for 'num'
Hello, world!
[1] 14
f(2)   ## Use user-specified value
Hello, world!
Hello, world!
[1] 28

Remember that the function still returns the number of characters printed to the console.

Pro tip: The formals() function returns a list of all the formal arguments of a function.
formals(f)
$num
[1] 1

Summary

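We have written a function that has one formal argument named num with a default value of 1, prints the message “Hello, world!” to the console the number of times indicated by num, and returns the number of characters printed to the console.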

Arguments

Named arguments

Functions have named arguments, which can optionally have default values. Because all function arguments have names, they can be specified using their name.

f(num = 2)
Hello, world!
Hello, world!
[1] 28

Specifying an argument by its name is sometimes useful if a function has many arguments and it may not always be clear which argument is being specified. Here, our function only has one argument so there’s no confusion.

Argument matching

Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it’s really handy when doing interactive work at the command line. R function arguments can be matched positionally or by name. Positional matching just means that R assigns the first value to the first argument, the second value to the second argument, etc. So in the following call to rnorm()

str(rnorm)
function (n, mean = 0, sd = 1)  
mydata <- rnorm(100, 2, 1)              ## Generate some data

100 is assigned to the n argument, 2 is assigned to the mean argument, and 1 is assigned to the sd argument, all by positional matching.

The following calls to the sd() function (which computes the empirical standard deviation of a vector of numbers) are all equivalent. Note that sd() has two arguments: x indicates the vector of numbers and na.rm is a logical indicating whether missing values should be removed or not.

## Positional match first argument, default for 'na.rm'
sd(mydata)                     
[1] 1.029314
## Specify 'x' argument by name, default for 'na.rm'
sd(x = mydata)                 
[1] 1.029314
## Specify both arguments by name
sd(x = mydata, na.rm = FALSE)  
[1] 1.029314

When specifying the function arguments by name, it doesn’t matter in what order you specify them. In the example below, we specify the na.rm argument first, followed by x, even though x is the first argument defined in the function definition.

## Specify both arguments by name
sd(na.rm = FALSE, x = mydata)     
[1] 1.029314

You can mix positional matching with matching by name. When an argument is matched by name, it is “taken out” of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition.

sd(na.rm = FALSE, mydata)
[1] 1.029314

Here, the mydata object is assigned to the x argument, because it’s the only argument not yet specified.

Pro tip: The args() function displays the argument names and corresponding default values of a function

args(f)
function (num = 1) 
NULL

Below is the argument list for the lm() function, which fits linear models to a dataset.

args(lm)
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
NULL

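The following two calls are equivalent (here mydata is assumed to be a data frame with columns x and y, not the numeric vector generated earlier).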

lm(data = mydata, y ~ x, model = FALSE, 1:100)
lm(y ~ x, mydata, 1:100, model = FALSE)

Even though it’s legal, I don’t recommend messing around with the order of the arguments too much, since it can lead to some confusion.
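To check the equivalence of the two lm() calls, here is a small sketch that uses a simulated data frame. The object sim_data and its columns are assumptions made for this example, not data from the lecture.

```r
## Hypothetical data frame with columns 'x' and 'y' (made up for this sketch)
set.seed(1)
sim_data <- data.frame(x = rnorm(100))
sim_data$y <- 2 * sim_data$x + rnorm(100)

## Same arguments, supplied in different orders
fit1 <- lm(data = sim_data, y ~ x, model = FALSE, 1:100)
fit2 <- lm(y ~ x, sim_data, 1:100, model = FALSE)

## Compare the estimated coefficients of the two fits
same_fit <- all.equal(coef(fit1), coef(fit2))
print(same_fit)
```

In the first call, data and model are matched by name and removed from the list, so the unnamed values y ~ x and 1:100 fall into the remaining positions formula and subset.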

Most of the time, named arguments are useful on the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list. Named arguments also help if you can remember the name of the argument and not its position on the argument list. For example, plotting functions often have a lot of options to allow for customization, but this makes it difficult to remember exactly the position of every argument on the argument list.

Function arguments can also be partially matched, which is useful for interactive work. The order of operations when given an argument is

  1. Check for exact match for a named argument
  2. Check for a partial match
  3. Check for a positional match

Partial matching should be avoided when writing longer code or programs, because it may lead to confusion if someone is reading the code. However, partial matching is very useful when calling functions interactively that have very long argument names.
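The three matching rules can be illustrated with a small sketch. The function scale_values and its argument multiplier are hypothetical names used only here.

```r
## Hypothetical function with a long argument name
scale_values <- function(x, multiplier = 1) {
        x * multiplier
}

## 1. Exact match on the argument name
exact <- scale_values(1:3, multiplier = 2)
## 2. Partial match: 'mult' is a unique prefix of 'multiplier'
partial <- scale_values(1:3, mult = 2)
## 3. Positional match: the second value goes to the second argument
positional <- scale_values(1:3, 2)

identical(exact, partial)
identical(exact, positional)
```

All three calls produce the same vector, but only the first one states unambiguously which argument is being set.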

Lazy Evaluation

Arguments to functions are evaluated lazily, so they are evaluated only as needed in the body of the function.

In this example, the function f() has two arguments: a and b.

f <- function(a, b) {
        a^2
} 
f(2)
[1] 4

This function never actually uses the argument b, so calling f(2) will not produce an error because the 2 gets positionally matched to a. This behavior can be good or bad. It’s common to write a function that doesn’t use an argument and not notice it simply because R never throws an error.

This example also shows lazy evaluation at work, but does eventually result in an error.

f <- function(a, b) {
        print(a)
        print(b)
}
f(45)
[1] 45
Error in print(b): argument "b" is missing, with no default

Notice that “45” got printed first before the error was triggered. This is because b did not have to be evaluated until after print(a). Once the function tried to evaluate print(b) the function had to throw an error.
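Lazy evaluation also applies to default values. In the sketch below, which uses the hypothetical functions g and h, the default for b is an expression that would raise an error, yet the error only occurs when b is actually used.

```r
## Hypothetical function: the default for 'b' would raise an error if it
## were ever evaluated
g <- function(a, b = stop("'b' was evaluated")) {
        a^2      ## 'b' is never used, so its default is never evaluated
}
g(2)

## Same default, but this time the body needs 'b'
h <- function(a, b = stop("'b' was evaluated")) {
        a + b
}
msg <- tryCatch(h(2), error = function(e) conditionMessage(e))
print(msg)
```

The call g(2) completes without complaint, while h(2) triggers the error at the moment b is needed in a + b.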

The ... Argument

There is a special argument in R known as the ... argument, which indicates a variable number of arguments that are usually passed on to other functions. The ... argument is often used when extending another function and you don’t want to copy the entire argument list of the original function.

For example, a custom plotting function may want to make use of the default plot() function along with its entire argument list. The function below changes the default for the type argument to the value type = "l" (the original default was type = "p").

myplot <- function(x, y, type = "l", ...) {
        plot(x, y, type = type, ...)         ## Pass '...' to 'plot' function
}
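The same pattern works for functions that do not draw plots, which makes it easier to try out at the console. The wrapper sort_desc below is a hypothetical example that changes the default of the decreasing argument of sort() and forwards everything else through the ... argument.

```r
## Hypothetical wrapper around sort() that changes the default for 'decreasing'
sort_desc <- function(x, decreasing = TRUE, ...) {
        sort(x, decreasing = decreasing, ...)   ## Pass '...' to 'sort' function
}

values <- c(3, NA, 1, 2)
sort_desc(values)                   ## Missing values are dropped by sort()
sort_desc(values, na.last = TRUE)   ## 'na.last' travels through '...' to sort()
```

The wrapper never mentions na.last in its own argument list, but callers can still set it because it is handed on unchanged.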

Generic functions use ... so that extra arguments can be passed to methods.

mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x7f8692d35048>
<environment: namespace:base>

The ... argument is necessary when the number of arguments passed to the function cannot be known in advance. This is clear in functions like paste() and cat().

args(paste)
function (..., sep = " ", collapse = NULL, recycle0 = FALSE) 
NULL
args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
NULL

Because both paste() and cat() combine multiple character vectors together (paste() returns the combined result, while cat() prints it to the console), it is impossible for those functions to know in advance how many character vectors will be passed to the function by the user. So the first argument to either function is ....
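When writing your own function with a variable number of arguments, one common approach is to collect them with list(...). The function count_args below is a hypothetical sketch of that idea.

```r
## Hypothetical function that accepts any number of arguments
count_args <- function(...) {
        dots <- list(...)       ## Collect everything passed via '...'
        length(dots)
}
count_args()
count_args("a", "b", "c")
```

Once the arguments are in a list, they can be inspected, looped over, or passed on to another function.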

Arguments Coming After the ... Argument

One catch with ... is that any arguments that appear after ... on the argument list must be named explicitly and cannot be partially matched or matched positionally.

Take a look at the arguments to the paste() function.

args(paste)
function (..., sep = " ", collapse = NULL, recycle0 = FALSE) 
NULL

With the paste() function, the arguments sep and collapse must be named explicitly and in full if the default values are not going to be used.

Here I specify that I want “a” and “b” to be pasted together and separated by a colon.

paste("a", "b", sep = ":")
[1] "a:b"

If I don’t specify the sep argument in full and attempt to rely on partial matching, I don’t get the expected result.

paste("a", "b", se = ":")
[1] "a b :"

Functions are for humans and computers

As you start to write your own functions, it’s important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.

The name of a function is important. In an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.

The function names should be verbs, and arguments should be nouns.

There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. mean() is better than compute_mean()). A good sign that a noun might be a better choice is if you are using a very broad verb like “get”, “compute”, “calculate”, or “determine”. Use your best judgement and do not be afraid to rename a function if you figure out a better name later.

# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()

If your function name is composed of multiple words, use “snake_case”, where each lowercase word is separated by an underscore. “camelCase” is a popular alternative. It does not really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it. R itself is not very consistent, but there is nothing you can do about that. Make sure you do not fall into the same trap by making your code as consistent as possible.

# Never do this!
col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}

If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.

# Good
input_select()
input_checkbox()
input_text()

# Not so good
select_input()
checkbox_input()
text_input()

Where possible, avoid overriding existing functions and variables. It is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.

# Don't do this!
T <- FALSE
c <- 10
mean <- function(x) sum(x)

Use comments, lines starting with #, to explain the “why” of your code. You generally should avoid comments that explain the “what” or the “how”. If you can’t understand what the code does from reading it, you should think about how to rewrite it to be more clear.

Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn’t work? It’s a great idea to capture that sort of thinking in a comment.

Environment

The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it’s important to know a little bit about environments because they are crucial to how functions work.

The environment of a function controls how R finds the value associated with a name.

For example, take this function:

f <- function(x) {
  x + y
} 

In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:

y <- 100
f(10)
[1] 110

y <- 1000
f(10)
[1] 1010

This behavior seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it does not cause too many problems (especially if you regularly restart R to get to a clean slate).
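The phrase “the environment where the function was defined” matters most when a function is created inside another function. The following sketch, with the hypothetical names make_adder and add_five, shows that the inner function finds y in the environment of make_adder() rather than in the global environment.

```r
## Hypothetical function factory: 'make_adder' is made up for this sketch
make_adder <- function(y) {
  ## The inner function is defined here, so it looks up 'y' in this
  ## environment before looking anywhere else
  function(x) {
    x + y
  }
}

y <- 100
add_five <- make_adder(5)
add_five(10)    ## Uses y = 5 from make_adder(), not the global y = 100
```

Passing y in as an argument like this is also one way to avoid depending on a global variable.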

The advantage of this behavior is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For f() that includes the behavior of two things that you might not expect: { and +. This allows you to do devious things like:

`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))

  3 3.3 
104 896 
rm(`+`)

This is a common phenomenon in R. R places few limits on your power. You can do many things that you can’t do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible.

Pro tip: If you are interested in learning more about scoping, check out

Summary

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions:

  1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))

x / sum(x, na.rm = TRUE)
  2. Read the complete lyrics to “Little Bunny Foo Foo”. There is a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.

  3. Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

  4. What does the trim argument to mean() do? When might you use it?

  5. The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

Additional Resources

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hicks (2021, Sept. 21). Statistical Computing: Functions. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-21-functions/

BibTeX citation

@misc{hicks2021functions,
  author = {Hicks, Stephanie},
  title = {Statistical Computing: Functions},
  url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-21-functions/},
  year = {2021}
}