Introduction to R and RStudio!

Let’s dig into the R programming language and the RStudio integrated developer environment
module 1
week 1
R
programming
RStudio
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

August 30, 2022

There are only two kinds of languages: the ones people complain about and the ones nobody uses. —Bjarne Stroustrup

Pre-lecture materials

Read ahead

Read ahead

Before class, you can prepare by reading the following materials:

  1. An overview and history of R from Roger Peng
  2. Installing R and RStudio from Rafael Irizarry
  3. Getting Started in R and RStudio from Rafael Irizarry

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

Learning objectives

At the end of this lesson you will:

  • Learn about (some of) the history of R.
  • Identify some of the strengths and weaknesses of R.
  • Install R and Rstudio on your computer.
  • Know how to install and load R packages.

Overview and history of R

Below is a very quick introduction to R, to get you set up and running. We’ll go deeper into R and coding later.

tl;dr (R in a nutshell)

Like every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:

  • R is open-source, freely accessible, and cross-platform (multiple OS).
  • R is a “high-level” programming language, relatively easy to learn.
    • While “Low-level” programming languages (e.g. Fortran, C, etc) often have more efficient code, they can also be harder to learn because it is designed to be close to a machine language.
    • In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.
  • R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!
  • R integrates easily with document preparation systems like \(\LaTeX\), but R files can also be used to create .docx, .pdf, .html, .ppt files with integrated R code output and graphics.
  • The R Community is very dynamic, helpful and welcoming.
  • Through R packages, it is easy to get lots of state-of-the-art algorithms.
  • Documentation and help files for R are generally good.

While we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).

Depending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.

With the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.

Artwork by Allison Horst on learning R

[Source: Artwork by Allison Horst]

Basic Features of R

Today R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.

One nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicates active development of the software and ensures that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new feature, bug fixes, or both.

Another key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R’s ability to create “publication quality” graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. R’s base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like lattice and ggplot2 allow for complex and sophisticated visualizations of high-dimensional data.

R has maintained the original S philosophy (see box below), which is that it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.

Tip

For a great discussion on an overview and history of R and the S programming language, read through this chapter from Roger D. Peng.

Finally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter #rstats, #rtistry, and Reddit.

Free Software

A major advantage that R has over many other statistical packages and is that it’s free in the sense of free software (it’s also free in the sense of free beer). The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License version 2.0.

According to the Free Software Foundation, with free software, you are granted the following four freedoms

  • The freedom to run the program, for any purpose (freedom 0).

  • The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.

  • The freedom to redistribute copies so you can help your neighbor (freedom 2).

  • The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.

Tip

You can visit the Free Software Foundation’s web site to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web site is an interesting read if you happen to have some spare time.

Design of the R System

The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.

The R system is divided into 2 conceptual parts:

  1. The “base” R system that you download from CRAN:
  1. Everything else.

R functionality is divided into a number of packages.

  • The “base” R system contains, among other things, the base package which is required to run R and contains the most fundamental functions.

  • The other packages contained in the “base” system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.

  • There are also “Recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.

When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:

  • There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.

  • There are also many packages associated with the Bioconductor project.

  • People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.

Questions
  1. How many R packages are on CRAN today?
  2. How many R packages are on Bioconductor today?
  3. How many R packages are on GitHub today?

Limitations of R

No programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on almost 50 year old technology, going back to the original S system developed at Bell Labs. There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the “old days”).

Another commonly cited limitation of R is that objects must generally be stored in physical memory (though this is increasingly not true anymore). This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.

At a higher level one “limitation” of R is that its functionality is based on consumer demand and (voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.

Using R and RStudio

If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. — Nicholas Tierney

[Source]

The RStudio layout has the following features:

  • On the upper left, something called a Rmarkdown script
  • On the lower left, the R console
  • On the lower right, the view for files, plots, packages, help, and viewer.
  • On the upper right, the environment / history pane

A screenshot of the RStudio integrated developer environment (IDE) – aka the working environment

The R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we’ll learn about these files later).

The file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.

Installing R and RStudio

  • If you have not already, install R first. If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).
  • Once you have R installed, install the free version of RStudio Desktop. Again, make sure it’s a recent version.
Tip

Installing R and RStudio should be fairly straightforward. However, a great set of detailed instructions is in Rafael Irizarry’s dsbook

If things don’t work, ask for help in the Courseplus discussion board.

I personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).

RStudio default options

To first get set up, I highly recommend changing the following setting

Tools > Global Options (or Cmd + , on macOS)

Under the General tab:

  • For workspace
    • Uncheck restore .RData into workspace at startup
    • Save workspace to .RData on exit : “Never”
  • For History
    • Uncheck “Always save history (even when not saving .RData)
    • Uncheck “Remove duplicate entries in history”

This means that you won’t save the objects and other things that you create in your R session and reload them. This is important for two reasons

  1. Reproducibility: you don’t want to have objects from last week cluttering your session
  2. Privacy: you don’t want to save private data or other things to your session. You only want to read these in.

Your “history” is the commands that you have entered into R.

Additionally, not saving your history means that you won’t be relying on things that you typed in the last session, which is a good habit to get into!

Installing and loading R packages

As we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.

The “official” place for packages is the CRAN website. If you are interested in packages on a specific topic, the CRAN task views provide curated descriptions of packages sorted by topic.

To install an R package from CRAN, one can simply call the install.packages() function and pass the name of the package as an argument. For example, to install the ggplot2 package from CRAN: open RStudio,go to the R prompt (the > symbol) in the lower-left corner and type

install.packages("ggplot2")

and the appropriate version of the package will be installed.

Often, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.

Questions
  1. As you installed the ggplot2 package, what other packages were installed?
  2. What happens if you tried to install GGplot2?

It could be that you already have all packages required by ggplot2 installed. In that case, you will not see any other packages installed. To see which of the packages above ggplot2 needs (and thus installs if it is not present), type into the R console:

tools::package_dependencies("ggplot2")

In RStudio, you can also install (and update/remove) packages by clicking on the ‘Packages’ tab in the bottom right window.

It is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. To install packages from GitHub, you need to install the remotes package and then use the following function

remotes::install_github()

We will not do that now, but it is quite likely that at one point later in this course we will.

You only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the library() command. For instance to load the ggplot2 package, type

library('ggplot2')

You may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.

This was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.

Getting started in RStudio

While one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio are the R Studio Essentials collection of videos.

Tip

For more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the dsbook:

This chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions
  1. If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the “Four Freedoms” of Free Software?

  2. What is an R package and what is it used for?

  3. What function in R can be used to install packages from CRAN?

  4. What is a limitation of the current R system?

Additional Resources

Tip
  • R for Data Science by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.

  • Advanced R by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.

  • RStudio IDE cheatsheet

rtistry

[‘Water Colours’ from Danielle Navarro https://art.djnavarro.net]