Tidy data and the Tidyverse

Introduction to tidy data and how to convert between wide and long data with the tidyr R package
module 1
week 2
R
programming
tidyr
here
tidyverse
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

September 8, 2022

“Happy families are all alike; every unhappy family is unhappy in its own way.” —- Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” —- Hadley Wickham

Pre-lecture materials

Read ahead

Read ahead

Before class, you can prepare by reading the following materials:

  1. Tidy Data paper published in the Journal of Statistical Software
  2. https://r4ds.had.co.nz/tidy-data
  3. tidyr cheat sheet from RStudio

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

Learning objectives

At the end of this lesson you will:

  • Define tidy data
  • Be able to transform non-tidy data into tidy data
  • Be able to transform wide data into long data
  • Be able to separate character columns into multiple columns
  • Be able to unite multiple character columns into one column

Tidy data

As we learned in the last lesson, one unifying concept of the tidyverse is the notion of tidy data.

As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

Artwork by Allison Horst on tidy data

[Source: Artwork by Allison Horst]

The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy.

In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).

  • Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
Example

As a quick example, consider the following data illustrating religion and income survey data with the number of respondents with income range in column name.

This is in a classic table format:

library(tidyr)
relig_income
# A tibble: 18 × 11
   religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Agnostic           27      34      60      81      76     137     122     109
 2 Atheist            12      27      37      52      35      70      73      59
 3 Buddhist           27      21      30      34      33      58      62      39
 4 Catholic          418     617     732     670     638    1116     949     792
 5 Don’t know/r…      15      14      15      11      10      35      21      17
 6 Evangelical …     575     869    1064     982     881    1486     949     723
 7 Hindu               1       9       7       9      11      34      47      48
 8 Historically…     228     244     236     238     197     223     131      81
 9 Jehovah's Wi…      20      27      24      24      21      30      15      11
10 Jewish             19      19      25      25      30      95      69      87
11 Mainline Prot     289     495     619     655     651    1107     939     753
12 Mormon             29      40      48      51      56     112      85      49
13 Muslim              6       7       9      10       9      23      16       8
14 Orthodox           13      17      23      32      32      47      38      42
15 Other Christ…       9       7      11      13      13      14      18      14
16 Other Faiths       20      33      40      46      49      63      46      40
17 Other World …       5       2       3       4       2       7       3       4
18 Unaffiliated      217     299     374     365     341     528     407     321
# … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
#   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
#   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`

While this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy.

This format violates the tidy form because there are variables in the columns.

  • In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.

Converting this data to tidy format would give us

library(tidyverse)

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income))
# A tibble: 180 × 3
   religion income             respondents
   <fct>    <fct>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows

Some of these functions you have seen before, others might be new to you. Let’s talk about each one in the context of the tidyverse R packages.

The “Tidyverse”

There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is often referred to as the “tidyverse” because of their dependence on and presumption of tidy data.

Note

A subset of the “Tidyverse” packages include:

  • ggplot2: a plotting system based on the grammar of graphics

  • magrittr: defines the %>% operator for chaining functions together in a series of operations on data

  • dplyr: a suite of (fast) functions for working with data frames

  • tidyr: easily tidy data with pivot_wider() and pivot_longer() functions (also separate() and unite())

A complete list can be found here (https://www.tidyverse.org/packages).

We will be using these packages quite a bit.

The “tidyverse” package can be used to install all of the packages in the tidyverse at once.

For example, instead of starting an R script with this:

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

You can start with this:

library(tidyverse)

In the example above, let’s talk about what we did using the pivot_longer() function.

We will also talk about pivot_wider().

pivot_longer()

The tidyr package includes functions to transfer a data frame between long and wide.

  • Wide format data tends to have different attributes or variables describing an observation placed in separate columns.
  • Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains tha values of the observation at those different levels.
Example

In the section above, we showed an example that used pivot_longer() to convert data into a tidy format.

The key problem with the tidyness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.

To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, here with the column names gathered into an income column.

Note: when gathering, exclude any columns that you do not want “gathered” (religion in this case) by including the column names with a the minus sign in the pivot_longer() function.

For example:

# Gather everything EXCEPT religion to tidy data
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents")
# A tibble: 180 × 3
   religion income             respondents
   <chr>    <chr>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows

Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.

pivot_wider()

The pivot_wider() function is less commonly needed to tidy data. It can, however, be useful for creating summary tables.

Example

You use the summarize() function in dplyr to summarize the total number of respondents per income category.

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income)) %>% 
  group_by(income) %>% 
  summarize(total_respondents = sum(respondents)) %>%
  pivot_wider(names_from = "income", 
              values_from = "total_respondents") %>%
  knitr::kable()
<$10k >150k $10-20k $100-150k $20-30k $30-40k $40-50k $50-75k $75-100k Don’t know/refused
1930 2608 2781 3197 3357 3302 3085 5185 3990 6121

Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report.

Note

In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).

Example of pivot_longer()

Let’s try another dataset. This data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?

# try it yourself
Example

One more! Try using pivot_longer() to convert the the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.

Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.

Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).

df <- tibble(
  "company" = rep(1:3, each=4), 
  "year"  = rep(2006:2009, 3),
  "Q1"    = sample(x = 0:100, size = 12),
  "Q2"    = sample(x = 0:100, size = 12),
  "Q3"    = sample(x = 0:100, size = 12),
  "Q4"    = sample(x = 0:100, size = 12),
)
df
# A tibble: 12 × 6
   company  year    Q1    Q2    Q3    Q4
     <int> <int> <int> <int> <int> <int>
 1       1  2006    94    95    50    87
 2       1  2007    73    16    86    82
 3       1  2008    64    50    85    61
 4       1  2009     6     4    87     1
 5       2  2006    88    24    74    52
 6       2  2007    32    98    52    94
 7       2  2008     4    42    79     3
 8       2  2009    23    97    71    76
 9       3  2006     1    75    97    66
10       3  2007    30    87    58     5
11       3  2008    17    91    61    70
12       3  2009    14    12     1    60
# try it yourself 

separate() and unite()

The same tidyr package also contains two useful functions:

  • unite(): combine contents of two or more columns into a single column
  • separate(): separate contents of a column into two or more columns

First, we combine the first three columns into one new column using unite().

gapminder %>% 
  unite(col="country_continent_year", 
        country:year, 
        sep="_")
# A tibble: 1,704 × 4
   country_continent_year lifeExp      pop gdpPercap
   <chr>                    <dbl>    <int>     <dbl>
 1 Afghanistan_Asia_1952     28.8  8425333      779.
 2 Afghanistan_Asia_1957     30.3  9240934      821.
 3 Afghanistan_Asia_1962     32.0 10267083      853.
 4 Afghanistan_Asia_1967     34.0 11537966      836.
 5 Afghanistan_Asia_1972     36.1 13079460      740.
 6 Afghanistan_Asia_1977     38.4 14880372      786.
 7 Afghanistan_Asia_1982     39.9 12881816      978.
 8 Afghanistan_Asia_1987     40.8 13867957      852.
 9 Afghanistan_Asia_1992     41.7 16317921      649.
10 Afghanistan_Asia_1997     41.8 22227415      635.
# … with 1,694 more rows

Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments.

gapminder %>% 
  unite(col="country_continent_year", 
        country:year, 
        sep="_") %>% 
  separate(col="country_continent_year", 
           into=c("country", "continent", "year"), 
           sep="_")
# A tibble: 1,704 × 6
   country     continent year  lifeExp      pop gdpPercap
   <chr>       <chr>     <chr>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia      1952     28.8  8425333      779.
 2 Afghanistan Asia      1957     30.3  9240934      821.
 3 Afghanistan Asia      1962     32.0 10267083      853.
 4 Afghanistan Asia      1967     34.0 11537966      836.
 5 Afghanistan Asia      1972     36.1 13079460      740.
 6 Afghanistan Asia      1977     38.4 14880372      786.
 7 Afghanistan Asia      1982     39.9 12881816      978.
 8 Afghanistan Asia      1987     40.8 13867957      852.
 9 Afghanistan Asia      1992     41.7 16317921      649.
10 Afghanistan Asia      1997     41.8 22227415      635.
# … with 1,694 more rows

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions
  1. Using prose, describe how the variables and observations are organised in a tidy dataset versus an non-tidy dataset.

  2. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

  2. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?

Additional Resources

Tip