Tidy data and the Tidyverse

module 1 week 2 R Programming tidyr here tidyverse

Introduction to tidy data and how to convert between wide and long data with the tidyr R package.

Stephanie Hicks https://stephaniehicks.com/ (Department of Biostatistics, Johns Hopkins)https://www.jhsph.edu
09-09-2021

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Pre-lecture materials

Read ahead

Before class, you can prepare by reading the following materials:

  1. Tidy Data paper published in the Journal of Statistical Software
  2. https://r4ds.had.co.nz/tidy-data.html
  3. tidyr cheat sheet from RStudio

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

At the end of this lesson you will:

Tidy data

As we learned in the last lesson, one unifying concept of the tidyverse is the notion of tidy data. As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

Artwork by Allison Horst on tidy data

Figure 1: Artwork by Allison Horst on tidy data

[Source: Artwork by Allison Horst]

The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy. In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience). Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.

As a quick example, consider the following data illustrating religion and income survey data with the number of respondees with income range in column name. This is in a classic table format:

library(tidyr)
relig_income
# A tibble: 18 × 11
   religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Agnostic       27        34        60        81        76       137
 2 Atheist        12        27        37        52        35        70
 3 Buddhist       27        21        30        34        33        58
 4 Catholic      418       617       732       670       638      1116
 5 Don’t kn…      15        14        15        11        10        35
 6 Evangeli…     575       869      1064       982       881      1486
 7 Hindu           1         9         7         9        11        34
 8 Historic…     228       244       236       238       197       223
 9 Jehovah'…      20        27        24        24        21        30
10 Jewish         19        19        25        25        30        95
11 Mainline…     289       495       619       655       651      1107
12 Mormon         29        40        48        51        56       112
13 Muslim          6         7         9        10         9        23
14 Orthodox       13        17        23        32        32        47
15 Other Ch…       9         7        11        13        13        14
16 Other Fa…      20        33        40        46        49        63
17 Other Wo…       5         2         3         4         2         7
18 Unaffili…     217       299       374       365       341       528
# … with 4 more variables: $75-100k <dbl>, $100-150k <dbl>,
#   >150k <dbl>, Don't know/refused <dbl>

While this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy. This format violates the tidy form because there are variables in the columns. In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.

Converting this data to tidy format would give us

library(tidyverse)

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income))
# A tibble: 180 × 3
   religion income             respondents
   <fct>    <fct>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows

Some of these functions you have seen before, others might be new to you. Let’s talk about each one in the context of the Tidyverse package.

The “Tidyverse”

There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is sometimes referred to as the “tidyverse” because of their dependence on and presumption of tidy data. “Tidyverse” packages include:

We will be using these packages quite a bit this week.

The “tidyverse” package can be used to install all of the packages in the tidyverse at once. For example, instead of starting an R script with this:

You can start with this:

In the code above, let’s talk about what we did using the pivot_longer() function. We will also talk about pivot_wider().

pivot_longer() and pivot_wider()

The tidyr package includes functions to transfer a data frame between long and wide.

In the section above, we showed an example that used pivot_longer() to convert data into a tidy format.

The key problem with the tidyness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.

To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, with the column names gathered into an income column. When gathering, exclude any columns that you do not want “gathered” (religion in this case) by including the column names with a the minus sign in the pivot_longer() function. For example:

# Gather everything EXCEPT religion to tidy data
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents")
# A tibble: 180 × 3
   religion income             respondents
   <chr>    <chr>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows

Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.

The pivot_wider() function is less commonly needed to tidy data. It can, however, be useful for creating summary tables. For example, you use the summarize() function in dplyr to summarize the total number of respondents per income category.

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income)) %>% 
  group_by(income) %>% 
  summarize(total_respondents = sum(respondents)) %>%
  pivot_wider(names_from = "income", 
              values_from = "total_respondents") %>%
  knitr::kable()
<$10k >150k $10-20k $100-150k $20-30k $30-40k $40-50k $50-75k $75-100k Don’t know/refused
1930 2608 2781 3197 3357 3302 3085 5185 3990 6121

Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report. In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).

Example: Let’s try another dataset. This data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?

# try it yourself

One more! Try using pivot_longer() to convert the the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.

Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.

Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).

df <- tibble(
  "company" = rep(1:3, each=4), 
  "year"  = rep(2006:2009, 3),
  "Q1"    = sample(x = 0:100, size = 12),
  "Q2"    = sample(x = 0:100, size = 12),
  "Q3"    = sample(x = 0:100, size = 12),
  "Q4"    = sample(x = 0:100, size = 12),
)
df
# A tibble: 12 × 6
   company  year    Q1    Q2    Q3    Q4
     <int> <int> <int> <int> <int> <int>
 1       1  2006   100    90    24    86
 2       1  2007    23    29    30    67
 3       1  2008    42    39    53    77
 4       1  2009    98    98    60    87
 5       2  2006     8    97    17    73
 6       2  2007    15    83    18     8
 7       2  2008     7    12    38    72
 8       2  2009    22    49    99    82
 9       3  2006    65    28    39    22
10       3  2007    20    33    14    56
11       3  2008    56    78    72    44
12       3  2009    36    67    91    42
# try it yourself 

separate() and unite()

The same tidyr package also contains two useful functions:

First, we combine the first three columns into one new column using unite().

gapminder %>% 
  unite(col="country_continent_year", country:year, sep="_")
# A tibble: 1,704 × 4
   country_continent_year lifeExp      pop gdpPercap
   <chr>                    <dbl>    <int>     <dbl>
 1 Afghanistan_Asia_1952     28.8  8425333      779.
 2 Afghanistan_Asia_1957     30.3  9240934      821.
 3 Afghanistan_Asia_1962     32.0 10267083      853.
 4 Afghanistan_Asia_1967     34.0 11537966      836.
 5 Afghanistan_Asia_1972     36.1 13079460      740.
 6 Afghanistan_Asia_1977     38.4 14880372      786.
 7 Afghanistan_Asia_1982     39.9 12881816      978.
 8 Afghanistan_Asia_1987     40.8 13867957      852.
 9 Afghanistan_Asia_1992     41.7 16317921      649.
10 Afghanistan_Asia_1997     41.8 22227415      635.
# … with 1,694 more rows

Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments.

gapminder %>% 
  unite(col="country_continent_year", country:year, sep="_") %>% 
  separate(col="country_continent_year", into=c("country", "continent", "year"), sep="_")
# A tibble: 1,704 × 6
   country     continent year  lifeExp      pop gdpPercap
   <chr>       <chr>     <chr>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia      1952     28.8  8425333      779.
 2 Afghanistan Asia      1957     30.3  9240934      821.
 3 Afghanistan Asia      1962     32.0 10267083      853.
 4 Afghanistan Asia      1967     34.0 11537966      836.
 5 Afghanistan Asia      1972     36.1 13079460      740.
 6 Afghanistan Asia      1977     38.4 14880372      786.
 7 Afghanistan Asia      1982     39.9 12881816      978.
 8 Afghanistan Asia      1987     40.8 13867957      852.
 9 Afghanistan Asia      1992     41.7 16317921      649.
10 Afghanistan Asia      1997     41.8 22227415      635.
# … with 1,694 more rows

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions:

  1. Using prose, describe how the variables and observations are organised in a tidy dataset versus an non-tidy dataset.

  2. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

  2. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?

Additional Resources

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hicks (2021, Sept. 9). Statistical Computing: Tidy data and the Tidyverse. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-07-tidy-data-and-the-tidyverse/

BibTeX citation

@misc{hicks2021tidy,
  author = {Hicks, Stephanie},
  title = {Statistical Computing: Tidy data and the Tidyverse},
  url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-07-tidy-data-and-the-tidyverse/},
  year = {2021}
}