4 Workshop

4.1 Overview

In this workshop, you will explore Spotify songs!

Please write up your solutions using R Markdown and knitr, and show all of the code behind your answer to each part.

At the end of the workshop, we will go over the answers.

4.2 Data

The data for this workshop come from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.

[Source: TidyTuesday]

To access the data, install the tidytuesdayR R package and call the function tt_load() with the date '2020-01-21'.

install.packages("tidytuesdayR")

This is how you can download the data.

tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs

However, if you use this code, you will hit an API limit after compiling the document a few times. Instead, I suggest using the code below, which downloads the data only once and saves a local copy:

library(here)
library(tidyverse)

# creates a directory named "data" if it does not already exist locally
if(!dir.exists(here("data"))) { dir.create(here("data")) }

# saves data only once (not each time you knit an R Markdown document)
if(!file.exists(here("data","spotify_songs.RDS"))) {
  url_csv <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
  spotify_songs <- readr::read_csv(url_csv)
  
  # save the data as an RDS file
  saveRDS(spotify_songs, file= here("data","spotify_songs.RDS"))
}

Here we read the saved .RDS file back in from the local data directory:

spotify_songs <- readRDS(here("data","spotify_songs.RDS"))
as_tibble(spotify_songs)

# A tibble: 32,833 × 23
   track_id      track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
   <chr>         <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
 1 6f807x0ima9a… I Don'… Ed She…      66 2oCs0D… I Don'… 2019-0… Pop Re… 37i9dQ…
 2 0r7CVbZTWZgb… Memori… Maroon…      67 63rPSO… Memori… 2019-1… Pop Re… 37i9dQ…
 3 1z1Hg7Vb0AhH… All th… Zara L…      70 1HoSmj… All th… 2019-0… Pop Re… 37i9dQ…
 4 75FpbthrwQmz… Call Y… The Ch…      60 1nqYsO… Call Y… 2019-0… Pop Re… 37i9dQ…
 5 1e8PAfcKUYoK… Someon… Lewis …      69 7m7vv9… Someon… 2019-0… Pop Re… 37i9dQ…
 6 7fvUMiyapMsR… Beauti… Ed She…      67 2yiy9c… Beauti… 2019-0… Pop Re… 37i9dQ…
 7 2OAylPUDDfwR… Never … Katy P…      62 7INHYS… Never … 2019-0… Pop Re… 37i9dQ…
 8 6b1RNvAcJjQH… Post M… Sam Fe…      69 6703SR… Post M… 2019-0… Pop Re… 37i9dQ…
 9 7bF6tCO3gFb8… Tough … Avicii       68 7CvAfG… Tough … 2019-0… Pop Re… 37i9dQ…
10 1IXGILkPm0tO… If I C… Shawn …      67 4Qxzbf… If I C… 2019-0… Pop Re… 37i9dQ…
# … with 32,823 more rows, 14 more variables: playlist_genre <chr>,
#   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
#   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, and abbreviated variable names ¹track_name,
#   ²track_artist, ³track_popularity, ⁴track_album_id, ⁵track_album_name,
#   ⁶track_album_release_date, ⁷playlist_name, ⁸playlist_id

We can take a glimpse at the data:

glimpse(spotify_songs)

Rows: 32,833
Columns: 23
$ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

For all of the questions below, you can ignore the missing values in the dataset. For example, when taking averages, remove the missing values before taking the average, if needed.
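As a minimal sketch of what removing missing values before averaging looks like, consider a toy vector (energy_toy is a made-up name for illustration, not a column from the dataset):

```r
# Hypothetical toy vector standing in for a numeric column with one missing value
energy_toy <- c(0.9, 0.5, NA, 0.7)

# Without na.rm, a single missing value makes the result NA
mean(energy_toy)

# With na.rm = TRUE, missing values are dropped before averaging
mean(energy_toy, na.rm = TRUE)
```

The same na.rm = TRUE argument works inside a summarise() call.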

4.3 Tasks

Use functions from dplyr and ggplot2 to answer the following questions.

How many songs are in each genre?

# Add your solution here

What are the average values of energy and acousticness in the latin genre in this dataset?

# Add your solution here

Calculate the average duration of songs (in minutes) for each subgenre. Which subgenre has the longest songs on average?

# Add your solution here

Make two side-by-side boxplots of the danceability of songs, stratified by whether a song has a fast or slow tempo. Define a fast tempo as any tempo above the median tempo in the dataset. On average, which songs are more danceable?

Hint: You may find the case_when() function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a mutate() call).

## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)

# A tibble: 3 × 1
       x
   <dbl>
1 -0.825
2  1.26 
3 -0.934

## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
  mutate(is_positive = case_when(
    x >= 0 ~ "Yes",
    x < 0 ~ "No"
  ))

# A tibble: 100 × 2
        x is_positive
    <dbl> <chr>      
 1 -0.825 No         
 2  1.26  Yes        
 3 -0.934 No         
 4 -0.760 No         
 5  0.327 Yes        
 6 -0.282 No         
 7 -0.786 No         
 8 -2.43  No         
 9  1.13  Yes        
10 -0.329 No         
# … with 90 more rows
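
A column created this way can also drive a plot. As a rough sketch on the built-in mtcars data (not the workshop dataset; hp_group is a made-up column name used only for illustration), putting the new column on the x-axis gives one boxplot per group:

```r
library(dplyr)
library(ggplot2)

# Illustration on the built-in mtcars data, not the Spotify data.
# hp_group is a hypothetical column: "high" when hp is above its median, "low" otherwise.
cars_grouped <- mtcars %>%
  mutate(hp_group = case_when(
    hp > median(hp, na.rm = TRUE) ~ "high",
    TRUE ~ "low"
  ))

# One box per level of the derived column
p <- ggplot(cars_grouped, aes(x = hp_group, y = mpg)) +
  geom_boxplot()
p
```

The final TRUE ~ "low" branch is a catch-all for every row not matched by an earlier condition.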

# Add your solution here