Project 3

project 3
projects
Exploring album sales and sentiment of lyrics from Beyoncé and Taylor Swift
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

October 4, 2022

Background

Due date: October 21 at 11:59pm

The goal of this assignment is to practice wrangling special data types (including dates, character strings, and factors) and visualizing results while practicing our tidyverse skills.

To submit your project

Please write up your project using R Markdown and processed with knitr. Compile your document as an HTML file and submit your HTML file to the dropbox on Courseplus. Please show all your code (i.e. make sure to set echo = TRUE) for each of the answers to each part.

Load data

The datasets for this part of the assignment comes from TidyTuesday.

Data dictionary avaialble here:

Beyoncé (left) and Taylor Swift (right)

Specifically, we will explore album sales and lyrics from two artists (Beyoncé and Taylor Swift), The data are available from TidyTuesday from September 2020, which I have provided for you below:

b_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
ts_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')

However, to avoid re-downloading data, we will check to see if those files already exist using an if() statement:

library(here)
if(!file.exists(here("data","b_lyrics.RDS"))){
  b_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
  ts_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
  sales <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv')
  
  # save the files to RDS objects
  saveRDS(b_lyrics, file = here("data","b_lyrics.RDS"))
  saveRDS(ts_lyrics, file = here("data","ts_lyrics.RDS"))
  saveRDS(sales, file = here("data","sales.RDS"))
}
Note

The above code will only run if it cannot find the path to the b_lyrics.RDS on your computer. Then, we can just read in these files every time we knit the R Markdown, instead of re-downloading them every time.

Let’s load the datasets

b_lyrics <- readRDS(here("data","b_lyrics.RDS"))
ts_lyrics <- readRDS(here("data","ts_lyrics.RDS"))
sales <- readRDS(here("data","sales.RDS"))

Part 1: Explore album sales

In this section, the goal is to explore the sales of studio albums from Beyoncé and Taylor Swift.

Notes

  • In each of the subsections below that ask you to create a plot, you must create a title, subtitle, x-axis label, and y-axis label with units where applicable. For example, if your axis says “sales” as an axis label, change it to “sales (in millions)”.

Part 1A

In this section, we will do some data wrangling.

  1. Use lubridate to create a column called released that is a Date class. However, to be able to do this, you first need to use stringr to search for pattern that matches things like this “(US)[51]” in a string like this “September 1, 2006 (US)[51]” and removes them. (Note: to get full credit, you must create the regular expression).
  2. Use forcats to create a factor called country (Note: you may need to collapse some factor levels).
  3. Transform the sales into a unit that is album sales in millions of dollars.
  4. Keep only album sales from the UK, the US or the World.
  5. Auto print your final wrangled tibble data frame.
# Add your solution here

Part 1B

In this section, we will do some more data wrangling followed by summarization using wrangled data from Part 1A.

  1. Keep only album sales from the US.
  2. Create a new column called years_since_release corresponding to the number of years since the release of each album from Beyoncé and Taylor Swift. This should be a whole number and you should round down to “14” if you get a non-whole number like “14.12” years. (Hint: you may find the interval() function from lubridate helpful here, but this not the only way to do this.)
  3. Calculate the most recent, oldest, and the median years since albums were released for both Beyoncé and Taylor Swift.
# Add your solution here

Part 1C

Using the wrangled data from Part 1A:

  1. Calculate the total album sales for each artist and for each country (only sales from the UK, US, and World).
  2. Using the total album sales, create a percent stacked barchart using ggplot2 of the percentage of sales of studio albums (in millions) along the y-axis for the two artists along the x-axis colored by the country.
# Add your solution here

Part 1D

Using the wrangled data from Part 1A, use ggplot2 to create a bar plot for the sales of studio albums (in millions) along the x-axis for each of the album titles along the y-axis.

Note:

  • You only need to consider the global World sales (you can ignore US and UK sales for this part).
  • The title of the album must be clearly readable along the y-axis.
  • Each bar should be colored by which artist made that album.
  • The bars should be ordered from albums with the most sales (top) to the least sales (bottom) (Note: you must use functions from forcats for this step).
# Add your solution here

Part 1E

Using the wrangled data from Part 1A, use ggplot2 to create a scatter plot of sales of studio albums (in millions) along the y-axis by the released date for each album along the x-axis.

Note:

  • The points should be colored by the artist.
  • There should be three scatter plots (one for UK, US and world sales) faceted by rows.
# Add your solution here

Part 2: Exploring sentiment of lyrics

In Part 2, we will explore the lyrics in the b_lyrics and ts_lyrics datasets.

Part 2A

Using ts_lyrics, create a new column called line with one line containing the character string for each line of Taylor Swift’s songs.

  • How many lines in Taylor Swift’s lyrics contain the word “hello”? For full credit, show all the rows in ts_lyrics that have “hello” in the line column and report how many rows there are in total.
  • How many lines in Taylor Swift’s lyrics contain the word “goodbye”? For full credit, show all the rows in ts_lyrics that have “goodbye” in the line column and report how many rows there are in total.
# Add your solution here

Part 2B

Repeat the same analysis for b_lyrics as described in Part 2A.

# Add your solution here

Part 2C

Using the b_lyrics dataset,

  1. Tokenize each lyrical line by words.
  2. Remove the “stopwords”.
  3. Calculate the total number for each word in the lyrics.
  4. Using the “bing” sentiment lexicon, add a column to the summarized data frame adding the “bing” sentiment lexicon.
  5. Sort the rows from most frequent to least frequent words.
  6. Only keep the top 25 most frequent words.
  7. Auto print the wrangled tibble data frame.
  8. Use ggplot2 to create a bar plot with the top words on the y-axis and the frequency of each word on the x-axis. Color each bar by the sentiment of each word from the “bing” sentiment lexicon. Bars should be ordered from most frequent on the top to least frequent on the bottom of the plot.
  9. Create a word cloud of the top 25 most frequent words.
# Add your solution here

Part 2D

Repeat the same analysis as above in Part 2C, but for ts_lyrics.

# Add your solution here

Part 2E

Using the ts_lyrics dataset,

  1. Tokenize each lyrical line by words.
  2. Remove the “stopwords”.
  3. Calculate the total number for each word in the lyrics for each Album.
  4. Using the “afinn” sentiment lexicon, add a column to the summarized data frame adding the “afinn” sentiment lexicon.
  5. Calculate the average sentiment score for each Album.
  6. Auto print the wrangled tibble data frame.
  7. Join the wrangled data frame from Part 1A (album sales in millions) with the wrangled data frame from #6 above (average sentiment score for each album).
  8. Using ggplot2, create a scatter plot of the average sentiment score for each album (y-axis) and the album release data along the x-axis. Make the size of each point the album sales in millions.
  9. Add a horizontal line at y-intercept=0.
  10. Write 2-3 sentences interpreting the plot answering the question “How has the sentiment of Taylor Swift’s albums have changed over time?”. Add a title, subtitle, and useful axis labels.
# Add your solution here