devtools::load_all()
devtools::document()
devtools::check()
devtools::install()Project 3
Background
Due date: December 1 at 11:59pm
The goal of this assignment is to practice some of the skills we have been learning about in class around working with relational databases, machine learning paradigms, and reproducible software development.
In Part 1, you will:
- Use a SQL database to write SQL queries to explore and summarize data using
DBIandRSQLite. - Train a prediction model using
tidymodelsusing the data from the SQL database.
In Part 2, you will:
- Create a small R package and apply some of the functions to simulated data or data from Part 1.
To submit your project
You need to create two private GitHub Classroom repositories. The links to both will be found in CoursePlus. The first link will be the primary project 03 (project03-<username>) repo where you will submit this .qmd analysis file and a rendered .html file. In this .qmd file, you will also submit a link to the second repo (project03pkg-<username>) where you will create a small R package. The reason there are two github repos is because it is easier to use and to install R packages when they are in standalone github repos.
All plots should have titles, subtitles, captions, and human-understandable axis labels. The TAs will clone both repositories and check for completeness, correctness, clarity, and documentation. The second repository will also be installed by the TAs to check the package is installable and functions run without errors.
To check your package for warnings or errors, you can use devtools:
Data
You are welcome to pick your own datasets, but here are some ideas (which you are not limited to):
- nycflights13 dataset – contains data on flights that departed NYC in 2013, including tables for weather, planes, airports, and airlines. This dataset is useful for exploring what causes delays. To practice SQL queries, you could consider aggregations and group by queries across airlines or days of the week. To practice machine learning, you could predict flight delays. Also, here are two related datasets you might interesting too:
- anyflights for flights departing from any airport in any year.
- IMDB dataset – Contains data on movies, including tables for titles, ratings, cast, and crew. This dataset is useful for exploring movie information, reviews, and ratings. To practice SQL queries, you could consider nested queries, joins, filtering, and text-based queries on genres or keywords. To practice machine learning, you could predict movie ratings.
- Lahman Baseball dataset – contains historical baseball data, with tables for teams, players, games, and statistics. To practice SQL queries, you could consider aggregation, and time-based queries to calculate baseball statistics. To practice machine learning, you could predict team performance or predict player statistics, such as the number of runs a team will score.
- World Bank Open datasets – contains extensive economic and social data on countries worldwide, such as population, GDP, education, life expectancy, and health indicators. To practice SQL queries, you could perform time-based queries such as filtering by regions or income levels, and aggregating metrics by country or year. To practice machine learning, you could predict life expectancy.
- You must use or create an SQL database with at least three related tables.
- You will use this data throughout the project.
Useful questions to consider when deciding on the dataset:
- What questions can we ask with SQL?
- What prediction task makes sense for this data?
Part 1
Part 1A: SQL data exploration and visualization
Use DBI and RSQLite to connect to or create a SQL database. Explore and visualize the data and summarize your findings using SQL queries.
Using the data you picked, choose a question to investigate. For example, if you are using the nycflights13 dataset, maybe you want to understand if there are certain days of the week that flights are more delayed than other days of the week. Describe what is the question you aim to answer with the data and what you want to visualize.
Create or connect to the database. If it is not already a SQL database, create a database and add at least 3 tables using
dbWriteTable().Write three SQL queries to explore and summarize the data. With the
dbGetQuery()function, use at least five different SQL functions (e.g.SELECT,GROUP BY,DISTINCT,SUM,AVG,JOINetc).Write two SQL queries to create new features inside the database. For example, if using
nycflights13, you could think about how to- Bin departure times into time-of-day categories (e.g., morning, afternoon, evening).
- Lag features like the previous day’s average delay by carrier, which can be helpful for predictions.
- Merge additional weather data (such as hourly temperature, precipitation, etc., if available). e.g. this could be done using SQL joins.
Visualize your results using
ggplot2. For example, if usingnycflights13, you could think about how to visualize delays by month, carrier, or weather conditions.Summarize your methods and key findings in a short paragraph. What did you learn from your SQL exploration? Include any limitations or potential biases in your analysis. Be sure to comment and organize your code so is easy to understand what you are doing.
At the end of the data analysis, list out each of the SQL functions you used to help the TA with respect to making sure you met all the requirements described above.
## add your code here Part 1B: Building a machine learning model
Use tidymodels to train and evaluate a predictive model for a task or outcome, perform model evaluation, and summarize your findings. For this part, you can use the data directly in R as standard data.frames or tibbles as input to the models.
- Split the data into training and test sets. For example, you can use a simple split into 80% and 20% or you can also use cross-validation approaches, if you wish.
- Choose an appropriate machine learning model appropriate for the task (e.g. consider if it’s a binary or continuous outcome and choose an appropriate model). For example, if you are using the
nycflights13dataset, you could could pick- Logistic Regression (if predicting delay as a binary outcome, e.g., delay/no delay).
- Linear Regression (if predicting the length of delay in minutes).
- Decision Trees or Random Forests (for both binary classification or regression).
- Train the model
tidymodelsand and evaluate it using an appropriate evaluation metric, such as (i) accuracy or AUC for classification, (ii) root mean squared error (RMSE) for regression. Show this for both the training and the test data. - Visualize model results or feature importance.
- Summarize your methods and and key findings. For example, if you are predicting flight delays, provide 1-2 practical recommendations for reducing delays. What features matter most? How well does the model perform? What could improve performance? Be sure to comment and organize your code so is easy to understand what you are doing.
## add your code here Part 2
In this part of the project, we will practice building a small R package with simple functions. In addition to constructing the functions in the package, you will also demonstrate applying your functions with the data from Part 1.
- The link to create a private GitHub repository for yourself to complete Part 2 will be posted in CoursePlus.
- Build your R package locally and then push the files to the private Github repository that you created for yourself via GitHub Classroom.
- The TA will grade the R package by cloning the repository, installing it, and checking for all the things described below. It must be installable without any errors.
Link to your R package on Github classroom: [Delete this text and replace the text with the link to the private GitHub repo you created above for your package]
Part 2A: Cosine and sine transformation
The cosine and sine of a number can be written as an infinite series expansion of the form
\[ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} \cdots \]
\[ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} \cdots \]
Write two functions that compute the cosine and sine (respectively) of a number using the truncated series expansion. Each function should take two arguments:
x: the number to be transformedk: the number of terms to be used in the series expansion beyond the constant 1. The value ofkis always \(\geq 1\).
- You can assume that the input value
xwill always be a single number. - You can assume that the value
kwill always be an integer \(\geq 1\). - Do not use the
cos()orsin()functions in R.
fn_cos <- function(x, k) {
# Add your solution here
}
fun_sin <- function(x, k) {
# Add your solution here
}Part 2B: Calculating confidence intervals
Write the following set of functions:
sample_mean(), which calculates the sample mean
\[ \bar{x} = \frac{1}{N} \sum_{i=1}^n x_i \]
sample_sd(), which calculates the sample standard deviation
\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} \]
calculate_CI(), which calculates the confidence intervals of a sample mean and returns a named vector of length 2, where the first value is thelower_bound, the second value is theupper_bound.
\[ \bar{x} \pm t_{\alpha/2, N-1} s_{\bar{x}} \]
- You can assume that the input value
xwill always be a vector of numbers of length N. - Do not use the
mean()andsd()functions in R.
sample_mean <- function(x) {
# Add your solution here
}
sample_sd <- function(x) {
# Add your solution here
}
calculate_CI <- function(x, conf = 0.95) {
# Add your solution here
}Part 2C: Put functions into an R package
Create an R package for the functions you wrote from Part 2A and 2B. Your package will have three exported functions for users to call (see below). You will need to write documentation for each function that you export using roxygen2 comments (e.g., #' @param, #' @return, #' @examples).
Your package should include the functions:
fn_cos(), which computes the approximation to the cosine function (exported)fn_sin(), which computes the approximation to the sine function (exported)sample_mean(), which calculates the sample mean (not exported)sample_sd(), which calculates the sample standard deviation (not exported)calculate_CI(), which calculates the confidence intervals from simulated data (exported)
- Remember that you should only export the functions that you want the user to use.
- Functions that are not exported do not require any documentation.
- Each exported function should have at least one example of its usage (using the
@exampledirective in the documentation). - In the functions in your package, consider using control structures and include checks (e.g.
is.na(),is.numeric(),if()) to make sure the input is as you expect it to be. For example, try to break the the function with unexpected values that a user might provide (e.g. providing a negative value to a log transformation). This can help guide you on ways to address the possible ways to break the function. - Your package should be installable without any warnings or errors.
Part 2D: Create a README.md file
Create a README.md file in the R package, which will be useful to readers when they learn about your package. The readme must include:
- The title of package
- The author of the package
- A goal / description of the package
- A list of exported functions that are in the package. Briefly describe each function.
- A basic example with one of the functions.
You might find the use_readme_md() function from the usethis R package helpful.
Part 2E: Create a vignette
In this part, you will create a vignette where you demonstrate the functions in your R package. Specifically, you will create a .qmd file and put it in a folder called /vignettes within your R package. The purpose of a vignette is to demonstrate the functions of your package in a longer tutorial instead of just short examples within the documentation of your functions (i.e. using the @example directive in the documentation).
You might find the use_vignette() function from the usethis R package helpful.
Demonstrate fn_cos()
In the vignette, make a plot and show the output of your function fn_cos(x,k) and how it approximates the cos(x) function from base R as \(k\) increases.
- The x-axis should range between 0 and 10.
- The y-axis should be the output from
fn_cos(x,k)orcos(x). - Plot the output from
cos(x)as points on the graph. - Plot the output from
fn_cos(x,k)as lines on the graph. - Show 5 lines for values
k= 1, 3, 5, 7, 9. Each line should be a different color.
Demonstrate fn_sin()
Repeat a similar task and make a similar plot as above, but here using fn_sin() instead of fn_cos().
Demonstrate calculate_CI()
The goal here is to demonstrate the calculate_CI() function in your package inside the vignette. You can use simulated data or some subset of your data used in Part 1.