Why it’s best to keep software and data analysis repositories separate

data analysis
Author: Stephanie C. Hicks

Published: February 27, 2022

Whenever I am wrapping up a project for work (e.g. writing up a manuscript to submit), it’s often the case that the project as a whole contains both of the following:

  1. A set of scripts (e.g. .Rmd files) that analyzes some specific data.
  2. A software package (e.g. an R package), often generalizing some of the functions from #1, that someone else might be able to use in their own data analysis.

Both of these are incredibly useful for many reasons, but in different ways.

For example, in the first category, the idea is to produce a set of scripts that enables reproducibility of the specific analyses presented in the project (e.g. a peer-reviewed manuscript). The scripts are also a way to provide more detail on the motivation for, or the intentional decisions behind, the choices made in the data analysis. There are a variety of reasons someone down the line, including myself 6 months or 6 years from now, might be interested in the exact code written in this project, including debugging something, wanting to build upon the work, or challenging the work.

In the second category, the idea is that I have identified a piece of code that can be generalized into a function that might be useful to someone else, including myself, for a future project. There are many best practices behind developing R packages, but one that wasn’t very clear to me when I first started writing my own software was:

Software and data analysis repositories are not the same and should be kept in separate places.

The problem?

Let me give you an example of something I’ve seen lately. Someone publishes a pre-print or a paper with a link to a GitHub repository for the foobar software package. I click on the link and see the following structure:


foobar project directory
  |-- .here
  |  o-- object of type(s):file
  |-- README.md
  |  o-- object of type(s):file
  |-- NAMESPACE
  |  o-- object of type(s):file
  |-- DESCRIPTION
  |  o-- object of type(s):file
  |-- LICENSE
  |  o-- object of type(s):file
  |-- foobar.Rproj
  |  o-- object of type(s):file
  |-- R/
  |  o-- object of type(s):dir
  |-- man/
  |  o-- object of type(s):dir
  |-- vignettes/
  |  o-- object of type(s):dir
  |-- preprocessing_for_paper/
  |  o-- object of type(s):dir
  |-- simulations_for_paper/
  |  o-- object of type(s):dir
  |-- real_analyses_for_paper/
  |  o-- object of type(s):dir
  |-- fig_scripts_for_paper/
  |  o-- object of type(s):dir
  o-- data_for_paper/
     o-- object of type(s):dir

The problem with this project directory is that it mixes #1 and #2 above and places everything in one directory. This puts an undue burden on a potential user, who has to sift through two types of code and, depending on what they are interested in using, might have to modify how the software is installed.

How to improve?

A better way to go is to keep these two in separate directories or repositories called foobar and foobar_paper. Specifically, the scripts to reproduce the analyses should be placed in one repository (e.g. called foobar_paper or foobar_project):


foobar paper analysis directory
  |-- .here
  |  o-- object of type(s):file
  |-- README.md
  |  o-- object of type(s):file
  |-- preprocessing_for_paper/
  |  o-- object of type(s):dir
  |-- simulations_for_paper/
  |  o-- object of type(s):dir
  |-- real_analyses_for_paper/
  |  o-- object of type(s):dir
  |-- fig_scripts_for_paper/
  |  o-- object of type(s):dir
  o-- data_for_paper/
     o-- object of type(s):dir

and the software that generalizes some code into functions for others to use should be placed in a different repository (e.g. called foobar):


foobar software directory
  |-- .here
  |  o-- object of type(s):file
  |-- README.md
  |  o-- object of type(s):file
  |-- NAMESPACE
  |  o-- object of type(s):file
  |-- DESCRIPTION
  |  o-- object of type(s):file
  |-- LICENSE
  |  o-- object of type(s):file
  |-- foo_package.Rproj
  |  o-- object of type(s):file
  |-- R/
  |  o-- object of type(s):dir
  |-- man/
  |  o-- object of type(s):dir
  o-- vignettes/
     o-- object of type(s):dir
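
As a minimal sketch of this split, the base R code below creates the two skeletons side by side in a temporary directory. The helper `make_skeleton()` and the location under `tempdir()` are assumptions made for illustration, not part of any package. The file and directory names follow the trees above, except that the `.here` and `.Rproj` files are left out of the software skeleton for brevity.

```r
# Hypothetical helper: create empty files and directories under `root`.
make_skeleton <- function(root, files = character(), dirs = character()) {
  dir.create(root, recursive = TRUE, showWarnings = FALSE)
  for (d in dirs) dir.create(file.path(root, d), showWarnings = FALSE)
  for (f in files) file.create(file.path(root, f))
  invisible(root)
}

# Assumed location for the demo; in practice these are two separate repositories.
base <- file.path(tempdir(), "separate_repos_demo")

# Analysis repository: scripts and data specific to the paper.
paper <- make_skeleton(
  file.path(base, "foobar_paper"),
  files = c(".here", "README.md"),
  dirs = c("preprocessing_for_paper", "simulations_for_paper",
           "real_analyses_for_paper", "fig_scripts_for_paper",
           "data_for_paper")
)

# Software repository: only the pieces that make up the R package.
pkg <- make_skeleton(
  file.path(base, "foobar"),
  files = c("README.md", "NAMESPACE", "DESCRIPTION", "LICENSE"),
  dirs = c("R", "man", "vignettes")
)

# Inspect the two skeletons.
list.files(paper, all.files = TRUE, no.. = TRUE)
list.files(pkg)
```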

Keeping the two separate is good for a few reasons, including:

  1. It keeps the two separate, but equally important, concepts of specific analysis code vs generalized code in different places, with links between the two, if useful.
  2. It keeps the repositories as lightweight as possible.
  3. Installation of the foobar software package is more straightforward.
  4. It enables you to potentially create new generalized functions in the future (i.e. make new software packages) that might have been originally derived from this project.
  5. (Update 2022-03-02): It makes it more likely that the foobar software package is reusable independently of the paper code or context (h/t Rahul Karnik).
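
One way to keep the link between the two repositories explicit is for the analysis scripts to depend on the installed package rather than on files copied from its R/ directory. The snippet below is only a sketch of what the top of a script in foobar_paper might look like. The `<user>` placeholder in the install hint is hypothetical, and `remotes::install_github()` is just one common way to install a package from GitHub.

```r
# Top of a (hypothetical) analysis script in the foobar_paper repository.
# The analysis relies on the separately installed foobar package instead of
# sourcing files from the package's R/ directory.
pkg_name <- "foobar"

# TRUE if the package can be loaded, FALSE otherwise (no error is raised).
have_pkg <- requireNamespace(pkg_name, quietly = TRUE)

if (!have_pkg) {
  # `<user>` is a placeholder for whoever hosts the software repository.
  message(
    "Package '", pkg_name, "' is not installed. ",
    "Install it from its own repository first, e.g. ",
    "remotes::install_github(\"<user>/foobar\")"
  )
}
```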

Alternatives

(Update 2022-03-02): Konrad Rudolph made a great point that there are alternatives to the package approach that enable organizing R code in a modular way without the need to wrap code into a formal R package. For example, using modules from box. The idea is that you can write “modular code by treating files and folders of R code as independent (potentially nested) modules, without requiring the user to wrap reusable code into packages”.
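
To give a rough feel for the module idea, here is a small sketch that writes a one-function module file and imports it with `box::use()`. The file names `helpers.r` and `analysis.r`, the function `center()`, and the temporary directory are all made up for illustration. The call to `box::set_script_path()` reflects my understanding of how to point box at the calling script's location, so check it against the box documentation before relying on it. The import only runs if box is installed.

```r
# Create a small module file; `#' @export` marks names visible to importers.
mod_dir <- file.path(tempdir(), "box_demo")
dir.create(mod_dir, showWarnings = FALSE)
writeLines(
  c("#' @export",
    "center <- function(x) x - mean(x)"),
  file.path(mod_dir, "helpers.r")
)

# Placeholder for the (hypothetical) script that would import the module.
file.create(file.path(mod_dir, "analysis.r"))

centered <- NULL
if (requireNamespace("box", quietly = TRUE)) {
  # Assumption: tell box where the calling script lives so that the relative
  # import below is resolved against mod_dir.
  box::set_script_path(file.path(mod_dir, "analysis.r"))

  # Import the module in helpers.r; its exports are accessed with `$`.
  box::use(./helpers)
  centered <- helpers$center(c(1, 2, 3))
}
```

Here `helpers.r` stays an ordinary file of R code: no DESCRIPTION, NAMESPACE, or installation step is involved.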

Acknowledgements

Thank you to the listdown R package from Michael Kane, which automated the folder structures above.