Setting up a reproducible data analysis project in R

featuring GitHub, {renv}, {targets} and more

Olivia Angelin-Bonnet

The New Zealand Institute for Plant and Food Research

20 August 2024

The project

A typical data science project would include:

  • cleaning the data
  • creating some visualisations for EDA
  • fitting a model
  • generating a report for the stakeholders  

We will analyse the palmerpenguins dataset: size measurements for three penguin species observed in the Palmer Archipelago (Antarctica)

Artwork by @allison_horst

Outline

Project setup

  • Creating a GitHub repository

  • Turning it into an R project

  • Initialising {renv} for the project

  • Setting up the directory structure

  • Populating the README

Tidying code

  • Modularising code into functions

  • Turning code into a {targets} pipeline

Bonus

  • Setting up testing of input data with {assertr}

  • Automating report rendering with Quarto + {targets}

Let’s get started!

Creating a new GitHub repository

Many ways to do this!

  • create a new repository online, from github.com

  • create a new folder locally, then push it online

  • use a GitHub GUI (e.g. GitHub Desktop)

 

Reference

For more information, refer to Happy Git and GitHub for the useR (chapters 14 to 16).

Creating a new GitHub repository

Cloning the repository

If you created the repository online, you need to clone it to your machine – again, several ways to do so.

Need to get our git credentials sorted!

Using the usethis::create_from_github() function:

  • clones the repository to a directory of your choice

  • creates an RStudio project in the cloned directory

  • updates the .gitignore file

  • opens the RStudio project in a new RStudio session
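
For example, a minimal call could look like this (a sketch – the repository owner and destination directory below are placeholders):

# if Git credentials are not sorted yet, gitcreds::gitcreds_set() can help
usethis::create_from_github(
  "your-username/palmerpenguins_analysis",
  destdir = "~/Desktop/GitHub"
)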

Cloning the repository

Initialising {renv}

The {renv} package helps to create reproducible environments for R projects:

  • records which packages and which versions we are using in the project

  • isolates the package library used in the project from the rest of the computer

 

Set up {renv} in our directory by running renv::init():

  • creates a renv.lock file that records all dependencies (needs to be tracked on GitHub)

  • creates a renv/ folder, inside which packages used in the project will be installed (will not be tracked on GitHub)

Initialising {renv}

Using {renv} in four commands

  • To install a package: renv::install("tidyverse")

  • Before every commit:

    • check whether lock file is up-to-date: renv::status()

    • if not, record the new dependencies: renv::snapshot()

  • To install dependencies recorded in the lockfile: renv::restore()
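
Put together, a typical {renv} session could look like this (a minimal sketch; the package name is an example):

# install a package into the project library
renv::install("tidyverse")

# before committing: check whether renv.lock is up to date...
renv::status()

# ...and record any new dependencies
renv::snapshot()

# on a fresh clone of the repository: reinstall the recorded package versions
renv::restore()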

Setting up directory structure

Always follow the same organisation – it makes it easy for people to understand and find things:

  • data/: where the data necessary for the project will live (if using local version of data)
  • analysis/: where to put the analysis scripts
  • R/: where to put scripts with helper functions
  • output/: where to save generated tables, figures, etc.
  • reports/: where to put the files that will be used to generate reports
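
These folders can be created in one go from the console (a sketch using the {fs} package; base R's dir.create() works just as well):

fs::dir_create(c("data", "analysis", "R", "output", "reports"))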

References

For more information about organising your R project folders, see Marwick et al. (2017) “Packaging Data Analytical Work Reproducibly Using R (and Friends)” and the {rrtools} package.

Setting up directory structure

After creating the folders, here is what our directory looks like:

.
├── .git/
│   └── ...
├── .Rproj.user/
│   └── ...
├── renv/
│   └── ...
├── data/
│   └── penguins_raw.csv
├── analysis/
├── R/
├── reports/
├── output/
├── README.md
├── renv.lock
├── palmerpenguins_analysis.Rproj
├── .Rprofile
└── .gitignore

Populating the README

README should contain a minimal set of information about the project:

  • a brief description of the project: context, overall aim, some key information about the experiment and the analyses intended/performed

  • a list of key contributors and their roles, as bullet points

  • information about the input data: where it is stored, who provided it, what it contains

  • what analyses were performed, and which files were generated as a result (e.g. cleaned version of the data)

  • a list of the key folders and files

  • which sets of commands to run if you want to reproduce the analysis

Populating the README

README.md
# Analysis of penguins measurements

This project aims to understand the differences in size between three species of penguins (Adelie,
Chinstrap and Gentoo) observed in the Palmer Archipelago, Antarctica. The data was collected by Dr Kristen
Gorman between 2007 and 2009.

## Key contributors

- Jane Doe: project leader

- Kristen Gorman: data collection

- Olivia Angelin-Bonnet: data analysis

## Input data

The raw penguin measurements, stored in the `penguins_raw.csv` file, were downloaded from the 
[`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/index.html).

Data source: [...]

## Analysis

- Data cleaning: a cleaned version of the dataset was saved as `penguins_cleaned.csv` on
[OneDrive](path/to/OneDrive/folder).

- *To be filled*

## Repository content

- `renv.lock`: list of packages used in the analysis (and their version)
- `analysis/`: collection of `R` scripts containing the analysis code
- `R/`: collection of `R` scripts containing helper functions used for the analysis
- `reports/`: collection of `Quarto` documents used to generate the reports

## How to reproduce the analysis

```{r}
# Install the necessary packages
renv::restore()
```

Commit and push

Start coding!

I have written some code…

analysis/first_script.R
library(tidyverse)
library(janitor)
library(ggbeeswarm)
library(here)

# Reading and cleaning data
penguins_df <- read_csv(here("data/penguins_raw.csv"), show_col_types = FALSE) |> 
  clean_names() |> 
  mutate(
    species = word(species, 1),
    year = year(date_egg),
    sex = str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    across(where(is.character), as.factor)
  ) |> 
  select(
    species,
    island,
    year,
    sex,
    body_mass_g,
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm,
    flipper_length_mm
  ) |> 
  drop_na()

## Violin plot of body mass per species and sex
penguins_df |> 
  ggplot(aes(x = species, colour = sex, fill = sex, y = body_mass_g)) +
  geom_violin(alpha = 0.3, scale = "width") +
  geom_quasirandom(dodge.width = 0.9) +
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()

## Violin plot of flipper length per species and sex
penguins_df |> 
  ggplot(aes(x = species, colour = sex, fill = sex, y = flipper_length_mm)) +
  geom_violin(alpha = 0.3, scale = "width") +
  geom_quasirandom(dodge.width = 0.9) +
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()

## Scatter plot of bill length vs depth, with species and sex
penguins_df |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, colour = species, shape = sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Set2") +
  theme_minimal()

I have written some code…

Good start, but often:

  • many more steps in the analysis → script becomes long and convoluted

  • harder to get an overview of the analysis, and to find things

  • don’t want to re-run all the code every time I make a change

Solution

Turn the code into a {targets} pipeline!

Step 1: Turn your code into functions

From:

# Reading and cleaning data
penguins_df <- read_csv(
  here("data/penguins_raw.csv"), 
  show_col_types = FALSE
) |> 
  clean_names() |> 
  mutate(
    species = word(species, 1),
    year = year(date_egg),
    sex = str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    across(where(is.character), as.factor)
  ) |> 
  select(
    ## all relevant columns
  ) |> 
  drop_na()

To:

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
  janitor::clean_names() |> 
  dplyr::mutate(
    species = stringr::word(species, 1),
    year = lubridate::year(date_egg),
    sex = stringr::str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    dplyr::across(
      dplyr::where(is.character), 
      as.factor
    )
  ) |> 
  dplyr::select(
    ## all relevant columns
  ) |> 
  tidyr::drop_na()
}

Step 1: Turn your code into functions

Don’t forget to document your functions! ({roxygen2}-style)

#' Read and clean data
#' 
#' Reads in the penguins data, renames and selects relevant columns. The
#' following transformations are applied to the data: 
#' * only keep species common name
#' * extract observation year
#' * remove rows with missing values
#' 
#' @param file Character, path to the penguins data .csv file.
#' @returns A tibble.
read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
  janitor::clean_names() |> 
  dplyr::mutate(
    ## modifying columns
  ) |> 
  dplyr::select(
    ## all relevant columns
  ) |> 
  tidyr::drop_na()
}

Step 1: Turn your code into functions

Improved script:

R/helper_functions.R
#' Read and clean data
#' 
#' ...
read_data <- function(file) { ... }

#' Violin plot of variable per species and sex
#' 
#' ...
violin_plot <- function(df, yvar) { ... }

#' Scatter plot of bill length vs depth
#' 
#' ...
plot_bill_length_depth <- function(df) { ... }
analysis/first_script.R
library(here)

source(here("R/helper_functions.R"))

penguins_df <- read_data(
  here("data/penguins_raw.csv")
)

body_mass_plot <- violin_plot(
  penguins_df, 
  body_mass_g
)

flipper_length_plot <- violin_plot(
  penguins_df, 
  flipper_length_mm
)

bill_scatterplot <- plot_bill_length_depth(
  penguins_df
)
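
As an aside, the elided violin_plot() could look like this, using {{ }} to pass the column name into ggplot2::aes() (a sketch consistent with the plotting code shown earlier):

violin_plot <- function(df, yvar) {
  df |> 
    ggplot2::ggplot(
      ggplot2::aes(x = species, colour = sex, fill = sex, y = {{ yvar }})
    ) +
    ggplot2::geom_violin(alpha = 0.3, scale = "width") +
    ggbeeswarm::geom_quasirandom(dodge.width = 0.9) +
    ggplot2::scale_colour_brewer(palette = "Set1") +
    ggplot2::scale_fill_brewer(palette = "Set1") +
    ggplot2::theme_minimal()
}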

Step 2: Turn your main script into a {targets} pipeline!

From:

library(here)

source(here("R/helper_functions.R"))

penguins_df <- read_data(here("data/penguins_raw.csv"))

body_mass_plot <- violin_plot(penguins_df, body_mass_g)

flipper_length_plot <- violin_plot(penguins_df, flipper_length_mm)

bill_scatterplot <- plot_bill_length_depth(penguins_df)

Step 2: Turn your main script into a {targets} pipeline!

To:

library(targets)
library(here)

source(here("R/helper_functions.R"))

list(
  tar_target(penguins_raw_file, here("data/penguins_raw.csv"), format = "file"),
  
  tar_target(penguins_df, read_data(penguins_raw_file)),

  tar_target(body_mass_plot, violin_plot(penguins_df, body_mass_g)),

  tar_target(flipper_length_plot, violin_plot(penguins_df, flipper_length_mm)),

  tar_target(bill_scatterplot, plot_bill_length_depth(penguins_df))
)
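
Before running anything, tar_manifest() (part of {targets}) gives a quick overview of the targets defined in the pipeline – run it in the console:

targets::tar_manifest()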

Step 2: Turn your main script into a {targets} pipeline!

Aside: where should my targets script live?

  • by default, the pipeline script lives in a _targets.R file in the main directory

  • to choose a custom folder and file name, we need to set the {targets} configuration

 

In the console, run (from the main directory):

targets::tar_config_set(script = "analysis/_targets.R", store = "analysis/_targets")

This will create a _targets.yaml file:

_targets.yaml
main:
  script: analysis/_targets.R
  store: analysis/_targets
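
The active configuration can be double-checked from the console with tar_config_get():

targets::tar_config_get("script")
targets::tar_config_get("store")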

Visualise your pipeline

In the console, run:

targets::tar_visnetwork()

Execute your pipeline

In the console, run:

targets::tar_make()

here() starts at C:/Users/hrpoab/Desktop/GitHub/palmerpenguins_analysis
> dispatched target penguins_raw_file
o completed target penguins_raw_file [0 seconds]
> dispatched target penguins_df
o completed target penguins_df [0.85 seconds]
> dispatched target body_mass_plot
o completed target body_mass_plot [0.16 seconds]
> dispatched target bill_scatterplot
o completed target bill_scatterplot [0.02 seconds]
> dispatched target flipper_length_plot
o completed target flipper_length_plot [0.02 seconds]
> ended pipeline [1.31 seconds]

Get the pipeline results

targets::tar_read(bill_scatterplot)
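
tar_read() returns the stored object, so it can be used like any other R object – for example, to save the figure into the output/ folder (a sketch; the file name and dimensions are arbitrary):

ggplot2::ggsave(
  here::here("output/bill_scatterplot.png"),
  plot = targets::tar_read(bill_scatterplot),
  width = 7,
  height = 5
)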

Change in a step

Hi Olivia,

Great work! Just a minor comment, could you change the colours in the bill length/depth scatter-plot? It’s hard to see the difference between the species.

 

R/helper_functions.R
plot_bill_length_depth <- function(df) {
  df |> 
    ggplot2::ggplot(
      ggplot2::aes(
        x = bill_length_mm, 
        y = bill_depth_mm, 
        colour = species, 
        shape = sex
        )
    ) +
    ggplot2::geom_point() +
    ggplot2::scale_colour_brewer(palette = "Set2") +
    ggplot2::theme_minimal()
}

Change in a step

Hi Olivia,

Great work! Just a minor comment, could you change the colours in the bill length/depth scatter-plot? It’s hard to see the difference between the species.

 

R/helper_functions.R
plot_bill_length_depth <- function(df) {
  df |> 
    ggplot2::ggplot(
      ggplot2::aes(
        x = bill_length_mm, 
        y = bill_depth_mm, 
        colour = species, 
        shape = sex
        )
    ) +
    ggplot2::geom_point() +
    ggplot2::scale_colour_brewer(palette = "Set1") +
    ggplot2::theme_minimal()
}

Change in a step

targets::tar_visnetwork()

Change in a step

targets::tar_make()

here() starts at C:/Users/hrpoab/Desktop/GitHub/palmerpenguins_analysis
v skipped target penguins_raw_file
v skipped target penguins_df
v skipped target body_mass_plot
> dispatched target bill_scatterplot
o completed target bill_scatterplot [0.35 seconds]
v skipped target flipper_length_plot
> ended pipeline [0.53 seconds]

Change in a step

targets::tar_read(bill_scatterplot)

Change in the data

Hi Olivia,

Oopsie! We realised there was a mistake in the original data file. Here is the updated spreadsheet, could you re-run the analysis with this version?

targets::tar_visnetwork()

Change in the data – using {assertr} for data checking

Problem: a change in the data format (column names or types, range of values, etc.) might cause the pipeline to fail

Solution

Use {assertr} to check that assumptions about the data are correct!

Using {assertr} for data checking

 

Adding the following checks:

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?
read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?

  • Is the body mass already an integer?

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    assertr::assert(rlang::is_integerish, body_mass_g) |>
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?

  • Is the body mass already an integer?

  • Are all flipper length values positive (after removing NAs)?

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    assertr::assert(rlang::is_integerish, body_mass_g) |>
    ## data cleaning code
    assertr::verify(flipper_length_mm > 0)
}
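
The last check could also be written with one of {assertr}'s built-in predicates – note that within_bounds() includes its bounds by default, so the lower bound is excluded explicitly here (a sketch):

assertr::assert(
  assertr::within_bounds(0, Inf, include.lower = FALSE),
  flipper_length_mm
)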

Note

More on data checking in Danielle Navarro’s blog post on Four ways to write assertion checks in R.

Writing a report with Quarto

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguins measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format:
  docx:
    number-sections: true
---

```{r setup}
#| include: false

library(knitr)

opts_chunk$set(echo = FALSE)
```

This project aims to understand the differences in size between three species of penguins (Adelie, Chinstrap and Gentoo) observed in the Palmer Archipelago, Antarctica, using data collected by Dr Kristen Gorman between 2007 and 2009.

## Distribution of body mass and flipper length

@fig-body-mass shows the distribution of body mass (in grams) across the three penguin species. We can see that, on average, the Gentoo penguins are the heaviest, with Adelie and Chinstrap penguins more similar in terms of body mass. Within a species, the females are on average lighter than the males.

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Similarly, Gentoo penguins have the longest flippers on average (@fig-flipper-length), and Adelie penguins the shortest. Again, within a species, females have shorter flippers on average than males.

```{r fig-flipper-length}
#| fig-cap: "Distribution of penguin flipper length (mm) across species and sex."

# code for plot
```


## Association between bill length and depth

In this dataset, bill measurements refer to measurements of the culmen, which is the upper ridge of the bill. There is a clear relationship between bill length and depth, but it is masked in the dataset by differences between species (@fig-bill-scatterplot), with Gentoo penguins exhibiting longer but shallower bills, and Adelie penguins shorter and deeper bills.

```{r fig-bill-scatterplot}
#| fig-cap: "Scatterplot of penguin bill length and depth."

# code for plot
```

 

Writing a report – Quarto + {targets}

Two advantages of using a Quarto document alongside {targets}:

  • can read results from the {targets} pipeline inside the report: no computation is done during report generation

  • can add the rendering of the report as a step in the pipeline: ensures that the report is always up-to-date

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguins measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)

opts_chunk$set(echo = FALSE)
```

This project aims to understand the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Two steps to use the {targets} pipeline results in a Quarto document:

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguins measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)
library(here)

opts_chunk$set(echo = FALSE)
opts_knit$set(root.dir = here())
```

This project aims to understand the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Two steps to use the {targets} pipeline results in a Quarto document:

  • Make sure the report ‘sees’ the project root directory

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguins measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)
library(here)
library(targets)

opts_chunk$set(echo = FALSE)
opts_knit$set(root.dir = here())
```

This project aims to understand the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

tar_read(body_mass_plot)
```

Two steps to use the {targets} pipeline results in a Quarto document:

  • Make sure the report ‘sees’ the project root directory

  • Read targets objects with targets::tar_read()

Writing a report – Quarto + {targets}

Adding the Quarto report as a step in the pipeline (need {tarchetypes} and {quarto}):

analysis/_targets.R
library(targets)
library(tarchetypes)
library(here)

source(here("R/helper_functions.R"))

list(
  tar_target(
    penguins_raw_file, 
    here("data/penguins_raw.csv"), 
    format = "file"
  ),
  tar_target(
    penguins_df, 
    read_data(penguins_raw_file)
  ),
  # etc
  tar_quarto(
    report, 
    here("reports/palmerpenguins_report.qmd")
  )
)

     

{targets} – how to reproduce the analysis?

README.md
## How to reproduce the analysis

```{r}
# Install the necessary packages
renv::restore()

# Run the analysis pipeline
targets::tar_make()
```

Conclusion

Creating reproducible R code for data science projects

  • The effort put into making a project tidy and reproducible depends on its size, context, etc.

  • Often, simple projects turn into something more! It pays to make things tidy early on

  • Following the guidelines exactly is less important than following a system that:

    • is consistent across projects
    • is easily understandable by the next users (future you, colleagues, collaborators)
    • safeguards against mistakes (e.g. the wrong version of the data used for a plot)
    • improves reproducibility

Further reading and inspiration

Thank you for your attention!

olivia.angelin-bonnet@plantandfood.co.nz

Presentation disclaimer

Presentation for
ECSSN and NZSA Joint Webinar, 20 August 2024

Publication data:
Angelin-Bonnet O. September 2024. Setting up a reproducible data analysis project in R featuring GitHub, {renv}, {targets} and more. A Plant & Food Research PowerPoint presentation. SPTS No. 26044.

Presentation prepared by:
Olivia Angelin-Bonnet
Scientist, Data Science
September 2024

Presentation approved by:
Mark Wohlers
Science Group Leader, Data Science
September 2024

For more information contact:
Olivia Angelin-Bonnet
DDI: +64 6 355 6156
Email: Olivia.Angelin-Bonnet@plantandfood.co.nz

 

This report has been prepared by The New Zealand Institute for Plant and Food Research Limited (Plant & Food Research).
Head Office: 120 Mt Albert Road, Sandringham, Auckland 1025, New Zealand, Tel: +64 9 925 7000, Fax: +64 9 925 7001.
www.plantandfood.co.nz

 

DISCLAIMER

The New Zealand Institute for Plant and Food Research Limited does not give any prediction, warranty or assurance in relation to the accuracy of or fitness for any particular use or application of, any information or scientific or other result contained in this presentation. Neither The New Zealand Institute for Plant and Food Research Limited nor any of its employees, students, contractors, subcontractors or agents shall be liable for any cost (including legal costs), claim, liability, loss, damage, injury or the like, which may be suffered or incurred as a direct or indirect result of the reliance by any person on any information contained in this presentation.

© COPYRIGHT (2024) The New Zealand Institute for Plant and Food Research Limited. All Rights Reserved. No part of this report may be reproduced, stored in a retrieval system, transmitted, reported, or copied in any form or by any means electronic, mechanical or otherwise, without the prior written permission of The New Zealand Institute for Plant and Food Research Limited. Information contained in this report is confidential and is not to be disclosed in any form to any party without the prior approval in writing of The New Zealand Institute for Plant and Food Research Limited. To request permission, write to: The Science Publication Office, The New Zealand Institute for Plant and Food Research Limited – Postal Address: Private Bag 92169, Victoria Street West, Auckland 1142, New Zealand; Email: SPO-Team@plantandfood.co.nz.