Chapter 5 Testing with a reproducible environment
We can take testing a step further by setting up an environment that mimics the one we used to train our model and is set up the same way each time. First we need to set up a lockfile for our package that tells R exactly which packages, including their versions, should be used to reproduce this environment.
We’ll be using the renv package, but a word of warning first: this package is still under active development, so this information may quickly become outdated.
Once you’ve installed renv, open the R project containing your package and run renv::snapshot(). You will be prompted for consent to alter some files in your project and system. This function automatically determines the dependencies used in your package, as well as which packages (and versions) are installed on your system, and records this information in a lockfile.
When you open your project in the future, RStudio will automatically load the packages that are recorded in the snapshot, down to the version number. If the specific versions aren’t available, it will download them and install them from source, or using available binaries. Try restarting your R environment to see this happen. We can trigger this behaviour manually with renv::restore(). To learn more about the renv package, read the introductory article.
Now we need to configure GitHub Actions to use the snapshot we’ve just set up. In the previous action, we used macOS so that we could take advantage of available package binaries for quick installation. This time I’m going to use Ubuntu to replicate my personal Linux environment, which means that packages will be installed from source.
Here’s the workflow that I’m using. Read through it and see if you can work out what’s going on, and then we’ll go through the finer details:
on: [push, pull_request]
name: R-CMD-check
jobs:
  R-CMD-check:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@v1
      - uses: r-lib/actions/setup-r@master
        with:
          r-version: '3.6.1'
      - name: Install libcurl
        run: sudo apt-get install libcurl4-openssl-dev
      - name: Install renv 0.9.2
        run: Rscript -e "install.packages('https://cran.r-project.org/src/contrib/renv_0.9.2.tar.gz', repos = NULL, type = 'source')"
      - name: Install rcmdcheck 1.3.3
        run: Rscript -e "install.packages('https://cran.r-project.org/src/contrib/rcmdcheck_1.3.3.tar.gz', repos = NULL, type = 'source')"
      - name: Check
        run: Rscript -e "renv::restore();rcmdcheck::rcmdcheck(args = '--no-manual', error_on = 'error')"
There are a few differences with this workflow:
- I’ve specified a version of R to install, for better reproducibility.
- I’m installing a package for the operating system, libcurl4-openssl-dev. Because we’re installing R packages from source, we need to be sure that our operating system has all of the tools it needs to compile the source files. This particular package is required to use curl within R; this is pretty important, since if you’re doing anything involving the internet, there’s a good chance you’re using curl somewhere. Your specific package may not need this, but I’m including it here as an example of how to run apt-get in GitHub Actions.
- We’re installing two packages individually here: renv and rcmdcheck. We need renv to restore our lockfile, so we need to specifically install it before we think about package dependencies. Finally, rcmdcheck is not (usually) a package dependency, and we need it to check our package. Note that we’re manually specifying version numbers here for reproducibility.
- Our final step looks a little different. We’re calling renv::restore(), which automatically handles all of our package dependencies. Then we run rcmdcheck as we did before.
On Ubuntu, packages are installed from source. This is much slower; on one of my packages, testing in a reproducible environment on Ubuntu takes 35 minutes, as opposed to 4 minutes. Caching can make life easier, so let’s take a look at that now.
The idea behind caching is to record our installed R packages so that, for future runs, we can have them readily available instead of installing from source. In one of my packages, caching decreased testing time from 35 minutes to 4 minutes. GitHub removes caches that haven’t been accessed for a week, so if you push commits infrequently you may not realise the benefits of caching.
We implement caching by adding two steps to the above YAML:
- We install and use the remotes package to determine the packages that our repository depends upon.
- We cache our R libraries folder, given by the R_LIBS_USER environment variable, using the hash of the dependencies as the cache key. If the package dependencies change, so too does the hash, and our cache becomes invalid.
There’s a third step that GitHub handles automatically — before the job is complete, the cache is created and our installed packages are stored. On future runs, the cache is unpacked. R will see that the packages are installed and won’t install them again from source.
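To make the idea of a hash-based cache key concrete, here is a rough sketch in shell. It is an illustration only: GitHub computes the real key with its own hashFiles() function, so sha256sum merely stands in for it here, and the cache_key helper, the depends.txt file and its contents are all made up for this example. It assumes a Linux-like system where sha256sum is available.

```shell
#!/bin/sh
# Illustration only: sha256sum stands in for GitHub's hashFiles(), and
# cache_key, depends.txt and its contents are hypothetical.

# Build a cache key from a fixed prefix and the hash of a dependency file.
cache_key() {
  printf 'Linux-r-3.6.1-%s\n' "$(sha256sum "$1" | cut -d ' ' -f 1)"
}

workdir=$(mktemp -d)

# Record a (made-up) dependency list and compute its key.
printf 'curl 4.2\n' > "$workdir/depends.txt"
key_before=$(cache_key "$workdir/depends.txt")

# Unchanged dependencies give the same key, so the cache would be reused.
key_same=$(cache_key "$workdir/depends.txt")

# A changed dependency gives a different hash, so the old key no longer matches.
printf 'curl 4.3\n' > "$workdir/depends.txt"
key_after=$(cache_key "$workdir/depends.txt")

echo "before: $key_before"
echo "same:   $key_same"
echo "after:  $key_after"

rm -rf "$workdir"
```

The first two keys printed are identical and the third differs, which is the behaviour we rely on: as long as the dependencies stay the same the key matches an existing cache, and any change produces a new key.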
You can read more about caching in GitHub Actions in the official documentation. An example YAML file is given below:
on: [push, pull_request]
name: R-CMD-check
jobs:
  R-CMD-check:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@v1
      - uses: r-lib/actions/setup-r@master
        with:
          r-version: '3.6.1'
      - name: Install libcurl
        run: sudo apt-get install libcurl4-openssl-dev
      - name: Query dependencies
        run: Rscript -e "install.packages('remotes')" -e "saveRDS(remotes::dev_package_deps(dependencies = TRUE), 'depends.Rds')"
      - name: Cache R packages
        if: runner.os != 'Windows'
        uses: actions/cache@v1
        with:
          path: ${{ env.R_LIBS_USER }}
          key: ${{ runner.os }}-r-3.6.1-${{ hashFiles('depends.Rds') }}
          restore-keys: ${{ runner.os }}-r-3.6.1-
      - name: Install renv 0.9.2
        run: Rscript -e "install.packages('https://cran.r-project.org/src/contrib/renv_0.9.2.tar.gz', repos = NULL, type = 'source')"
      - name: Install rcmdcheck 1.3.3
        run: Rscript -e "install.packages('https://cran.r-project.org/src/contrib/rcmdcheck_1.3.3.tar.gz', repos = NULL, type = 'source')"
      - name: Check
        run: Rscript -e "renv::restore();rcmdcheck::rcmdcheck(args = '--no-manual', error_on = 'error')"