# A more realistic example of `SpaDES` {#sec-part2}
Authors: Ceres Barros, Alex M. Chubaty
```{r}
#| echo: false
#| cache: false
source("_common.R")
```
```{r setup}
#| include: false
#| cache: false
projDir <- getwd() ## save this for later
rmdDir <- "~/SpaDES4Dummies_Part2" ## same as mainPath below.
dir.create(rmdDir)
knitr::opts_knit$set(
root.dir = rmdDir
)
options("knitr.graphics.rel_path" = FALSE,
"spades.useRequire" = FALSE)
```
In [Chapter 1](Part1_DummyModel.qmd) of this guide, we described how to create new `SpaDES` modules, their different components, how to link different modules and how to set up and run a simulation.
Here, we assume that you are familiar with these steps, but go further in showing important `SpaDES` features that facilitate many of the steps common to most ecological modelling exercises.
For the sake of simplicity, we focus our example on projecting a species' distribution as a function of climate covariates.
Yet, the true power of `SpaDES` is more evident when using complex dynamic simulation models parametrised using large datasets and run across large spatial areas.
This example is broken into four main parts: [Module creation and coding]; [Running the model]; [Caching]; and [Best practices].
By no means does it cover caching or best practices in full, as each of these topics is very extensive, but it highlights some of their essentials in `SpaDES` and from our own experience.\
\
The full R script used to run the example can be found in @sec-scriptPart2.
## The example: projecting species distribution shifts under climate change
Species distribution models (SDMs) have been widely used in ecology to predict how species presences and absences across a landscape may change under changing environmental conditions.
As a result, there are several R packages that have been built with this in mind [e.g. `dismo` @dismo; `biomod2` @biomod2] and many ecologists do these analyses exclusively in R.
Often, these analyses are run only once for a given set of species, baseline and projected environmental conditions, and researchers will have a few R scripts (or maybe just one longer script) that load the data into R, do any necessary pre-processing steps, fit the models and run species distribution projections.
The usefulness of `SpaDES` comes when we want an automated and standardized workflow that can be easily updated with new data and adapted with new algorithms.
`SpaDES` provides a common standard and a modular approach to modelling that facilitate expanding, debugging and sharing code.
It also provides tools that bring well-known best practices from computer- and data-science workflows (reproducibility, modularity, caching) to the realm of ecological modelling, so that non-computer-scientists can adopt them with minimal learning.
In an SDM project, this means that updating data and algorithms, and automating iterative forecasting, become easier and less prone to errors.
When `SpaDES` modules are open and shared, this also expands the potential pool of users who can themselves help improve the code.
------------------------------------------------------------------------
## Module creation and coding
With the above in mind, in this example we created three modules that source and pre-process data ('data modules') and a module that fits an SDM and iteratively forecasts species distributions (we call it a 'simulation module', although the simulation only involves predicting from a statistical model).
The idea is that we could, for instance, provide different data sources to one of the data modules and only update the parts of the simulation that are affected by this change (i.e. presumably the other data module steps will not be affected).
Or, we could develop a second simulation module using a different SDM approach and swap the two modules to inspect which provides better predictions.
Our data modules are *speciesAbundanceData* and *climateData*.
The simulation module is *projectSpeciesDist*.
We start by creating an `.R` script to set up and control the simulation.
In this example this script is called `Part2_SDMs.R`.
The script begins with a few lines of code that ensure the necessary set-up packages are installed and loaded (see [Reproducible package installation]).
It then defines the necessary folder directories for the simulation and creates the modules in the `modules/` folder.
```{r globalscriptPart2}
#| eval: false
#| message: false
#| warning: false
#| results: hide
#| class-source: fold-show
#| cache: false
options(repos = c("https://predictiveecology.r-universe.dev/",
CRAN = "https://cloud.r-project.org"))
if (getRversion() < "4.2.1") {
warning(paste("dismo::maxent may create a fatal error",
"when using R version < v4.2.1 and from RStudio.\n",
"Please upgrade R, or run this script outside of RStudio.\n",
"See https://github.com/rspatial/dismo/issues/13"))
}
## decide where you're working
mainPath <- file.path("~/SpaDES4Dummies_Part2")
pkgPath <- file.path(mainPath, "packages", version$platform,
paste0(version$major, ".", strsplit(version$minor, "[.]")[[1]][1]))
dir.create(pkgPath, recursive = TRUE)
.libPaths(pkgPath, include.site = FALSE) ## install packages in project library (proj-lib)
if (!"remotes" %in% installed.packages(lib.loc = pkgPath))
install.packages("remotes")
if (!"Require" %in% installed.packages(lib.loc = pkgPath) ||
packageVersion("Require", lib.loc = pkgPath) < "0.3.1.9015") {
remotes::install_github("PredictiveEcology/Require@2788b023ad191c29346ef8c64df71b937be307e2",
upgrade = FALSE)
}
## Notes:
## 1) if you are working from RStudio and have an older version of base packages like `Rcpp`, `rlang`
## (and others) installed, you may need to run the following lines (and code above) directly from R
## in order to update these base packages
## 2) Please ensure the appropriate Rtools version is installed (see)
## there seems to be a problem with `ragg` and a forced install solves it
if (!"ragg" %in% installed.packages(lib.loc = pkgPath)) {
install.packages("ragg")
}
Require::Require(c("PredictiveEcology/SpaDES.project@transition (HEAD)",
"PredictiveEcology/SpaDES.core@master (HEAD)",
## these will be needed later on:
"ggpubr",
"geodata",
"SpaDES.tools",
"PredictiveEcology/SpaDES.experiment@75d917b70b892802fed0bbdb2a5e9f3c6772f0ba"),
require = FALSE, ## don't load packages yet
upgrade = FALSE, standAlone = TRUE)
Require::Require("SpaDES.core", install = FALSE) ## load only
setPaths(cachePath = file.path(mainPath, "cache"),
inputPath = file.path(mainPath, "inputs"),
modulePath = file.path(mainPath, "modules"),
outputPath = file.path(mainPath, "outputs"))
simPaths <- getPaths() ## check that this is what you wanted
## Let's create a self-contained module that will simulate the species' abundance for any given period of time and frequency.
if (!dir.exists(file.path(simPaths$modulePath, "speciesAbundanceData"))) {
newModule(name = "speciesAbundanceData", path = simPaths$modulePath)
}
if (!dir.exists(file.path(simPaths$modulePath, "climateData"))) {
newModule(name = "climateData", path = simPaths$modulePath)
}
if (!dir.exists(file.path(simPaths$modulePath, "projectSpeciesDist"))) {
newModule(name = "projectSpeciesDist", path = simPaths$modulePath)
}
```
```{r globalscriptPart2Hidden}
#| include: false
#| cache: false
## for faster rendering use all .libPaths to avoid installing all packages.
options(repos = c("https://predictiveecology.r-universe.dev/",
CRAN = "https://cloud.r-project.org"))
## decide where you're working
mainPath <- file.path("~/SpaDES4Dummies_Part2")
pkgPath <- file.path(mainPath, "packages", version$platform,
paste0(version$major, ".", strsplit(version$minor, "[.]")[[1]][1]))
dir.create(pkgPath, recursive = TRUE)
.libPaths(pkgPath)
if (!"remotes" %in% installed.packages(lib.loc = pkgPath))
install.packages("remotes")
if (!"Require" %in% installed.packages(lib.loc = pkgPath) ||
packageVersion("Require", lib.loc = pkgPath) < "0.3.1.9015") {
remotes::install_github("PredictiveEcology/Require@2788b023ad191c29346ef8c64df71b937be307e2",
upgrade = FALSE)
}
## Notes:
## 1) if you are working from RStudio and have an older version of base packages like `Rcpp`, `rlang`
## (and others) installed, you may need to run the following lines (and code above) directly from R
## in order to update these base packages
## 2) Please ensure the appropriate Rtools version is installed (see)
## there seems to be a problem with `ragg` and a forced install solves it
if (!"ragg" %in% installed.packages(lib.loc = pkgPath)) {
install.packages("ragg")
}
Require::Require(c("PredictiveEcology/SpaDES.project@transition (HEAD)",
"PredictiveEcology/SpaDES.core@master (HEAD)",
## these will be needed later on:
"ggpubr",
"geodata",
"SpaDES.tools",
"PredictiveEcology/SpaDES.experiment@75d917b70b892802fed0bbdb2a5e9f3c6772f0ba"),
require = FALSE,
upgrade = FALSE)
Require::Require("SpaDES.core", install = FALSE) ## load only
setPaths(cachePath = file.path(mainPath, "cache"),
inputPath = file.path(mainPath, "inputs"),
modulePath = file.path(mainPath, "modules"),
outputPath = file.path(mainPath, "outputs"))
simPaths <- getPaths() ## check that this is what you wanted
mods <- list.files(file.path(projDir, "modules"), full.names = TRUE)
mods <- mods[basename(mods) %in% c("speciesAbundanceData", "climateData", "projectSpeciesDist")]
sapply(mods, function(mod, to) {
to <- file.path(to, basename(mod))
dir.create(to, showWarnings = FALSE)
ff <- list.files(mod, full.names = TRUE) ## don't recurse into dirs to preserve structure
file.copy(ff, to, recursive = TRUE, overwrite = TRUE) ## now recurse to copy files in dirs
}, to = simPaths$modulePath)
rm(mods)
```
Notice how we protect the `newModule` call with an `if` statement that first detects whether the module directory exists already.
This is necessary to prevent overwriting existing modules should this script be run a second time in the same location (see [Protect yourself and others from common mistakes/problems]).
`setPaths` will create the project folder directories in case they do not exist (no overwriting occurs in case they do).
Finally, note that we do not load any R packages yet, as we will later use `Require` to make sure all module dependencies are installed before running the simulation (see [Reproducible package installation]).
Because `Require` may attempt to install missing packages and because installing packages should be done in a clean R session, we will only load any packages after all the installation steps are complete.
### Data modules
The next two sections show our two data modules' `.R` scripts.
We assume you are already familiar with the different parts of a module `.R` script; if not, see [Chapter 1](Part1_DummyModel.qmd).
We do not discuss the module `.Rmd` files, which should document each module in detail (see [Module documentation -- module `.Rmd`]).
#### *speciesAbundanceData* module:
This module downloads freely available spatial layers of *Picea glauca* percent cover (% cover) across Canada and pre-processes them to match a user-supplied study area raster.
We use the new `terra` package throughout this example, since the `raster` package will soon be discontinued.
The `prepInputs` function downloads the % cover layer from the Canadian National Forest Inventory data server using the URL supplied by `sppAbundURL` and processes it to match the study area raster (`studyAreaRas`) supplied by the user.
The module then outputs *Picea glauca* % cover as a raster (`sppAbundanceRas`) and as a `data.table` (`sppAbundanceDT`).
The `data.table` contains added information about the year of the simulation during which the data should be used (here, only the first year when SDM fitting happens).
We export species % cover in two formats (a raster and a table) for demonstration purposes, but also because we could envision this model (i.e. group of modules) saving the species distribution projections for several points in time in the more compact format of a `data.table` -- large raster layers can consume a considerable amount of disk space (see [Coding for the future]).
```{r dataModule1}
#| code: !expr 'readLines(file.path("modules", "speciesAbundanceData", "speciesAbundanceData.R"))'
#| eval: false
```
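To make the pattern concrete outside the module code, here is a minimal, hedged sketch of the kind of `prepInputs` call the module makes -- the URL value is a placeholder for `sppAbundURL`, and the module itself passes these values via `sim`:
```{r prepInputsSketch}
#| eval: false
## Minimal sketch of the prepInputs pattern used in speciesAbundanceData;
## the url value is a placeholder standing in for sim$sppAbundURL.
sppAbundanceRas <- reproducible::prepInputs(
  url = "<URL to the % cover layer>",     ## download source
  destinationPath = simPaths$inputPath,   ## downloaded/extracted files go here
  fun = "terra::rast",                    ## how to load the target file
  rasterToMatch = studyAreaRas            ## reproject/crop/mask to the study area
)
```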
#### *climateData* module:
This module downloads and processes freely available spatial layers of four bioclimatic variables used to fit the SDM of *Picea glauca* in the study area.
This module downloads data in a different way.
It relies on two input `data.table`s that contain the URLs for each climate covariate -- one for baseline conditions, the other for projected climate conditions -- both containing information about when each layer should be used during the simulation (the "year" column).
We have only supplied one set of data sources for default baseline climate conditions (`baselineClimateURLs`) and for climate projections (`projClimateURLs`), all of which are downloaded from WorldClim at 2.5 minutes resolution.
The baseline climate data correspond to the 1970-2000 period [@FickHijmans2017], which aligns well with the species % cover data year (2001).
The climate projections were obtained for 2021-2040, 2041-2060, 2061-2080 and 2081-2100, from CMIP6 downscaled future projections using the CanESM5 model [@SwartEtAl2019] under the SSP 585 climate scenario.
We encourage providing different (or additional) URLs referring to projections for other climate periods, other climate models and other climate scenarios (see [WorldClim](https://www.worldclim.org/data/cmip6/cmip6climate.html) for a list of climate projections).
If providing other URLs to obtain different climate data, pay special attention to the "year" column of `projClimateURLs` -- the URLs need to correspond to the simulation year during which they will be used (not necessarily the actual climate year, unless the simulation years follow the same numbering).
Like in the `speciesAbundanceData` module, the `prepInputs` function processes the climate layers to match the study area raster (`studyAreaRas`) and compiles all climate data in the `climateDT` object and as raster layer objects (`baselineClimateRas` and `projClimateRas`) -- the module's outputs.
```{r dataModule2}
#| code: !expr 'readLines(file.path("modules", "climateData", "climateData.R"))'
#| eval: false
```
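Because these URL tables drive the module, it helps to see their expected shape. Below is a hedged sketch of a `projClimateURLs`-style table; only the `URL`, `targetFile` and "year" columns appear in this example, and the values shown are placeholders:
```{r projClimateURLsSketch}
#| eval: false
## Hedged sketch of the shape of projClimateURLs; values are placeholders.
projClimateURLs <- data.table::data.table(
  URL = c("<URL to 2021-2040 layers>", "<URL to 2041-2060 layers>"),
  targetFile = c("<2021-2040 file name>", "<2041-2060 file name>"),
  year = c(2, 3)   ## simulation years during which each set of layers is used
)
```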
We draw your attention to a few particular aspects of the data modules:
- How we took care to define the data classes of parameters, expected inputs and module outputs in their respective metadata sections;
- How we added additional R packages necessary to run the module to the metadata;
- How we added default values for parameters and inputs explicitly used by the modules (others like `.plotInterval` were left as `NA`).
The exception was the `studyAreaRas` input object, for which we do not provide a default.
However, we added a code check in `.inputObjects` that interrupts R if this object is not in `sim` (see [Protect yourself and others from common mistakes/problems]);
- How we use the function `prepInputs` to do most of the heavy-lifting of downloading data and spatial pre-processing.
This function is able to recognize whether the data has already been downloaded, and can cache all spatial processing tasks (see [Caching]).
In some cases, we wrapped `prepInputs` in a `Map` call to loop through several URLs and download and pre-process many data layers.
This `Map` call can also be cached with `Cache` (see the sketch after this list).
- How we use the function `Plots` to control plotting to the screen device and/or save to image files depending on the `P(sim)$.plots` argument.
Note that `Plots` works best with functions that output `ggplot` objects, or that are compatible with `quickPlot::Plot`.
- The fact that neither module depends on the other.
This is not a required feature of data modules, but just so happens to be the case in this example.
In fact, in more complex modelling frameworks, like the LandR model [@BarrosEtAlinreview], we often have several data modules that depend on each other (e.g., [LandR *Biomass_speciesData*](https://github.com/PredictiveEcology/Biomass_speciesData) sources and processes tree species percent cover data that is used by [LandR *Biomass_borealDataPrep*](https://github.com/PredictiveEcology/Biomass_borealDataPrep) to estimate several parameters for the forest landscape simulation model [LandR *Biomass_core*](https://github.com/PredictiveEcology/Biomass_core)).
- How we export objects created within the module functions to `sim`.
Without doing so, these objects are lost after the function is executed.
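As promised above, here is a hedged sketch of caching a `Map`-of-`prepInputs` loop; the argument values are placeholders rather than the module's actual defaults, and `climLayers` is an illustrative name:
```{r cacheMapSketch}
#| eval: false
## Hedged sketch: loop prepInputs over several URLs with Map, caching the whole loop.
climLayers <- Cache(
  Map,
  f = function(url, file) {
    reproducible::prepInputs(url = url, targetFile = file,
                             destinationPath = simPaths$inputPath,
                             fun = "terra::rast",
                             rasterToMatch = studyAreaRas)
  },
  projClimateURLs$URL,         ## one download per URL...
  projClimateURLs$targetFile   ## ...paired with its target file name
)
```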
### Prediction module
We show below the `.R` script for the *projectSpeciesDist* module.
This module depends entirely on the other two, as we did not provide any default input objects in the `.inputObjects` function.
This is, of course, not good practice, but again we warn the user early on (in the `.inputObjects` function) if the module cannot find the necessary inputs.
This module fits a machine learning SDM using the MaxEnt algorithm implemented in the `dismo` package.
We recommend having a look at [this guide](https://rspatial.org/raster/sdm/index.html) to learn about fitting SDMs with `dismo` and more.
Before fitting the SDM, the module converts any non-binary species data into presences and absences.
The main outputs are the species distribution projections, in the form of plots and a stacked raster layer (`sppDistProj`), and the fitted SDM object.
```{r simulationModule}
#| code: !expr 'readLines(file.path("modules", "projectSpeciesDist", "projectSpeciesDist.R"))'
#| eval: false
```
We draw your attention to:
- As we said earlier, we could have added yearly projected values to the `sppAbundanceDT` table.
In this case we probably would have changed this object's name, since MaxEnt is not modelling species abundance, but probability of occurrence.
We suggest this as an exercise to do on your own.
- How links with the data modules are established by declaring data modules' output objects as expected inputs for this module.
- How the `fitSDM` event does not schedule itself (the SDM only needs to be fitted once).
- How, unlike `fitSDM`, the `projSDM` event schedules itself so that model projections are executed for each year of the simulation, provided that there is corresponding environmental data -- notice how the functions `fitSDMEvent` and `projSDMEvent` both check that there is data for the current year of the simulation (`time(sim)`; see the sketch after this list).
- How the fitted model object (`sdmOut`) and its evaluation (`evalOut`) are both module outputs.
This way these objects can not only be used by other events, but also inspected by the user after the simulation is finished (see [Transparent models]).
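The self-scheduling pattern mentioned in the list above looks roughly like this (a hedged sketch: the module's real event function does more, `predictDist` is a hypothetical helper, and the `year` column of `climateDT` is assumed):
```{r selfSchedulingSketch}
#| eval: false
## Hedged sketch of a self-scheduling event function; predictDist is hypothetical.
projSDMEvent <- function(sim) {
  if (time(sim) %in% sim$climateDT$year) {  ## only project if there is data for this year
    sim <- predictDist(sim)                 ## hypothetical prediction step
  }
  ## re-schedule this event for the next simulation year
  sim <- scheduleEvent(sim, time(sim) + 1, "projectSpeciesDist", "projSDM")
  sim
}
```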
You will notice that this module performs model fitting (i.e., calibration), predictions and model validation.
These three components could be broken into three separate modules.
As an exercise, we recommend trying to do so on your own.
### Additional module scripts and functions
All functions needed by a module can be coded within the module script.
Often they will be located after all the event functions and before the `.inputObjects` function; however, their exact location is up to the developer.
For complex modules involving many (potentially long) functions, keeping all functions within the module's script quickly makes it long and hard to maintain.
Instead, we recommend that additional (i.e., non-event) functions be put into separate scripts in the module's `R/` folder.
`SpaDES` will source any scripts contained in this folder and associate their objects/functions with the respective module.
*Note that these functions/objects will only be available to the module where the `R/` folder is located.*
Hence, if one or more functions are meant to be shared across modules they will need to either:
- be defined in `.GlobalEnv` (the global environment -- see `?.GlobalEnv`) before running `simInit`;
- be put in an `R` package that is included in the modules' metadata and installed like other packages -- hence the package must be in an online repository like GitHub or CRAN;
- be put in a *local* `R` package that is included in the modules' metadata and installed (e.g., `devtools::install_local`) or loaded (e.g. `devtools::load_all("<pathToPackage>")`) separately by the user beforehand, as `Require` is not able to install local packages.
In this example, we put the plotting functions called by `Plots` into separate scripts for each module where they are used.
Modules `speciesAbundanceData` and `projectSpeciesDist` both have the function `plotSpatRaster` in a `plotSpatRaster.R` script saved in their respective `R/` folders:
```{r plotSpatRaster}
#| code: !expr 'readLines(file.path("modules", "projectSpeciesDist", "R", "plotSpatRaster.R"))'
#| eval: false
```
The `climateData` module uses the function `plotSpatRasterStk`, which is in the `plotSpatRasterStk.R` script in its `R/` folder:
```{r plotSpatRasterStk}
#| code: !expr 'readLines(file.path("modules", "climateData", "R", "plotSpatRasterStk.R"))'
#| eval: false
```
Go ahead and create these scripts with the functions above and put them in the right `R/` folders.
------------------------------------------------------------------------
## Running the model
### Ensuring all packages are installed
After the modules are created, we go back to the `Part2_SDMs.R` script to set up and run the simulation.
We first ensure that all module dependencies (and their dependencies and so on) are installed in `.libPaths()`.
Only then do we load packages necessary to run the simulation.
Note that the `dismo` package (a dependency of the `projectSpeciesDist` module) needs `rJava` to run `maxent`.
In turn, `rJava` needs a working Java installation (Java can be downloaded [here](https://www.java.com/en/download/manual.jsp)).
Below you will see that we attempt to warn the user about problems loading `rJava`, which are likely related to Java not being found on the system.
```{r simulationPkgs}
#| message: false
#| warning: false
#| results: hide
#| cache: false
outs <- SpaDES.project::packagesInModules(modulePath = simPaths$modulePath) ## gets list of module dependencies
Require::Require(c(unname(unlist(outs)),
"DiagrammeR"),
require = FALSE, ## don't load packages
upgrade = FALSE, ## don't upgrade dependencies
standAlone = TRUE, ## install all dependencies in proj-lib (ignore user/system lib)
purge = TRUE)
## now load packages - SpaDES.core may have been loaded already, which is fine
Require::Require(c("reproducible", "SpaDES.core", "SpaDES.experiment"),
install = FALSE)
## dismo needs a few tweaks to run MaxEnt
out <- reproducible::preProcess(targetFile = "maxent.jar",
url = "https://github.com/mrmaxent/Maxent/blob/master/ArchivedReleases/3.4.4/maxent.jar?raw=true",
destinationPath = simPaths$inputPath,
fun = NA)
file.copy(out$targetFilePath, file.path(system.file("java", package="dismo"), "maxent.jar"),
overwrite = TRUE)
out <- require(rJava)
if (!out) {
stop(paste("Your Java installation may have problems, please check.\n",
"See https://www.java.com/en/download/manual.jsp for Java installation"))
}
```
**/!\\ ATTENTION /!\\**
*Windows can present problems when many packages are installed, causing further package installations to fail. If you see errors like this after restarting R:*
```
Installing: glue Detaching is fraught with many potential problems; you may
have to restart your session if things aren't working some packages don't seem
to unload their dlls correctly.
These will not be unloaded: ellipsis, vctrs Unloading package bit64 --
Installing glue -- (1 of 1. Estimated time left: ...; est. finish: ...calculating)
Installing package into '~/R/win-library/4.0' (as 'lib' is unspecified) trying URL
'https://cran.rstudio.com/bin/windows/contrib/4.0/glue_1.6.2.zip' Content type
'application/zip' length 171858 bytes (167 KB) downloaded 167 KB
package 'glue' successfully unpacked and MD5 sums checked
Error in unpackPkgZip(foundpkgs[okp, 2L], foundpkgs[okp, 1L], lib, libs_only, :
  ERROR: failed to lock directory '~\R\win-library\4.0' for modifying
Try removing '~\R\win-library\4.0/00LOCK'
```
*If you encounter this error, delete the problematic file/folder and try again.*
*Sometimes `Require` may still be unable to automatically install a package, and a manual installation from a clean session is the only solution. In the error above, `Require` only detected that `glue` was missing during the `simInit` call, which meant that other packages had been loaded already, causing failure when `Require` attempted the installation. This problem persisted even after we avoided loading `terra` before running `simInit` (we create the study area objects by prefixing functions with `terra::`), so `glue` had to be manually installed.*
### Simulation set-up
The simulation folder directories were already set up before creating the modules (see above), but it is still necessary to create a few lists that will be passed to the `simInit` function, which initializes the simulation.
These lists define the modules used in the simulation (`simModules`), the start and end of the simulation (`simTimes`), the parameters passed to each module (`simParams`) and external input objects (`simObjects`) like the study area (`studyAreaRas`).
The `studyAreaRas` is created from a random polygon drawn in SW Alberta, Canada, using `SpaDES.tools::randomStudyArea` (@fig-studyAreaCanada).
We also define a few useful global options:
- `reproducible.cachePath` and `reproducible.destinationPath` define the cache directory and the directory where downloaded and processed data will be stored;
- `reproducible.useCache` and `reproducible.useTerra`, which will activate caching and the use of the `terra` package across all `Cache` and `prepInputs` function calls.
```{r simulationSetup}
#| warning: false
#| message: false
## a few important options:
options(reproducible.useCache = TRUE,
reproducible.cachePath = simPaths$cachePath,
reproducible.destinationPath = simPaths$inputPath, ## all downloaded and pre-processed layers go here
reproducible.useTerra = TRUE, ## we want to use the terra R package
spades.moduleCodeChecks = FALSE,
spades.useRequire = FALSE)
## list the modules to use
simModules <- list("speciesAbundanceData", "climateData", "projectSpeciesDist")
## Set simulation and module parameters
simTimes <- list(start = 1, end = 5, timeunit = "year")
## we create two lists of parameters, one using the default MaxEnt
## the other a GLM
simParamsMaxEnt <- list(
"speciesAbundanceData" = list(
".plots" = c("png"),
".useCache" = FALSE
),
"climateData" = list(
".plots" = c("png"),
".useCache" = FALSE
),
"projectSpeciesDist" = list(
"statModel" = "MaxEnt",
".plots" = c("png"),
".useCache" = FALSE
)
)
simParamsGLM <- simParamsMaxEnt
simParamsGLM$projectSpeciesDist$statModel <- "GLM"
## make a random study area.
## Here use seed to make sure the same study area is always generated
studyArea <- SpaDES.tools::randomStudyArea(size = 1e10, seed = 123)
studyAreaRas <- terra::rasterize(studyArea,
terra::rast(extent = terra::ext(studyArea),
crs = terra::crs(studyArea, proj = TRUE),
resolution = 1000))
simObjects <- list(
"studyAreaRas" = studyAreaRas
)
## Simulation setup - create two simulations, one for MaxEnt another for GLM
## SpaDES.experiment::experiment2, will take care of subdirectories to store outputs
mySimMaxEnt <- simInit(times = simTimes, params = simParamsMaxEnt,
modules = simModules, objects = simObjects,
paths = simPaths)
mySimGLM <- simInit(times = simTimes, params = simParamsGLM,
modules = simModules, objects = simObjects,
paths = simPaths)
```
```{r fig-studyAreaCanada}
#| echo: false
#| fig-align: center
#| fig-cap: "Study area within Canada."
## note that we need to make studyArea again because knitr cache and terra obj (which have pointers) do not work well together.
studyArea <- SpaDES.tools::randomStudyArea(size = 1e10, seed = 123)
can1 <- geodata::gadm(country = "CAN", level = 1, path = simPaths$inputPath)
can1 <- terra::project(can1, terra::crs(studyArea, proj = TRUE))
terra::plot(can1)
terra::plot(studyArea, add = TRUE, col = "darkgreen")
```
Before running the simulation we look at the module linkage diagrams produced by `moduleDiagram` (@fig-moduleDiagram2) and `objectDiagram` (@fig-objectDiagram2) to assess whether modules are linked as expected.
```{r moduleDiagramsCode}
#| eval: false
moduleDiagram(mySimMaxEnt)
objectDiagram(mySimMaxEnt)
```
```{r fig-moduleDiagram2}
#| echo: false
#| fig-align: center
#| fig-cap: "Module network diagram."
#| fig-height: 6
#| fig-width: 6
#| message: false
#| warning: false
moduleDiagram(mySimMaxEnt)
```
```{r fig-objectDiagram2}
#| echo: false
#| fig-align: center
#| fig-cap: "Module diagram showing module inter-dependencies with object names."
knitr::include_graphics(normalizePath(file.path(projDir, "figures/Part2_objectDiagram.png"), winslash = "/"),
dpi = NA)
```
### Simulation runs
To run the simulation, we can call `spades` on the output `simList`s (called `mySimMaxEnt` and `mySimGLM` here) generated by `simInit`, or use `experiment2` from the `SpaDES.experiment` package.
`experiment2` will run as many simulations as there are `simList`s and organise outputs into sub-folders within the `simPaths$outputPath` directory.
It can also repeat simulations (`rep` argument) and parallelise across replicates using `future`.
See `?experiment2` for examples.
We advise using `spades` when running the model for the first time.
Passing the argument `debug = TRUE` will print the progress of the simulation in detail.
This helps diagnose problems when the simulation fails, and also shows which events are being executed and when particular cache calls are activated.
```{r runSimulation}
#| eval: false
## run simulation
clearPlot(force = TRUE) ## this forces wiping the graphics device and opening a new window
## This runs one simulation and stores outputs in the main 'outputs' folder
## - not what we want, but good for testing
# mySimOut <- spades(mySimMaxEnt, debug = TRUE)
## Better to use when spades runs error-free on the simLists
myExperiment <- experiment2(MaxEnt = mySimMaxEnt,
GLM = mySimGLM,
debug = TRUE,
replicates = 1,
clearSimEnv = FALSE) ## prevent removing objects from the simLists at the end
## save outputs
qs::qsave(myExperiment, file.path(simPaths$outputPath, paste0("myExperiment", ".qs")))
```
Try to execute the `spades` call twice to see how much faster it runs after many of the operations have been cached (see also [Caching]).
By default the data modules (*speciesAbundanceData* and *climateData*) save figures of the input species and climate layers (@fig-simulationSppAbund and @fig-simulationClimate, respectively).
```{r fig-simulationSppAbund}
#| echo: false
#| warning: false
#| message: false
#| fig-align: center
#| out-width: 50%
#| fig-show: hold
#| fig-cap: "Prediction plots. Input *Picea glauca* percent cover across the landscape. Note that values are converted to presence/absence."
knitr::include_graphics(normalizePath(file.path(projDir, "outputs/GLM_rep1/figures/speciesAbundance.png"), winslash = "/"))
```
```{r fig-simulationClimate}
#| echo: false
#| warning: false
#| message: false
#| fig-align: center
#| out-width: 80%
#| fig-show: hold
#| layout-ncol: 2
#| fig-cap: "Prediction plots. Bioclimatic variables under baseline (year 1) and future conditions."
#| fig-subcap: ""
figs <- list.files(file.path(projDir, "outputs/GLM_rep1/figures/"), pattern = "climateRas", full.names = TRUE)
knitr::include_graphics(normalizePath(figs, winslash = "/"))
```
The prediction module also outputs the projections for each climate period automatically (@fig-simulationProj and @fig-simulationProj2).
```{r fig-simulationProj}
#| echo: false
#| warning: false
#| message: false
#| fig-align: center
#| layout-ncol: 2
#| out-width: 80%
#| fig-show: hold
#| fig-cap: "Prediction plots: Raw predicted values of species probability of occurence under (left to right) baseline climate conditions (first year of simulation), 2021-2040, 2041-2060, 2061-2080 and 2081-2100 climate conditions (second to fifth years of simulation) - using MaxEnt."
#| fig-subcap: ""
figs <- list.files(file.path(projDir, "outputs/MaxEnt_rep1/figures/"), pattern = "projRawVals_MaxEnt", full.names = TRUE)
knitr::include_graphics(normalizePath(figs, winslash = "/"))
```
```{r fig-simulationProj2}
#| echo: false
#| warning: false
#| message: false
#| fig-align: center
#| layout-ncol: 2
#| out-width: 80%
#| fig-show: hold
#| fig-cap: "Prediction plots: Predictions of *Picea glauca* presence/absence under (left to right) baseline climate conditions (first year of simulation), 2021-2040, 2041-2060, 2061-2080 and 2081-2100 climate conditions (second to fifth years of simulation) - using MaxEnt."
#| fig-subcap: ""
figs <- list.files(file.path(projDir, "outputs/MaxEnt_rep1/figures/"), pattern = "projPA_MaxEnt", full.names = TRUE)
knitr::include_graphics(normalizePath(figs, winslash = "/"))
```
The projected layers can also be accessed and plotted via the `simList` object, as can the model validation results.
From the results we can see that the MaxEnt and GLM predictions do not seem to agree, indicating a potential problem.
We may be missing important covariates, interactions, or simply more appropriate algorithms.
Perusing each model's estimated coefficients, residuals and validation results would be a good first step to diagnosing the problem.
```{r evalOutSim}
#| eval: false
myExperiment$MaxEnt_rep1$sdmOut ## this links to an html page
sets <- par(mfrow = c(2,2))
plot(myExperiment$GLM_rep1$sdmOut)
par(sets)
## check validation results for the two models
myExperiment$MaxEnt_rep1$evalOut
myExperiment$GLM_rep1$evalOut
```
#### Adding a new climate scenario
Because data were linked to the modules (and the forecasting) via the modules' metadata and inputs, adding a new climate scenario and re-running forecasts is easy.
To do so, we need only to change the URLs for the climate layers, by passing a custom `projClimateURLs` `data.table` to the *climateData* module.
`SpaDES` will take care of downloading and processing the new layers, as well as forecasting.
Model fitting will also be repeated, even if the baseline data did not change, because the `kfold` function we use to partition the data into the training and testing subsets randomly assigns cases to each group.
If this was not desired, we could set a random seed before running the fitting event (`fitSDM`) by passing the `.seed` parameter to the *projectSpeciesDist* module (e.g., `.seed = list("fitSDM" = 123)`).
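For instance (a minimal sketch, reusing the parameter list created earlier):
```{r seedSketch}
#| eval: false
## Fix the RNG seed for the fitSDM event only, so data partitioning is repeatable.
simParamsMaxEnt$projectSpeciesDist$.seed <- list(fitSDM = 123)
```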
```{r newScenarios}
#| eval: false
## Run with another climate scenario - the most contrasting scenario to SSP 585
## get the original table from one of the simulations and replace the climate scenario
projClimateURLs <- copy(mySimMaxEnt$projClimateURLs)
projClimateURLs[, `:=`(URL = gsub("ssp585", "ssp126", URL),
targetFile = gsub("ssp585", "ssp126", targetFile))]
## this time we pass the new table or URLs to the modules, so that climate layers are changed
simObjects2 <- list(
"studyAreaRas" = studyAreaRas,
"projClimateURLs" = projClimateURLs
)
mySimMaxEnt2 <- simInit(times = simTimes, params = simParamsMaxEnt,
modules = simModules, objects = simObjects2,
paths = simPaths)
mySimGLM2 <- simInit(times = simTimes, params = simParamsGLM,
modules = simModules, objects = simObjects2,
paths = simPaths)
myExperiment2 <- experiment2(MaxEnt = mySimMaxEnt2,
GLM = mySimGLM2,
debug = TRUE,
replicates = 1,
clearSimEnv = FALSE)
## save outputs
qs::qsave(myExperiment2, file.path(simPaths$outputPath, paste0("myExperiment2", ".qs")))
```
#### Proposed exercises
1. try changing the climate layers (e.g., use different climate scenarios or General Circulation models) and rerunning predictions;
2. try adding other statistical algorithms;
3. try breaking up the prediction module into three modules: a calibration module, a prediction module and a validation module.
Have fun!
#### Making use of `simList` for reporting
Another advantage of having all simulation parameters, inputs and outputs centralised in one object is that we can easily inspect and manipulate them afterwards, without needing to load separate objects back into `R`.
Here we show how we capitalize on this `SpaDES` feature to create figures of the outputs (@fig-newScenarios).
```{r newScenariosFigCode}
#| eval: false
## MaxEnt predictions across time and for each climate scenario --------------
## combine plots from two distinct simulations in a single figure
## (the same can be done to compare MaxEnt and GLM, or plot all projections)
## fetch the internal plotting function instead of repeating code here
plotFun <- mySimMaxEnt$.mods$climateData$plotSpatRasterStk
## raw predictions exported by the module
sppDistProjMaxEnt <- myExperiment$MaxEnt_rep1$sppDistProj
sppDistProjMaxEnt2 <- myExperiment2$MaxEnt_rep1$sppDistProj
## we convert the raw predictions into presence absence
## using exported threshold
sppDistProjMaxEnt_PA <- as.int(myExperiment$MaxEnt_rep1$sppDistProj > myExperiment$MaxEnt_rep1$thresh)
sppDistProjMaxEnt2_PA <- as.int(myExperiment2$MaxEnt_rep1$sppDistProj > myExperiment2$MaxEnt_rep1$thresh)
## rename layers for plotting
names(sppDistProjMaxEnt) <- names(sppDistProjMaxEnt2) <- c("2001", "2021-2040", "2041-2060", "2061-2080", "2081-2100")
names(sppDistProjMaxEnt_PA) <- names(sppDistProjMaxEnt2_PA) <- c("2001", "2021-2040", "2041-2060", "2061-2080", "2081-2100")
## for a simpler plot choose only years 2001, 2041-2060 and 2081-2100
yrs <- c("2001", "2041-2060", "2081-2100")
plotMaxEnt <- plotFun(sppDistProjMaxEnt[[yrs]],
xlab = "Longitude", y = "Latitude",
plotTitle = "MaxEnt raw predictions - SSP 585") +
scale_fill_viridis_c(na.value = "grey90", limits = c(0,1), begin = 0.25)
plotMaxEnt2 <- plotFun(sppDistProjMaxEnt2[[yrs]],
xlab = "Longitude", y = "Latitude",
plotTitle = "MaxEnt raw predictions - SSP 126") +
scale_fill_viridis_c(na.value = "grey90", limits = c(0,1), begin = 0.25)
plotMaxEnt_PA <- plotFun(sppDistProjMaxEnt_PA[[yrs]],
xlab = "Longitude", y = "Latitude",
plotTitle = "MaxEnt presence/absence - SSP 585") +
scale_fill_viridis_c(na.value = "grey90", limits = c(0,1), begin = 0.25)
plotMaxEnt2_PA <- plotFun(sppDistProjMaxEnt2_PA[[yrs]],
xlab = "Longitude", y = "Latitude",
plotTitle = "MaxEnt presence/absence - SSP 126") +
scale_fill_viridis_c(na.value = "grey90", limits = c(0,1), begin = 0.25)
## organise the plots with mildest scenario first
## It is clear that MaxEnt and GLM do not agree in their prediction
plotAll <- ggpubr::ggarrange(
plotMaxEnt2 +
labs(title = expression(bold("Scenario - SSP 126")),
y = expression(atop(bold("Raw predictions"), "Latitude"))) +
theme(legend.title = element_blank(), legend.key.height = unit(3, "lines"),
plot.title = element_text(hjust = 0.5), plot.margin = margin(0,0,0,0)),
plotMaxEnt +
labs(title = expression(bold("Scenario - SSP 585")),
y = expression(atop(bold(""), ""))) +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(0,0,0,0)),
plotMaxEnt2_PA +
labs(title = expression(bold("")),
y = expression(atop(bold("Presence/absence"), "Latitude"))) +
theme(plot.margin = margin(0,0,0,0)),
plotMaxEnt_PA +
labs(title = expression(bold("")),
y = expression(atop(bold(""), ""))) +
theme(plot.margin = margin(0,0,0,0)),
legend = "right", common.legend = TRUE, labels = c("a)", "b)", "c)", "d)"))
## save figure:
figDir <- checkPath(file.path(simPaths$outputPath, "generalFigures"), create = TRUE)
ggsave(file.path(figDir, "MaxEntPredictions.png"), plotAll, width = 13.5, height = 5.5, units = "in", dpi = 300)
```
```{r fig-newScenarios}
#| echo: false
#| warning: false
#| message: false
#| fig-align: center
#| out-width: 80%
#| fig-show: hold
#| fig-cap: "Adding a new scenario: Predictions of *Picea glauca* probabilities of presences and presence/absence under (left to right) baseline climate conditions, 2041-2060, and 2081-2100 climate projections under two emission scenarios (SSP 136 and SSP 585, the default) -- showing MaxEnt forecasts only."
knitr::include_graphics(normalizePath(file.path(projDir, "outputs/generalFigures/MaxEntPredictions.png"), winslash = "/"))
```
------------------------------------------------------------------------
## Caching
In this example, we relied on caching to avoid having to repeat computationally intensive operations.
Running the `simInit` and `spades` calls a second time (even after restarting the R session) was faster, and `SpaDES` informed us of instances where cached objects were being retrieved:
```
(...)
Mar05 19:56:53 clmtDt 1 climateData init 1
Mar05 19:56:53 clmtDt ...(Object to retrieve (a7816e2d0deb3b29.rds))
Mar05 19:56:53 clmtDt loaded cached result from previous Map call
(...)
```
Caching in `SpaDES` is managed by the `reproducible` package, and can be generally broken down into two types: explicitly coded by the module developer, or internal to `SpaDES` functions.
### Explicitly caching operations
Throughout the data modules we explicitly cached several data preparation operations using the functions `Cache` and `prepInputs` from the `reproducible` package.
In brief, `Cache` searches for a stored (i.e. cached) output of a given function call; if it does not find it, `Cache` executes the function call, saves its output and saves information about the function inputs and the function's code.
If it does find it, `Cache` compares the present inputs and function code against their cached counterparts.
In case of a mismatch, the function call is executed again and re-cached.
`prepInputs` calls `Cache` internally at several points, notably to cache spatial processing tasks (e.g. projecting and cropping spatial layers to a study area raster).
Another great feature of `prepInputs` is that when it has a source URL for the target file (as when we used `prepInputs` to download species % cover and climate layers), it first checks whether the data have already been downloaded (and potentially extracted from an archive, e.g. a `.zip` file).
This is not the same thing as caching, but also avoids unnecessary downloads that can be time consuming.
Note that caching operations involving stochasticity should be avoided, as caching will prevent new random outputs from being generated.
We recommend exploring the examples available in the `Cache` and `prepInputs` R documentation to learn more about their capabilities.
In particular, read about `showCache`, `clearCache` and the argument `userTags`, which allow consulting and deleting cached files.
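For example (hedged sketches; the `userTags` value is illustrative):
```{r cacheToolsSketch}
#| eval: false
## Consult and clean the cache; the tag value below is illustrative.
showCache(userTags = "climateData")                 ## list matching cache entries
clearCache(userTags = "climateData", ask = FALSE)   ## delete them
```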
**/!\\ ATTENTION /!\\**
*`Cache` does not deal well with the `apply` family of functions, which is why we used `Map` (instead of `mapply`) to iteratively apply `prepInputs` to several climate layer URLs.*
### Implicit caching of events
`SpaDES` offers implicit caching of events via the global parameter `.useCache`, which comes in the template modules generated by `newModule`.
We call this "implicit" caching, because the developer does not need to add any caching mechanisms to the module code.
`SpaDES` automatically reads the value of the `.useCache` parameter and activates caching in the module accordingly.
This parameter can be used to cache (or not) all or some module events (in their entirety).
In our example, we have not cached any events, but passing `.useCache = ".inputObjects"` or `.useCache = "init"` would cache those events.
Notice how loading cached events (last line below) produces a slightly different message from loading other cached operations (second line below):
```
Mar05 19:58:34 spcsbn 1 speciesAbundanceData init 1
Mar05 19:58:34 spcsbn ...(Object to retrieve (bffbc48cc055c846.rds))
Mar05 19:58:35 spcsbn loaded cached copy of init event in speciesAbundanceData module.
```
We have noted that `terra` objects have issues with caching because they rely on pointers.
This is a behaviour that we started observing in September 2022 and it may be resolved by the `terra` developers in the future.
### Controlling caching without changing module code
The user can turn caching on/off without changing module code via three different mechanisms:
- via the `.useCache` parameter -- as explained above ([Implicit caching of events]), setting this parameter controls event caching inside a module;
- via `options("reproducible.useCache")` -- setting this option to `TRUE` or `FALSE` in the global environment (`.GlobalEnv`) will affect *all* caching (inside and outside `SpaDES` modules and the simulation);
- via the argument `spades(.useCache = ...)` -- this argument behaves in the same way as the `.useCache` module parameter, but supersedes it across *all* modules (i.e. if `spades(..., .useCache = FALSE)`, caching will be turned off even if a module's `.useCache` is `TRUE`).
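The three mechanisms above would look roughly like this in a script (a hedged sketch using this example's objects):
```{r cacheControlSketch}
#| eval: false
## 1) globally, inside and outside SpaDES modules:
options(reproducible.useCache = FALSE)
## 2) per module/event, via the .useCache parameter:
simParamsMaxEnt$climateData$.useCache <- c(".inputObjects", "init")
## 3) per run, superseding all modules:
mySimOut <- spades(mySimMaxEnt, .useCache = FALSE)
```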
------------------------------------------------------------------------
## Best practices
### Reproducible package installation
When sharing code, it is good practice to provide other users with a list of necessary packages (e.g. by listing the sequence of `library` calls at the start of a script).
We go a step further and advise users to provide code that automatically installs all necessary packages at the start of their controller script.
In addition, all modules should contain a full list of the packages they depend on, and any particular versions necessary.
If `options("spades.useRequire")` is set to `TRUE` (the default), `SpaDES` will automatically attempt to install any packages listed across all modules if they are not installed in `.libPaths()`, or if the installed version (or branch if installing from GitHub) does not correspond to what is listed in the module `.R` script.
Users can also use `Require::pkgSnapshot()` to save a list of installed packages that can be used later by `Require` to install all necessary packages on another machine (see example below).
Please beware that package installation should be done, as much as possible, from a clean R session, especially in the context of a `SpaDES`-based project, where each module can potentially have many different dependencies, which have dependencies of their own (see, for instance, how we delayed package loading until after all modules were in place and had their dependencies checked in `Part2_SDMs.R`).
```{r pkgSnapshot}
#| eval: false
Require::pkgSnapshot("pkgsnapshot.txt", libPaths = .libPaths()[1]) ## you should make sure the right .libPath is chosen
## on another machine:
Require::Require(packageVersionFile = "pkgsnapshot.txt")
## See ?Require::pkgSnapshot() for more examples.
```
### Protect yourself and others from common mistakes/problems
A developer should put in place code checks, warnings and messages that protect and warn the user against common mistakes or issues.
Some of these fall in the category of *code assertions* -- small tests that verify a snippet of code.
More complex tests that assess whether the module (or a group of modules) is producing expected results from, e.g., an ecological point of view fall in the category of *integration tests*.
Here, we only talk about code assertions.
A common assertion is to verify that input format and class conform to what the function expects.
If this is not the case, the developer may add a mechanism to correct the faulty inputs (potentially with a warning or message telling the user it did so) or simply stop the computations with a meaningful error.
We provide two examples in the `climateData` module, where the `climateInit` function checks whether the bioclimatic variable names are consistent between the baseline and projected climate data, and whether their raster layers match.
Other assertions can prevent undesirable function behaviours, such as the `if` statement protecting the `newModule` call in `Part2_SDMs.R`, or warn the user early on that something is missing, such as the check for the existence of `studyAreaRas` in the `.inputObjects` of the data modules.
Bear in mind that these are just examples; assertions and integration tests are as diverse as the code they test.
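As an illustration, here is a hedged sketch of the `studyAreaRas` check described above, as it could appear in a module's `.inputObjects` function (the real modules' checks differ in detail):
```{r assertionSketch}
#| eval: false
## Hedged sketch of an assertion in .inputObjects; the message is illustrative.
.inputObjects <- function(sim) {
  if (!suppliedElsewhere("studyAreaRas", sim)) {
    stop("Please provide a 'studyAreaRas' raster delimiting the study area.")
  }
  return(invisible(sim))
}
```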
### Readable code
There are several guides on how to write reader-friendly code.
Even if the developer is forever the sole reader of their own code, there are benefits to writing readable code.
First, working on it is less tiresome.
Second, we quickly forget why we wrote code in a certain way.
Code that is well documented and readable is easier to "come back to" and adapt.
We follow many of the [recommendations by Hadley Wickham](https://adv-r.had.co.nz/Style.html), and highlight below those that we find particularly important:
- spacing around operators;
- spacing before left parenthesis, except in a function call;
- adding curly braces after `if`, `else`, `for` and `function`, unless they are very short statements;
- thoroughly commenting the code;
- naming functions meaningfully and avoiding re-using existing function names (e.g. avoid `c <- function (...) {}`, as `c` is already a `base` function).
You can automatically clean up and format your code using the `styler` package.
This package provides an RStudio addin to easily style a block of selected code, or an entire file.
### Module documentation -- module `.Rmd`
When modules are created using `newModule`, this function provides a template module `.Rmd` file that is meant to document the module.
The template suggests a few key sections that should be part of any module's documentation.
Notably, an overview of the module and of its inputs, parameters, outputs and general event flow, together with more in-depth descriptions of each of these sections.
The documentation may also contain reproducible examples of how a module can be used, although this is not always relevant.
For instance, data modules are often meaningless without downstream modules that use their outputs.
We invite the reader to see the manual of our forest landscape simulation model [LandR *Biomass_core*](https://htmlpreview.github.io/?https://github.com/PredictiveEcology/Biomass_core/blob/development/Biomass_core.html), as an example of how we document some of our `SpaDES` modules.
### Coding for the future
We often make coding decisions that we regret a few months down the line.
This is why, as module developers, it is a good idea to think about other possible applications of a module, or potential avenues for expansion.
For instance, trying to imagine whether the module can be scaled up or transferred to different study areas may influence the format of expected inputs and outputs.
In our example, we exported the same type of information (species % cover and climate data) as raster layers and as tables, because we could foresee that the tables could be used to store several projections in a more compact format.
### Transparent models
Model transparency is not only about using open source code and making it available.
Providing easy access to model data, parameters and outputs is also important.
For instance, in our example we deliberately exported the fitted statistical model (`sdmOut`), data (`sdmData`) and evaluation statistics (`evalOut`) so that they can be more easily inspected by the user, without needing to "dive into" the code.
`SpaDES` also offers the ability to save any objects that are exported to the `simList` object *without having to change module code*.
To do so, the user passes a `data.frame` of object names and (optionally) the simulation times at which they should be saved to the `simInit(outputs = ...)` argument.
Because objects are saved as `.rds` files by default, any object class can be saved to disk (see `?outputs` for more information).
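For example (a hedged sketch using this example's object names):
```{r saveOutputsSketch}
#| eval: false
## Save sdmOut and evalOut at the end of the simulation, without changing module code.
simOutputs <- data.frame(objectName = c("sdmOut", "evalOut"),
                         saveTime = simTimes$end)
mySimMaxEnt <- simInit(times = simTimes, params = simParamsMaxEnt,
                       modules = simModules, objects = simObjects,
                       paths = simPaths, outputs = simOutputs)
```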
------------------------------------------------------------------------
## Additional notes
`SpaDES` is an extremely powerful family of R packages, whose potential goes well beyond what has been discussed here.
We recommend going to the [`SpaDES` webpage](https://predictiveecology.org/) to find out more about the `SpaDES` R modelling platform, upcoming workshops and publications.
See also the [Predictive Ecology GitHub repository](https://github.com/PredictiveEcology/) for a list of all available `SpaDES` modules and `SpaDES`-related packages that we maintain.
We wish to acknowledge the World Climate Research Programme, which coordinated and promoted CMIP6, and thank the climate modelling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and WorldClim for downscaling and sharing climate projections and preparing bioclimatic variables.
------------------------------------------------------------------------
<center>**Happy SpaDESing!**</center>