README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(INLAvaan)
library(lavaan)
library(blavaan)
library(tidyverse)
library(semPlot)
library(semptools)
```

## `{INLAvaan}`

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/haziqj/INLAvaan/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/haziqj/INLAvaan/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/haziqj/INLAvaan/branch/main/graph/badge.svg)](https://app.codecov.io/gh/haziqj/INLAvaan?branch=main)
<!-- badges: end -->

> Bayesian structural equation modelling with INLA.

**Soon-ish features**

1. Model fit indices (PPP, xIC, RMSEA, etc.)
2. Prior specification.
3. Fixed values and/or parameter constraints.
4. Specify different families for different observed variable. 
5. Standardised coefficients.

**Long term plan**

1. "Non-iid" models, such as spatio-temporal models.
2. Multilevel-ish kind of models (2-3 levels).
3. Covariates.
4. Multigroup analysis (in principle this is simple, but I have bigger plans for this).
5. Missing data imputation.

## Installation

You need a working installation of [INLA](https://www.r-inla.org).
Following the official instructions given [here](https://www.r-inla.org/download-install), run this command in R:

```r
install.packages(
  "INLA",
  repos = c(getOption("repos"), 
            INLA = "https://inla.r-inla-download.org/R/stable"), 
  dep = TRUE
)
```

Then, you can install the development version of `{INLAvaan}` from GitHub with:

```r
# install.packages("pak")
pak::pak("haziqj/INLAvaan")
```

## First impressions

A simple two-factor SEM with six observed, correlated Gaussian variables.
Let $i=1,\dots,n$ index the subjects.
Conditional on the values of $k$-th latent variable $\eta_{ki}$ for subject $i$, the six measurement model equations are

<br>
<p align="center">
  <img src="man/figures/measeqn.gif" width="40%" style="display: block; margin: auto;" />
</p>
<br>
<!-- $$ -->
<!-- \begin{gathered} -->
<!-- y_{1i} = \lambda_{11} \eta_{1i} \phantom{+ \lambda_{1} \eta_{2i}} + \epsilon_{1i}, \quad \epsilon_{1i} \sim N(0, \theta_{11}) \\ -->
<!-- y_{2i} = \lambda_{21} \eta_{1i} \phantom{+ \lambda_{1} \eta_{2i}} + \epsilon_{2i}, \quad \epsilon_{2i} \sim N(0, \theta_{22}) \\ -->
<!-- y_{3i} = \lambda_{31} \eta_{1i} \phantom{+ \lambda_{1} \eta_{2i}}  + \epsilon_{3i}, \quad \epsilon_{3i} \sim N(0, \theta_{33}) \\ -->
<!-- y_{4i} = \phantom{\lambda_{11} \eta_{2i} +}  \lambda_{42} \eta_{2i} + \epsilon_{4i}, \quad \epsilon_{4i} \sim N(0, \theta_{44}) \\ -->
<!-- y_{5i} = \phantom{\lambda_{11} \eta_{2i} +} \lambda_{52} \eta_{2i} + \epsilon_{5i}, \quad \epsilon_{5i} \sim N(0, \theta_{55}) \\ -->
<!-- y_{6i} = \phantom{\lambda_{11} \eta_{2i} +} \lambda_{62} \eta_{2i} + \epsilon_{6i}, \quad \epsilon_{6i} \sim N(0, \theta_{66}) \\ -->
<!-- \\ -->
<!-- \operatorname{Cov}(\epsilon_{1i},\epsilon_{4i}) = \theta_{14} \\ -->
<!-- \operatorname{Cov}(\epsilon_{2i},\epsilon_{5i}) = \theta_{25} \\ -->
<!-- \operatorname{Cov}(\epsilon_{3i},\epsilon_{6i}) = \theta_{36} \\ -->
<!-- \end{gathered} -->
<!-- $$ -->

For identifiability, we set $\lambda_{11} = \lambda_{42} = 1$.
The structural part of the model are given by these equations:

<br>
<p align="center">
  <img src="man/figures/struceqn.gif" width="30%" style="display: block; margin: auto;" />
</p>
<br>
<!-- $$ -->
<!-- \begin{gathered} -->
<!-- \eta_{1i} = \phantom{b\eta_{1i} +} \zeta_{1i}, \quad \zeta_{1i} \sim N(0, \psi_1) \\ -->
<!-- \eta_{2i} = b\eta_{1i} + \zeta_{2i}, \quad \zeta_{2i} \sim N(0, \psi_2) -->
<!-- \end{gathered} -->
<!-- $$ -->

Graphically, we can plot the following path diagram.

```{r twofacsemsetup}
#| include: false
true_model <- "
  eta1 =~ 1*y1 + 1.2*y2 + 1.5*y3
  eta2 =~ 1*y4 + 1.2*y5 + 1.5*y6
  eta2 ~ 0.3*eta1

  y1 ~~ 0.05*y4
  y2 ~~ 0.05*y5
  y3 ~~ 0.05*y6

  y1 ~~ 0.1*y1
  y2 ~~ 0.1*y2
  y3 ~~ 0.1*y3
  y4 ~~ 0.1*y4
  y5 ~~ 0.1*y5
  y6 ~~ 0.1*y6
"
dat <- lavaan::simulateData(true_model, sample.nobs = 10000)

mod <- "
  eta1 =~ y1 + y2 + y3
  eta2 =~ y4 + y5 + y6
  eta2 ~ eta1
  y1 ~~ y4
  y2 ~~ y5
  y3 ~~ y6
"
fit_lav <- sem(mod, dat)
fit_lav@ParTable$est <- true_vals <-
  c(rep(c(1, 1.2, 1.5), 2), 0.3, rep(0.05, 3), rep(0.1, 6), rep(1, 2))

p <- semPlot::semPaths(
  fit_lav, 
  whatLabels = "est",
  node.width = 1,
  edge.label.cex = 0.75,
  # style = "ram",
  mar = c(-5, -1, 5, -1)
)
```

```{r}
#| label: sempath
#| echo: false
#| message: false

indicator_order <- c(
  "y1", "y2", "y3",
  "y4", "y5", "y6"
)
indicator_factor <- c(
  "eta1", "eta1", "eta1",
  "eta2", "eta2", "eta2"
)
factor_layout <- matrix(c("eta1", "eta2"), byrow = TRUE, nrow = 1)
factor_point_to <- matrix(c("up", "up"), byrow = TRUE, nrow = 1)
p2 <- set_sem_layout(
  p,
  indicator_order = indicator_order,
  indicator_factor = indicator_factor,
  factor_layout = factor_layout,
  factor_point_to = factor_point_to
) |>
  set_curve(c(
    "y1~~y4" = 3,
    "y2~~y5" = 3,
    "y3~~y6" = 3
  )) |>
  change_node_label(list(
    list(node = "et1", to = expression(eta[1])),
    list(node = "et2", to = expression(eta[2]))
  ))
plot(p2)
```

```{r}
# {lavaan} textual model
mod <- "
  # Measurement model
  eta1 =~ y1 + y2 + y3
  eta2 =~ y4 + y5 + y6
  
  # Factor regression
  eta2 ~ eta1
  
  # Covariances
  y1 ~~ y4
  y2 ~~ y5
  y3 ~~ y6
"

# Data set
dplyr::glimpse(dat)
```


To fit this model using `{INLAvaan}`, use the familiar `{lavaan}` syntax. 
The `i` in `isem` stands for `INLA` (following the convention of `bsem` for `{blavaan}`).

```{r inlafit}
#| include: false
#| cache: true
library(INLAvaan)
fit <- isem(model = mod, data = dat)
fit_lav <- sem(mod, dat)
fit_blav <- bsem(mod, dat, sample = 2000, burnin = 1000)
fit_blavvb <- bsem(mod, dat, target = "vb", n.chains = 1, sample = 2000, burnin = 1000)
```

```{r}
#| eval: false
library(INLAvaan)
fit <- isem(model = mod, data = dat)
summary(fit)
```

```{r}
#| echo: false
summary(fit)
```

Compare model fit to `{lavaan}` and `{blavaan}` (MCMC sampling using Stan on a single thread obtaining 1000 burnin and 2000 samples, as well as variational Bayes):

```{r}
#| label: fig-compare
#| echo: false
coef_lav <- lavaan::coef(fit_lav)
coef_inla <- lavaan::coef(fit)

PE_lav <- summary(fit_lav, ci = TRUE)$pe |>
  select(est, ci.lower, ci.upper) |>
  mutate(method = "lavaan") |>
  as_tibble()
garb <- capture.output(tmp <- summary(fit))
PE_inla <- tibble(
  est = as.numeric(tmp[, "Estimate"]),
  ci.lower = as.numeric(tmp[, "pi.lower"]),
  ci.upper = as.numeric(tmp[, "pi.upper"])
) |>
  mutate(method = "INLAvaan")
garb <- capture.output(tmp <- summary(fit_blav))
PE_blav <- tibble(
  est = as.numeric(tmp[, "Estimate"]),
  ci.lower = as.numeric(tmp[, "pi.lower"]),
  ci.upper = as.numeric(tmp[, "pi.upper"])
) |>
  mutate(method = "blavaan")
garb <- capture.output(tmp <- summary(fit_blavvb))
PE_blavvb <- tibble(
  est = as.numeric(tmp[, "Estimate"]),
  ci.lower = as.numeric(tmp[, "pi.lower"]),
  ci.upper = as.numeric(tmp[, "pi.upper"])
) |>
  mutate(method = "blavaan_vb")

bind_rows(
  PE_lav, PE_inla, PE_blav, PE_blavvb
) |>
  mutate(
    truth = rep(true_vals, 4),
    free = rep(partable(fit)$free, 4),
    pxnames = rep(partable(fit)$pxnames, 4),
  ) |>
  mutate(
    type = gsub("\\[[^]]*\\]", "", pxnames),
    across(c(est, ci.lower, ci.upper), \(x) (x - truth) / truth),
  ) |>
  drop_na() |>
  mutate(names = factor(rep(names(coef(fit)), 4), levels = rev(names(coef(fit))))) |>
  ggplot(aes(est, names, color = method)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(size = 2, position = position_dodge(width = 0.5)) +
  geom_errorbarh(aes(xmin = ci.lower, xmax = ci.upper), height = 0.2, 
                 position = position_dodge(width = 0.5)) +
  scale_x_continuous(labels = scales::percent) +
  theme_bw() +
  labs(x = "Relative bias", y = NULL)

cli::cli_h2("Compare timing (seconds)")
list(fit, fit_lav, fit_blav, fit_blavvb) |>
  set_names(c("INLAvaan", "lavaan", "blavaan", "blavaan_vb")) |>
  purrr::map_dbl(\(x) x@timing$total) 
```

A little experiment to see how sample size affects run time:

```{r}
#| fig-width: 8
#| fig-height: 5
#| echo: false
#| warning: false
#| message: false
load("inst/timing.RData")
res |>
  unnest(c(res_blav, res_blavvb, res_inla), names_sep = "_") |>
  select(n, ends_with("time")) |>
  rename(
    blavaan = res_blav_time,
    blavaan_vb = res_blavvb_time,
    INLAvaan = res_inla_time
  ) |> 
  pivot_longer(
    cols = -n,
    names_to = "method",
    values_to = "time"
  ) |>
  ggplot(aes(n, time, col = method)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_continuous(labels = scales::comma) +
  theme_bw() +
  labs(x = "Sample size", y = "Run time (s)",
       title = "Total run time to fit two factor SEM with varying sample sizes",
       caption = "For MCMC sampling, 1000 burnin and 2000 samples were obtained.\nINLA ran on 6 parallel threads.")
```


## Political democracy data

The quintessential example for SEM is this model from Bollen (1989) to fit a political democracy data set.
Eleven observed variables are hypothesized to arise from three latent variables.
This set includes data from 75 developing countries each assessed on four measures of democracy measured twice (1960 and 1965), and three measures of industrialization measured once (1960). 
The eleven observed variables are:

-   `y1`: Freedom of the press, 1960
-   `y2`: Freedom of political opposition, 1960
-   `y3`: Fairness of elections, 1960
-   `y4`: Effectiveness of elected legislature, 1960
-   `y5`: Freedom of the press, 1965
-   `y6`: Freedom of political opposition, 1965
-   `y7`: Fairness of elections, 1965
-   `y8`: Effectiveness of elected legislature, 1965
-   `y9`: GNP per capita, 1960
-   `y10`: Energy consumption per capita, 1960
-   `y11`: Percentage of labor force in industry, 1960

Variables `y1-y4` and `y5-y8` are typically used as indicators of the latent trait of "political democracy" in 1960 and 1965 respectively, whereas `y9-y11` are used as indicators of industrialization (1960).
It is theorised that industrialisation influences political democracy, and that political democracy in 1960 influences political democracy in 1965.
Since the items measure the same latent trait at two time points, there is an assumption that the residuals of these items will be correlated with each other.
The model is depicted in the figure below.

```{r}
#| echo: false
knitr::include_graphics("https://lavaan.ugent.be/figures/sem.png")
```

The corresponding model in `{lavaan}` syntax is:

```{r}
mod <- "
  # latent variables
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8
  ind60 =~ x1 + x2 + x3

  # latent regressions
  dem60 ~ ind60
  dem65 ~ ind60 + dem60

  # residual covariances
  y1 ~~ y5
  y2 ~~ y4 + y6
  y3 ~~ y7
  y4 ~~ y8
  y6 ~~ y8
"
```

We will fit this model using `{INLAvaan}` and compare the results with `{blavaan}`.

```{r}
#| label: poldemfit
#| include: false
#| cache: true
data("PoliticalDemocracy", package = "lavaan")
poldemfit <- isem(mod, PoliticalDemocracy, meanstructure = !TRUE, bcontrol = list(num.threads = 6))

library(future)
plan("multisession")
poldemfit_blav <- bsem(
  model = mod, 
  data = PoliticalDemocracy,
  # meanstructure = TRUE,
  n.chains = 3,
  bcontrol = list(cores = 3)
  # burnin = 5000,
  # sample = 10000
)
```

```{r}
#| eval: false
data("PoliticalDemocracy", package = "lavaan")
poldemfit <- isem(model = mod, data = PoliticalDemocracy)
summary(poldemfit)
```

```{r}
#| echo: false
summary(poldemfit) #
```

```{r}
#| label: fig-poldem
#| echo: false
#
garb <- capture.output(tmp <- summary(poldemfit))
PE_inla <- tibble(
  est = as.numeric(tmp[, "Estimate"]),
  ci.lower = as.numeric(tmp[, "pi.lower"]),
  ci.upper = as.numeric(tmp[, "pi.upper"])
) |>
  mutate(method = "INLAvaan")
garb <- capture.output(tmp <- summary(poldemfit_blav))
PE_blav <- tibble(
  est = as.numeric(tmp[, "Estimate"]),
  ci.lower = as.numeric(tmp[, "pi.lower"]),
  ci.upper = as.numeric(tmp[, "pi.upper"])
) |>
  mutate(method = "blavaan")


bind_rows(
  PE_inla, PE_blav
) |>
  mutate(
    free = rep(partable(poldemfit)$free, 2),
    pxnames = rep(partable(poldemfit)$pxnames, 2),
    type = gsub("\\[[^]]*\\]", "", pxnames)
  ) |>
  drop_na() |>
  mutate(names = factor(rep(names(coef(poldemfit)), 2), levels = rev(names(coef(fit))))) |>
  pivot_wider(
    names_from = method,
    values_from = c(est, ci.lower, ci.upper)
  ) |>
  ggplot(aes(est_INLAvaan, est_blavaan, col = type)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_point(size = 3) +
  geom_errorbar(aes(xmin = ci.lower_INLAvaan, xmax = ci.upper_INLAvaan), width = 0.1, alpha = 0.3) +
  geom_errorbar(aes(ymin = ci.lower_blavaan, ymax = ci.upper_blavaan), width = 0.1, alpha = 0.3) +
  theme_bw() +
  labs(
    x = "{INLAvaan} estimates",
    y = "{blavaan} estimates",
    col = "Parameter\ntype",
    title = "Comparison of the estimates for the Political Democracy example",
    caption = "MCMC conducted using Stan (3 parallel chains, 500 burnin, and 1500 samples)."
  )

cli::cli_h2("Compare timing (seconds)")
list(poldemfit, poldemfit_blav) |>
  set_names(c("INLAvaan", "blavaan")) |>
  purrr::map_dbl(\(x) x@timing$total) 
```

## Outro

```{r}
sessioninfo::session_info(info = "all")
```