---
title: "Linear Models, GLS, and Seasonal Indicator Variables"
subtitle: "Chapter 5: Lesson 1"
format: html
editor: source
sidebar: false
---
```{r}
#| include: false
source("common_functions.R")
```
```{=html}
<script type="text/javascript">
function showhide(id) {
  var e = document.getElementById(id);
  e.style.display = (e.style.display == 'block') ? 'none' : 'block';
}

function openTab(evt, tabName) {
  var i, tabcontent, tablinks;
  tabcontent = document.getElementsByClassName("tabcontent");
  for (i = 0; i < tabcontent.length; i++) {
    tabcontent[i].style.display = "none";
  }
  tablinks = document.getElementsByClassName("tablinks");
  for (i = 0; i < tablinks.length; i++) {
    tablinks[i].className = tablinks[i].className.replace(" active", "");
  }
  document.getElementById(tabName).style.display = "block";
  evt.currentTarget.className += " active";
}
</script>
```
## Learning Outcomes
{{< include outcomes/_chapter_5_lesson_1_outcomes.qmd >}}
## Preparation
- Read Sections 5.1-5.5
## Learning Journal Exchange (10 min)
- Review another student's journal
- What would you add to your learning journal after reading another student's?
- What would you recommend the other student add to their learning journal?
- Sign the Learning Journal review sheet for your peer
## Small Group Activity: Linear Models (5 min)
### Definition of Linear Models
::: {.callout-note icon=false title="Linear Model for a Time Series"}
A model for a time series $\{x_t : t = 1 \ldots n \}$ is **linear** if it can be written in the form
$$
x_t = \alpha_0 + \alpha_1 u_{1,t} + \alpha_2 u_{2,t} + \alpha_3 u_{3,t} + \ldots + \alpha_m u_{m,t} + z_t
$$
where $u_{i,t}$ represents the $i^{th}$ predictor variable observed at time $t$, $z_t$ is the value of the error time series at time $t$, the values $\{\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_m\}$ are model parameters, $i = 1, 2, \ldots, m$, and $t = 1, 2, \ldots, n$.
The error terms $z_t$ have mean 0, but they do not need to follow any particular distribution or be independent.
:::
With a partner, practice determining which models are linear and which are non-linear.
<!-- Check your Understanding -->
::: {.callout-tip icon=false title="Check Your Understanding"}
- Which of the following models are linear?
- Model 1:
$$
x_t = \alpha_0 + \alpha_1 \sin(t^2-1) + \alpha_2 e^{t} + z_t
$$
- Model 2:
$$
x_t = \sqrt{ \alpha_0 + \alpha_1 t + z_t }
$$
- Model 3:
$$
x_t = \alpha_0 + \alpha_1 t^2 + \alpha_2 ( t \cdot \sqrt{t-3} ) + z_t
$$
- Model 4:
$$
x_t = \frac{ \alpha_0 + \alpha_1 t }{ 1 + \alpha_2 \sqrt{t+1} } + z_t
$$
- Model 5:
$$
x_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \alpha_3 t^3 + z_t
$$
- Model 6:
$$
x_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \alpha_1 \alpha_2 t^3 + z_t
$$
- Is there a way to transform the following non-linear models into a linear model? If so, apply the transformation.
- Model 7:
$$
x_t = \sqrt{ \alpha_0 + \alpha_1 t + z_t }
$$
- Model 8:
$$
x_t = \sin( \alpha_1 + \alpha_2 t + z_t )
$$
- Model 9:
$$
x_t = \alpha_0 \sin( \alpha_0 + \alpha_1 t + z_t )
$$
- Model 10:
$$
x_t = \alpha_0 ~ e^{ \alpha_1 t + \alpha_2 t^2 + z_t }
$$
:::
### Stationarity of Linear Models
<!-- Check your Understanding -->
::: {.callout-tip icon=false title="Check Your Understanding"}
Complete the following quotes from page 93 of the textbook:
- "Linear models for time series are non-stationary when they include functions of ______."
- ______________________ can often transform a non-stationary series with a deterministic trend to a stationary series.
:::
The book presents a time series that is modeled as a linear function of time plus white noise:
$$
x_t = \alpha_0 + \alpha_1 t + z_t
$$
If we difference these values, we get:
\begin{align*}
\nabla x_t
&= x_t - x_{t-1} \\
&= \left[ \alpha_0 + \alpha_1 t + z_t \right] - \left[ \alpha_0 + \alpha_1 (t-1) + z_{t-1} \right] \\
&= z_t - z_{t-1} + \alpha_1
\end{align*}
Since the white noise process $\{ z_t \}$ is stationary, $\{ \nabla x_t \}$ is also stationary. Differencing is a powerful tool for making a time series stationary.
In <a href="https://byuistats.github.io/timeseries/chapter_4_lesson_2.html#DifferenceOperator">Chapter 4 Lesson 2</a>, we established that taking the difference of a non-stationary time series with a stochastic trend can convert it to a stationary time series. Later in the <a href="https://byuistats.github.io/timeseries/chapter_4_lesson_2.html#RepeatedDifferences">same lesson</a>, we learn that a linear deterministic trend can be removed by taking the first difference. A quadratic deterministic trend can be removed by taking the second differences, and so on.
::: {.callout-note}
In general, if there is an $m^{th}$ degree polynomial trend in a time series, we can remove it by taking $m$ differences.
:::
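To see this in action, here is a minimal sketch (assuming the tidyverse/tsibble stack loaded by `common_functions.R`; the simulated series and its trend are illustrative, not from the text) in which two differences flatten a quadratic trend:

```{r}
#| eval: false
# Illustrative sketch: two differences remove a quadratic deterministic trend
set.seed(123)
t <- 1:100
demo <- tsibble::tsibble(
  Time = t,
  x = 5 + 2 * t + 0.3 * t^2 + rnorm(100),  # quadratic trend plus white noise
  index = Time
) |>
  dplyr::mutate(
    dx  = tsibble::difference(x),                  # first difference: a linear trend remains
    d2x = tsibble::difference(x, differences = 2)  # second difference: the trend is gone
  )

demo |> autoplot(.vars = d2x)  # hovers around a constant, like stationary noise
```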
## Class Activity: Fitting Models (15 min)
### Simulation
In Section 5.2.3, the textbook illustrates the time series
$$
x_t = 50 + 3t + z_t
$$
where $\{ z_t \}$ is the $AR(1)$ process $z_t = 0.8 z_{t-1} + w_t$ and $\{ w_t \}$ is a Gaussian white noise process with $\sigma = 20$. The code below simulates this time series and creates the resulting time plot.
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: fig-simulatedTimePlot
#| fig-cap: "Time plot of a simulated time series with a linear trend."
set.seed(1)
dat <- tibble(w = rnorm(100, sd = 20)) |>
  mutate(
    Time = 1:n(),
    z = purrr::accumulate2(
      lag(w), w,
      \(acc, nxt, w) 0.8 * acc + w,
      .init = 0)[-1],
    x = 50 + 3 * Time + z
  ) |>
  tsibble::as_tsibble(index = Time)

dat |> autoplot(.var = x) +
  theme_minimal()
```
The advantage of simulation is that when we fit a model to the data, we can assess the fit: we know exactly what was simulated.
### Model Fitted to the Simulated Data
When applying ordinary least-squares multiple regression, it is common to fit the model by minimizing the sum of squared errors:
$$
\sum r_i^2 = \sum \left( x_t - \left[ \alpha_0 + \alpha_1 u_{1,t} + \alpha_2 u_{2,t} + \ldots + \alpha_m u_{m,t} \right] \right)^2
$$
In R, we accomplish this using the `lm` function.
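For reference, here is a base-R sketch of the same least-squares fit applied to the simulated data above (the chunks below instead use `TSLM()` from the fable package, which wraps the same fit for tsibble objects):

```{r}
#| eval: false
# Base-R sketch of the ordinary least squares fit (the lesson uses TSLM below)
fit_ols <- lm(x ~ Time, data = dat)
summary(fit_ols)$coefficients  # estimates with their (naive) standard errors
```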
We assume that the simulated time series above follows the model
$$
x_t = \alpha_0 + \alpha_1 t + z_t
$$
```{r}
#| code-fold: true
#| code-summary: "Show the code"
dat_lm <- dat |>
  model(lm = TSLM(x ~ Time))

params <- tidy(dat_lm) |> pull(estimate)
stderr <- tidy(dat_lm) |> pull(std.error)
```
This gives the fitted equation
$$
\hat x_t = `r params[1] |> round(3)` + `r params[2] |> round(3)` t
$$
```{r}
#| label: set-simulation-n
#| echo: false
# Number of realizations used in the simulation study below
n <- 1000
```
The ordinary least squares method tends to underestimate the standard errors of these estimated parameters. To illustrate this, we conducted a simulation with `r n` realizations of the time series above.
The parameter estimates and their standard errors are summarized in the table below. The "Computed SE" is the standard error reported by the `lm` function in R. The "Simulated SE" is the standard deviation of the parameter estimates obtained across the `r n` simulated realizations of this time series.
```{r}
#| label: simulation-standard-errors
#| code-fold: true
#| code-summary: "Show the code"
# This code simulates the data for this example
#
# parameter_est <- data.frame(alpha0 = numeric(), alpha1 = numeric())
# for (count in 1:n) {
#   dat <- tibble(w = rnorm(100, sd = 20)) |>
#     mutate(
#       Time = 1:n(),
#       z = purrr::accumulate2(
#         lag(w), w,
#         \(acc, nxt, w) 0.8 * acc + w,
#         .init = 0)[-1],
#       x = 50 + 3 * Time + z
#     ) |>
#     tsibble::as_tsibble(index = Time)
#
#   dat_lm <- dat |>
#     model(lm = TSLM(x ~ Time))
#
#   parameters <- tidy(dat_lm) |> pull(estimate)
#   parameter_est <- parameter_est |>
#     bind_rows(data.frame(alpha0 = parameters[1], alpha1 = parameters[2]))
# }
#
# parameter_est |>
#   rio::export("data/chapter_5_lesson_1_simulation.parquet")

parameter_est <- rio::import("https://byuistats.github.io/timeseries/data/chapter_5_lesson_1_simulation.parquet")

standard_errors <- parameter_est |>
  summarize(
    alpha0 = sd(alpha0),
    alpha1 = sd(alpha1)
  )
```
| Parameter | Estimate | Computed SE | Simulated SE |
|:---------:|:--------:|:---------:|:---------:|
| $\alpha_0$ | `r params[1] |> round(3)` | `r stderr[1] |> round(3)` | `r standard_errors$alpha0 |> round(3)` |
| $\alpha_1$ | `r params[2] |> round(3)` | `r stderr[2] |> round(3)` | `r standard_errors$alpha1 |> round(3)` |
: Parameter estimates from the linear model fitted to the simulated time series. Several realizations of the time series were simulated, and the standard deviation of the estimated parameters across the simulations is compared against the standard errors obtained by ordinary least squares regression. {#tbl-simParameters}
::: {.callout-warning}
Note that the simulated standard errors are much larger than those obtained by the `lm` function in R. The standard errors reported by R will lead to unduly small confidence intervals. *This illustrates the problem of applying the standard least-squares estimates to time series data.*
:::
After fitting a regression model, it is appropriate to review the relevant diagnostic plots. Here are the correlogram and partial correlogram of the residuals.
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: fig-simulatedACFpacf
#| fig-cap: "Autocorrelation function and partial autorcorrelation function of the residuals from the fitted linear model for the simulated data."
#| warning: false
#| fig-asp: 0.5
acf_plot <- residuals(dat_lm) |> feasts::ACF() |> autoplot()
pacf_plot <- residuals(dat_lm) |> feasts::PACF() |> autoplot()
acf_plot | pacf_plot
```
<a id="FittingRegModelWithPACF">Recall</a>
that in our simulation, the residuals were modeled by an $AR(1)$ process. So, it is not surprising that the residuals are correlated and that the partial autocorrelation function is only significant for the first lag.
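As a quick base-R illustration of this pattern (a sketch using freshly simulated data, not the series above), the PACF of an $AR(1)$ process cuts off after lag 1:

```{r}
#| eval: false
# Sketch: the PACF of a simulated AR(1) process is significant only at lag 1
set.seed(42)
ar1 <- arima.sim(model = list(ar = 0.8), n = 200)
pacf(ar1)  # only the first spike should stand out clearly
```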
### Fitting a Regression Model to the Global Temperature Time Series
```{r}
#| echo: false
temps_ts <- rio::import("https://byuistats.github.io/timeseries/data/global_temparature.csv") |>
  as_tsibble(index = year)
```
In <a href="https://byuistats.github.io/timeseries/chapter_4_lesson_2.html#GlobalWarming">Chapter 4 Lesson 2</a>, we fit AR models to data representing the change in the Earth's mean annual temperature from `r min(temps_ts$year)` to `r max(temps_ts$year)`. These values represent the deviation of the mean global surface temperature from the long-term average from 1951 to 1980. (Source: NASA/GISS.) We will consider the portion of the time series beginning in 1970.
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: fig-globalTempTimePlot
#| fig-cap: "Time plot of the mean annual temperature globally for select years."
temps_ts <- rio::import("https://byuistats.github.io/timeseries/data/global_temparature.csv") |>
  as_tsibble(index = year) |>
  filter(year >= 1970)

temps_ts |> autoplot(.vars = change) +
  labs(
    x = "Year",
    y = "Temperature Change (Celsius)",
    title = paste0("Change in Mean Annual Global Temperature (", min(temps_ts$year), "-", max(temps_ts$year), ")")
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  ) +
  geom_smooth(method = 'lm', formula = y ~ x)
```
Visually, it is easy to spot a positive trend in these data. Nevertheless, the ordinary least squares technique underestimates the standard errors of the intercept and slope terms. This can lead to erroneous conclusions that a linear relationship is statistically significant when it is not.
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: tbl-globalTempsEstParams
#| tbl-cap: "Parameter estimates for the fitted global temperature time series."
global <- tibble(x = scan("https://byuistats.github.io/timeseries/data/global.dat")) |>
  mutate(
    date = seq(
      ymd("1856-01-01"),
      by = "1 months",
      length.out = n()),
    year = year(date),
    year_month = tsibble::yearmonth(date),
    stats_time = year + c(0:11) / 12)

global_ts <- global |>
  as_tsibble(index = year_month)

temp_tidy <- global_ts |> filter(year >= 1970)

temp_lm <- temp_tidy |>
  model(lm = TSLM(x ~ stats_time))

# tidy(temp_lm) |> pull(estimate)

tidy(temp_lm) |>
  mutate(
    lower = estimate + qnorm(0.025) * std.error,
    upper = estimate + qnorm(0.975) * std.error
  ) |>
  display_table()
```
The 95% confidence interval for the slope does not contain zero. If the errors were independent, this would be conclusive evidence that there is a significant linear relationship between the year and the global temperature.
Note, however, that there is significant positive autocorrelation in the residual series at short lags. This implies that the standard errors are underestimated and that the confidence interval is inappropriately narrow.
Here are the values of the autocorrelation function of the residuals:
```{r}
#| echo: false
#| warning: false
#| label: tbl-globalTempsResidACF
#| tbl-cap: "Autocorrelation function of the residuals for the global temperature time series."
acf_lag1 <- residuals(temp_lm) |>
  feasts::ACF(var = .resid) |>
  as_tibble() |>
  head(1) |>
  dplyr::select(acf) |>
  pull()

residuals(temp_lm) |>
  feasts::ACF(var = .resid) |>
  mutate(acf = round(acf, 3)) |>
  as_tibble() |>
  select(-.model) |>
  head(10) |>
  pivot_wider(names_from = lag, values_from = acf) |>
  display_table("0.5in")
```
The autocorrelation for $k=1$ is `r acf_lag1 |> round(3)`. We will use this value when we apply the Generalized Least Squares method in the next section.
To see intuitively why ordinary least squares underestimates the standard errors of the estimates: positive autocorrelation means that similar observations tend to be clumped together, so the series has an effective record length that is shorter than the number of observations, and the overall variation is understated.
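One back-of-the-envelope way to quantify this idea (stated here as an illustration for estimating the mean of an $AR(1)$ process, not as the textbook's formula) is the effective sample size $n_{\text{eff}} \approx n \, \frac{1-\rho}{1+\rho}$, where $\rho$ is the lag-1 autocorrelation:

```{r}
#| eval: false
# Illustrative effective sample size under lag-1 autocorrelation rho
n_obs <- 100
rho   <- 0.706  # lag-1 ACF of the residuals, from the table above
n_eff <- n_obs * (1 - rho) / (1 + rho)
n_eff           # about 17 independent observations' worth of information
```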
### Applying Generalized Least Squares, GLS
The autocorrelation in the data makes ordinary least squares estimation inappropriate. What caped superhero comes to our rescue? None other than Captain GLS -- the indomitable Generalized Least Squares algorithm!
This fitting procedure handles the autocorrelation by maximizing the likelihood of the data, taking into account the autocorrelation in the residuals.
This leads to much more appropriate standard errors for the parameter estimates. We will pass the value of the ACF at lag $k=1$ into the regression function.
```{r}
#| output: false
# Load additional packages
pacman::p_load(tidymodels, multilevelmod, nlme, broom.mixed)

temp_spec <- linear_reg() |>
  set_engine("gls", correlation = nlme::corAR1(0.706))

temp_gls <- temp_spec |>
  fit(x ~ stats_time, data = global_ts)

tidy(temp_gls) |>
  mutate(
    lower = estimate + qnorm(0.025) * std.error,
    upper = estimate + qnorm(0.975) * std.error
  )
```
```{r}
#| echo: false
#| label: tbl-globalTempsGLS
#| tbl-cap: "Parameter estimates for the Generalized Least Squares estimates of the fit for the global temperature time series."
tidy(temp_gls) |>
  mutate(
    lower = estimate + qnorm(0.025) * std.error,
    upper = estimate + qnorm(0.975) * std.error
  ) |>
  display_table()
```
As we observed visually, there is evidence to suggest that there is a linear trend in the data.
## Class Activity: Linear Models with Seasonal Variables (10 min)
Some time series involve seasonal variables. In this section, we will address one way to include seasonality in a regression analysis of a time series. In the next lesson, we will explore another method.
### Additive Seasonal Indicator Variables
```{r}
#| echo: false
#| warning: false
#| results: asis
library(ggrepel)

# Number of years to show data
number_of_years <- 3

chocolate_month <- rio::import("https://byuistats.github.io/timeseries/data/chocolate.csv") |>
  mutate(
    dates = yearmonth(ym(Month)),
    month = month(dates),
    year = year(dates)
  )

last_full_year <- chocolate_month |>
  group_by(year) |>
  tail(-3) |>
  summarize(n = n()) |>
  filter(n == 12) |>
  ungroup() |>
  summarize(year = max(year)) |>
  select(year) |>
  pull()

first_year <- last_full_year - number_of_years + 1

chocolate_ts <- chocolate_month |>
  filter(year <= last_full_year & year >= first_year) |>
  mutate(month_seq = 1:n())
```
Recall the time series representing the monthly relative number of Google searches for the term "chocolate." Here is a time plot of the data from `r min(chocolate_ts$year)` to `r max(chocolate_ts$year)`.
Month 1 is `r chocolate_ts |> filter(month_seq == 1) |> dplyr::select(dates) |> pull() |> format("%B %Y")`,
month 13 is `r chocolate_ts |> filter(month_seq == 13) |> dplyr::select(dates) |> pull() |> format("%B %Y")`,
and
month 25 is `r chocolate_ts |> filter(month_seq == 25) |> dplyr::select(dates) |> pull() |> format("%B %Y")`.
```{r}
#| echo: false
#| warning: false
#| label: fig-chocolatetimeplot
#| fig-cap: "Time plot of a portion of the chocolate search data."
chocolate_scatter_plot <- ggplot(chocolate_ts, aes(month_seq, chocolate)) +
  # data points
  geom_point(color = "#56B4E9", size = 2) +
  # x-axis
  geom_segment(x = 0, xend = number_of_years * 12 + 0.75, y = 0, yend = 0,
               arrow = arrow(length = unit(0.25, "cm"))) +
  geom_text(x = number_of_years * 12 + 0.75, y = 5, label = "x") +
  scale_x_continuous(breaks = 1:(number_of_years * 12)) +
  # y-axis
  geom_segment(x = 0, xend = 0, y = 0, yend = 100,
               arrow = arrow(length = unit(0.25, "cm"))) +
  geom_text(x = 0.5, y = 102, label = "y") +
  ylim(0, 102) +
  # labels and formatting
  labs(
    x = "Month",
    y = "Searches",
    title = paste0(
      "Relative Number of Google Searches for 'Chocolate' (",
      min(chocolate_ts$year),
      "-",
      max(chocolate_ts$year),
      ")")
  ) +
  theme_minimal() +
  theme(panel.grid.minor.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))

# Fit regression model
model <- lm(chocolate ~ month_seq, data = chocolate_ts)

# Create data frame for the regression line
reg_line_df <- data.frame(month_seq = c(0, number_of_years * 12 + 1)) |>
  mutate(chocolate = coef(model)[1] + coef(model)[2] * month_seq)

# Create data frame with line that goes through the origin
# and is parallel to the regression line
zero_int_reg_line_df <- data.frame(month_seq = c(0, number_of_years * 12 + 1)) |>
  mutate(chocolate = coef(model)[2] * month_seq)

arrow_df0 <- chocolate_ts |>
  mutate(
    arrow_start = coef(model)[2] * month_seq,
    deviation = chocolate - arrow_start
  )

arrow_df <- arrow_df0 |>
  group_by(month) |>
  summarize(
    mean_deviation = mean(deviation),
    .groups = 'drop'
  ) |>
  right_join(arrow_df0, by = "month") |>
  arrange(dates)

# Display basic scatterplot
chocolate_scatter_plot
```
We fit a regression line to these data.
```{r}
#| echo: false
#| warning: false
#| label: fig-chocolateaddreg
#| fig-cap: "The regression line is superimposed on the chocolate search time series."
chocolate_scatter_plot +
  # regression line
  geom_smooth(method = 'lm', formula = y ~ x, linetype = "dashed", color = "#F0E442", se = FALSE)
```
Now, we draw a line that is parallel to the regression line (has the same slope) but has a Y-intercept of 0.
```{r}
#| echo: false
#| warning: false
#| label: fig-chocolateaddparallel
#| fig-cap: "A line parallel to the regression line that passes through the origin is added to the chocolate search time plot."
chocolate_scatter_plot +
  # regression line
  geom_line(data = reg_line_df, linetype = "dashed", linewidth = 1, color = "#F0E442") +
  # line through the origin, parallel to the regression line
  geom_line(data = zero_int_reg_line_df, linewidth = 1, color = "#D55E00")
```
For each month, we find the average amount by which the relative number of Google searches in that month deviates from the orange line (which is parallel to the regression line and passes through the origin). Consequently, the length of the green arrow is the same for every January, the same for every February, and so on.
```{r}
#| echo: false
#| warning: false
#| label: fig-chocolatelinearModelWithSeasonalIndicatorVariables
#| fig-cap: "Representation of a linear model with seasonal indicator variables for the chocolate search time series data."
chocolate_scatter_plot +
  # regression line
  geom_line(data = reg_line_df, linetype = "dashed", linewidth = 1, color = "#F0E442") +
  # line through the origin, parallel to the regression line
  geom_line(data = zero_int_reg_line_df, linewidth = 1, color = "#D55E00") +
  # vertical arrows
  geom_segment(aes(x = month_seq, xend = month_seq,
                   y = arrow_df$arrow_start, yend = arrow_df$arrow_start + arrow_df$mean_deviation),
               arrow = arrow(length = unit(0.25, "cm")),
               color = "#009E73")
```
When the bottom of a green arrow is placed on the orange line, the tip of the arrow gives the estimated value of the time series for that month.
We will create a linear model that includes a constant term for each month. This constant monthly term is called a **seasonal indicator variable**. The name reflects the fact that each variable indicates (as 1 or 0) whether a given month is represented. For example, one of the seasonal indicator variables will represent January: it equals 1 for any value of $t$ corresponding to an observation drawn in January and 0 otherwise. Indicator variables are also called **dummy variables**.
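As a small sketch of what R does under the hood, a factor expands into one 0/1 indicator column per level when the intercept is suppressed:

```{r}
#| eval: false
# Sketch: a month factor expands into 0/1 indicator (dummy) columns
months <- factor(c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
                 levels = c("Jan", "Feb", "Mar"))
model.matrix(~ 0 + months)  # one indicator column per month; no intercept column
```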
This additive model with seasonal indicator variables can be perceived similarly to other additive models with a seasonal component:
$$
x_t = m_t + s_t + z_t
$$
where
$$
s_t =
\begin{cases}
\beta_1, & t ~\text{falls in season}~ 1 \\
\beta_2, & t ~\text{falls in season}~ 2 \\
⋮~~~~ & ~~~~~~~~~~~~⋮ \\
\beta_s, & t ~\text{falls in season}~ s
\end{cases}
$$
Here, $s$ is the number of seasons in one cycle (so $i = 1, 2, \ldots, s$), $n$ is the number of observations (so $t = 1, 2, \ldots, n$), and $z_t$ is the residual error series, which may be autocorrelated.
It is important to note that $m_t$ does not need to be a constant. It can be a linear trend:
$$
m_t = \alpha_0 + \alpha_1 t
$$
or quadratic:
$$
m_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2
$$
or, more generally, a polynomial of degree $p$:
$$
m_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \cdots + \alpha_p t^p
$$
or any other function of $t$.
If $s_t$ has the same value for all corresponding seasons, then we can write the model as:
$$
x_t = m_t + \beta_{1 + [(t-1) \mod s]} + z_t
$$
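A one-line sketch confirms that the subscript $1 + [(t-1) \bmod s]$ maps each time index to its season:

```{r}
#| eval: false
# Sketch: mapping the time index t to its season via 1 + (t - 1) mod s, with s = 12
t <- 1:26
season <- 1 + (t - 1) %% 12
data.frame(t, season) |> head(14)  # t = 13 wraps back to season 1 (January)
```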
Putting this all together, if we have a time series that has a linear trend and monthly observations where $t=1$ corresponds to January, then the model becomes:
\begin{align*}
x_t
&= ~~~~ \alpha_1 t + s_t + z_t \\
&=
\begin{cases}
\alpha_1 t + \beta_1 + z_t, & t = 1, 13, 25, \ldots ~~~~ ~~(January) \\
\alpha_1 t + \beta_2 + z_t, & t = 2, 14, 26, \ldots ~~~~ ~~(February) \\
~~~~~~~~⋮ & ~~~~~~~~~~~~⋮ \\
\alpha_1 t + \beta_{12} + z_t, & t = 12, 24, 36, \ldots ~~~~ (December)
\end{cases}
\end{align*}
This is the model illustrated in @fig-chocolatelinearModelWithSeasonalIndicatorVariables. The orange line represents the term $\alpha_1 t$ and the green arrows represent the values of $\beta_1, ~ \beta_2, ~ \ldots, ~ \beta_{12}$.
The folded chunk of code below fits the model to the data and computes the estimated parameter values.
```{r}
#| code-fold: true
#| code-summary: "Show the code"
chocolate_month <- rio::import("https://byuistats.github.io/timeseries/data/chocolate.csv") |>
  mutate(
    dates = yearmonth(ym(Month)),
    month = month(dates),
    year = year(dates),
    stats_time = year + (month - 1) / 12,
    month_seq = 1:n()
  ) |>
  mutate(month = factor(month)) |>
  as_tsibble(index = dates)

# Fit regression model
chocolate_lm <- chocolate_month |>
  model(TSLM(chocolate ~ 0 + stats_time + month))

# Estimated parameter values
param_est <- chocolate_lm |>
  tidy() |>
  pull(estimate)
```
```{r}
#| echo: false
alpha1 <- param_est |> head(1)
betas <- param_est |> tail(-1)

betas_df <- tibble(
  month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
  betas = betas
)
```
The estimated value of $\alpha_1$ is `r alpha1 |> round(3)`. The estimated values for the $\beta_i$ parameters are:
```{r}
#| echo: false
#| label: tbl-chocolateParameterEstIndicator
#| tbl-cap: "Estimated values of $\beta_i$ for the fitted model for chocolate search time series."
first_half <- betas_df |>
  head(6) |>
  pivot_wider(names_from = month, values_from = betas)

last_half <- betas_df |>
  tail(6) |>
  pivot_wider(names_from = month, values_from = betas)

first_half |>
  display_table()

last_half |>
  display_table()
```
Now, we compute forecasted (future) values of the relative number of Google searches for the word "chocolate."
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: fig-chocForecastPlot
#| fig-cap: "Forecasted values of the relative number of Google searches for the word 'chocolate.'"
num_years_to_forecast <- 5

new_dat <- chocolate_month |>
  as_tibble() |>
  tail(num_years_to_forecast * 12) |>
  dplyr::select(stats_time, month) |>
  mutate(
    stats_time = stats_time + num_years_to_forecast,
    alpha = tidy(chocolate_lm) |> slice(1) |> pull(estimate),
    beta = rep(tidy(chocolate_lm) |> slice(2:13) |> pull(estimate), num_years_to_forecast)
  )

chocolate_forecast <- chocolate_lm |>
  forecast(new_data = as_tsibble(new_dat, index = stats_time))

chocolate_forecast |>
  autoplot(chocolate_month, level = 95) +
  labs(
    x = "Month",
    y = "Relative Count of Google Searches",
    title = paste0("Google Searches for 'Chocolate' (", min(chocolate_month$year), "-", max(chocolate_month$year), ")"),
    subtitle = paste0(num_years_to_forecast, "-Year Forecast Based on a Linear Model with Seasonal Indicator Variables")
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
```
<!-- Beginning of two columns -->
::: columns
::: {.column width="45%"}
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| label: tbl-chocForecastValues
#| tbl-cap: "Select forecasted values of the relative count of Google searches for 'chocolate'."
new_dat |>
  mutate(estimate = alpha * stats_time + beta) |>
  display_partial_table(nrow_head = 4, nrow_tail = 3, decimals = 3)
```
:::
::: {.column width="10%"}
<!-- empty column to create gap -->
:::
::: {.column width="45%"}
@tbl-chocForecastValues gives a few of the forecasted values illustrated in @fig-chocForecastPlot.
:::
:::
<!-- End of two columns -->
## Small Group Activity: Application to Retail Sales Data (15 min)
Recall the monthly U.S. retail sales data. We will consider the total retail sales for women's clothing for 2004-2006. You can use the code below to download the data into a tsibble.
```{r}
# Read in Women's Clothing Retail Sales data
retail_ts <- rio::import("https://byuistats.github.io/timeseries/data/retail_by_business_type.parquet") |>
  filter(naics == 44812) |>
  filter(month >= yearmonth(my("Jan 2004")) & month <= yearmonth(my("Dec 2006"))) |>
  mutate(month_seq = 1:n()) |>
  mutate(year = year(month)) |>
  mutate(month_num = month(month)) |>
  as_tsibble(index = month)
```
```{r}
#| echo: false
#| label: fig-retailWomensClothingTimePlotWithRegressionLine
#| fig-cap: "Time plot of the aggregated monthly sales for women's clothing stores in the United States."
# Number of years of data we are considering...it would be better if this was computed from the data
number_of_years <- 3

retail_scatter_plot <- ggplot(retail_ts, aes(month_seq, sales_millions)) +
  # data points
  geom_point(color = "#56B4E9", size = 2) +
  # x-axis
  geom_segment(x = 0, xend = number_of_years * 12 + 0.75, y = 0, yend = 0,
               arrow = arrow(length = unit(0.25, "cm"))) +
  geom_text(x = number_of_years * 12 + 0.75, y = 200, label = "x") +
  scale_x_continuous(breaks = 1:(number_of_years * 12)) +
  # y-axis
  geom_segment(x = 0, xend = 0, y = 0, yend = 5400,
               arrow = arrow(length = unit(0.25, "cm"))) +
  geom_text(x = 0.5, y = 5500, label = "y") +
  ylim(0, 5500) +
  # labels and formatting
  labs(
    x = "Month, t",
    y = "Sales (in Millions of U.S. Dollars)",
    title = paste0(
      "Retail Sales: Women's Clothing in Millions (",
      min(retail_ts$year),
      "-",
      max(retail_ts$year),
      ")")
  ) +
  theme_minimal() +
  theme(panel.grid.minor.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))

# Fit regression model
model <- lm(sales_millions ~ month_seq, data = retail_ts)

# Create data frame for the regression line
reg_line_df <- data.frame(month_seq = c(0, number_of_years * 12 + 1)) |>
  mutate(sales_millions = coef(model)[1] + coef(model)[2] * month_seq)

# Create data frame with line that goes through the origin
# and is parallel to the regression line
zero_int_reg_line_df <- data.frame(month_seq = c(0, number_of_years * 12 + 1)) |>
  mutate(sales_millions = coef(model)[2] * month_seq)

arrow_df0 <- retail_ts |>
  mutate(
    arrow_start = coef(model)[2] * month_seq,
    deviation = sales_millions - arrow_start
  ) |>
  as_tibble()

arrow_df <- arrow_df0 |>
  group_by(month_num) |>
  summarize(
    mean_deviation = mean(deviation),
    .groups = 'drop'
  ) |>
  right_join(arrow_df0, by = "month_num") |>
  arrange(month)

# Display scatter plot with regression line
retail_scatter_plot +
  # regression line
  geom_line(data = reg_line_df, linetype = "dashed", linewidth = 1, color = "#F0E442")

# # line through the origin, parallel to regression line
# geom_line(data = zero_int_reg_line_df, linewidth = 1, color = "#D55E00") +
# # Vertical arrows
# geom_segment(aes(x = month_seq, xend = month_seq,
#                  y = arrow_df$arrow_start, yend = arrow_df$arrow_start + arrow_df$mean_deviation),
#              arrow = arrow(length = unit(0.25, "cm")),
#              color = "#009E73")

y_intercept_retail <- coef(model)[1] |> round(0)
slope_retail <- coef(model)[2] |> round(0)
```
We define the model as:
\begin{align*}
x_t
&= ~~~~ \alpha_1 t + s_t + z_t \\
&=
\begin{cases}
\alpha_1 t + \beta_1 + z_t, & t = 1, 13, 25 ~~~~ ~~(January) \\
\alpha_1 t + \beta_2 + z_t, & t = 2, 14, 26 ~~~~ ~~(February) \\
~~~~~~~~⋮ & ~~~~~~~~~~~~⋮ \\
\alpha_1 t + \beta_{12} + z_t, & t = 12, 24, 36 ~~~~ (December)
\end{cases}
\end{align*}
When we regress the women's clothing retail sales in millions, $x_t$, on the month number, $t$, the estimated simple linear regression equation is:
$$
\hat x_t = `r y_intercept_retail` + `r slope_retail` ~ t
$$
where $t = 1, 2, 3, \ldots, `r number_of_years * 12 - 1`, `r number_of_years * 12`$.
<!-- Check Your Understanding -->
::: {.callout-tip icon=false title="Check Your Understanding"}
Use the estimated regression equation
$$
\hat x_t = `r y_intercept_retail` + `r slope_retail` ~ t
$$
to do the following:
- Find the equation for the line parallel to the regression line that passes through the origin.
- Compute the deviation of each observed point from the line you just obtained.
- Compute the value of $\hat \beta_i$ for each month, where $i = 1, 2, \ldots, 12$.
- Compute the estimate of the time series for $t = 1, 2, \ldots, 36$.
- Predict the value of the time series for the next six months.
:::
::: {.callout-tip appearance="minimal"}
```{r}
#| results: asis
#| echo: false
#| label: tbl-retailWomensClothingTable
#| tbl-cap: "Table used to compute the estimate of the women's clothing retail sales time series using a linear model with seasonal indicator variables."
retail_df1 <- retail_ts |>
  as_tibble() |>
  mutate(month_text = format(month, "%b %Y")) |>
  rename(t = month_seq) |>
  dplyr::select(month_text, month_num, t, sales_millions) |>
  mutate(
    alpha1_t = slope_retail * t,
    deviation = sales_millions - alpha1_t
  )

retail_df2 <- retail_df1 |>
  group_by(month_num) |>
  summarize(
    mean_deviation = mean(deviation) |> round(2),
    .groups = 'drop'
  ) |>
  right_join(retail_df1, by = "month_num") |>
  arrange(t)

retail_pred_df <- rio::import("https://byuistats.github.io/timeseries/data/retail_by_business_type.parquet") |>
  filter(naics == 44812) |>
  filter(month >= yearmonth(my("Jan 2007")) & month <= yearmonth(my("Jun 2007"))) |>
  mutate(month_seq = (1:n()) + nrow(retail_df1)) |>
  mutate(year = year(month)) |>
  mutate(month_num = month(month)) |>
  mutate(month_text = format(month, "%b %Y")) |>
  as_tibble() |>
  rename(t = month_seq) |>
  dplyr::select(month_text, month_num, t, sales_millions) |>
  mutate(