-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
264 lines (211 loc) · 8.83 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Mixture of Markov Chains with Support of Higher Orders and Multiple Sequences
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![CRAN status](https://www.r-pkg.org/badges/version/markovmix)](https://CRAN.R-project.org/package=markovmix)
[![R-CMD-check](https://github.com/zhuxr11/markovmix/workflows/R-CMD-check/badge.svg)](https://github.com/zhuxr11/markovmix/actions)
[![Codecov test coverage](https://codecov.io/gh/zhuxr11/markovmix/branch/master/graph/badge.svg)](https://app.codecov.io/gh/zhuxr11/markovmix?branch=master)
[![Download stats](https://cranlogs.r-pkg.org/badges/grand-total/markovmix)](https://CRAN.R-project.org/package=markovmix)
<!-- badges: end -->
**Package**: [*markovmix*](https://github.com/zhuxr11/markovmix) `r pkgload::pkg_version()`<br />
**Author**: Xiurui Zhu<br />
**Modified**: `r file.info("README.Rmd")$mtime`<br />
**Compiled**: `r Sys.time()`
The goal of `markovmix` is to fit mixture of Markov chains of higher orders from multiple sequences.
It is also compatible with ordinary 1-component, 1-order or single-sequence Markov chains.
Various utility functions are provided to derive transition patterns, transition probabilities per component
and component priors. In addition, `print()`, `predict()` and component extracting/replacing methods
are also defined as a convention of mixture models.
## Installation
You can install the released version of `markovmix` from [CRAN](https://cran.r-project.org/) with:
``` r
install.packages("markovmix")
```
Alternatively, you can install the developmental version of `markovmix` from [github](https://github.com/) with:
``` r
remotes::install_github("zhuxr11/markovmix")
```
## Examples of fitting (mixture of) Markov chains
Before start, we generate some sequences with a finite set of states, just as those to fit a Markov chain.
```{r library, eval=FALSE}
library(markovmix)
```
```{r, include=FALSE}
pkgload::load_all()
```
```{r data-prep}
library(purrr)
# Define Markov states
mk_states <- LETTERS[seq_len(4L)]
# Define number of sequences
mk_size <- 100L
# Define sequence length range
mk_len_range <- seq_len(10L)
# Define a helper function to generate sequence list
gen_seq_list <- function(size, len_range, states, seed = NULL) {
if (is.null(seed) == FALSE) {
set.seed(seed)
}
purrr::map(seq_len(size), ~ sample(states, sample(len_range, 1L), replace = TRUE))
}
mk_seq_list <- gen_seq_list(size = mk_size, len_range = mk_len_range, states = mk_states, seed = 1111L)
head(mk_seq_list)
```
First, we fit a 1-order Markov chain, the most commonly used type.
```{r}
mk_fit <- fit_markov_mix(seq_list = mk_seq_list, states = mk_states)
print(mk_fit)
```
Then, we fit a 2-order Markov chain.
```{r}
mk_fit2 <- fit_markov_mix(seq_list = mk_seq_list, order. = 2L, states = mk_states)
print(mk_fit2)
```
Finally, we fit a 3-component mixture of 2-order Markov chains, with randomly generated soft clustering probabilities for each sequence.
```{r}
# Define number of components
mk_n_comp <- 3L
set.seed(2222L)
mk_clusters <- matrix(runif(length(mk_seq_list) * mk_n_comp), nrow = length(mk_seq_list), ncol = mk_n_comp)
mk_mix_fit2 <- fit_markov_mix(seq_list = mk_seq_list, order. = 2L, states = mk_states, clusters = mk_clusters)
print(mk_mix_fit2)
```
## Examples of predicting (mixture of) Markov chains
Before start, we generate some sequences for prediction.
```{r data-prep-pred}
mk_pred_seq_list <- gen_seq_list(size = 50L, len_range = mk_len_range, states = mk_states, seed = 3333L)
```
We predict the probabilities of each sequence in the generated mixture of Markov chains.
```{r}
mk_pred_res <- predict(mk_mix_fit2, newdata = mk_pred_seq_list)
print(mk_pred_res)
```
Please note that `NA` values are assigned to the probabilities of sequences if no valid sub-sequences (such as those of length `r mk_mix_fit2[["order"]] + 1` or longer).
```{r}
print(purrr::map_int(mk_pred_seq_list[is.na(mk_pred_res) == TRUE], length))
```
To predict the sequences in each component of the mixture of Markov chains, we can use `aggregate. = FALSE`.
```{r}
mk_pred_res_comp <- predict(mk_mix_fit2, newdata = mk_pred_seq_list, aggregate. = FALSE)
print(head(mk_pred_res_comp, n = 10L))
```
## Examples of utility functions
We can use utility functions to derive state transition patterns, transition probabilities per component and component priors. The latter two are handy in predicting probabilities of new sequences.
```{r}
# Derive state transition patterns
print(get_states_mat(mk_mix_fit2), max = 20L)
# Derive probabilities per component
print(get_prob(mk_mix_fit2), max = 20L)
# Derive component priors
print(get_prior(mk_mix_fit2))
```
We can also extract and replace components with new probabilities of observing these states.
```{r}
# Extract 1 component
print(mk_mix_fit2[2L], print_max = 6L, print_min = 6L)
# Extract multiple components
print(mk_mix_fit2[c(1L, 3L)], print_max = 6L, print_min = 6L)
# Replace 1 component with random probabilities
nrow_value <- length(get_states(object = mk_mix_fit2, check = FALSE))^
(get_order(object = mk_mix_fit2, check = FALSE) + 1L)
mk_mix_fit3 <- mk_mix_fit2
mk_mix_fit3[2L] <- runif(nrow_value)
print(mk_mix_fit3, print_max = 6L, print_min = 6L)
# Replace multiple components with random probabilities
mk_mix_fit4 <- mk_mix_fit2
mk_mix_fit4[c(1L, 3L)] <- matrix(runif(nrow_value * 2L), ncol = 2L)
print(mk_mix_fit4, print_max = 6L, print_min = 6L)
# Replace multiple components with a single probability (recycled)
mk_mix_fit5 <- mk_mix_fit2
mk_mix_fit5[1L:2L] <- 0.5
print(mk_mix_fit5, print_max = 6L, print_min = 6L)
```
We can also reorganize states with `restate()` by changing their order, renaming them or merging them.
```{r}
# Reverse states (using function)
print(restate(.object = mk_mix_fit2, .fun = forcats::fct_rev), print_max = 6L, print_min = 6L)
# Reorder states by hand (using function name with additional arguments)
print(restate(
.object = mk_mix_fit2,
.fun = "levels<-",
value = c("B", "D", "C", "A")
), print_max = 6L, print_min = 6L)
# Rename state B with E (using anonymous function)
print(restate(
.object = mk_mix_fit2,
.fun = function(x) forcats::fct_recode(x, "E" = "B")
), print_max = 6L, print_min = 6L)
# Merge state C into D (using purrr-style lambda function)
print(restate(
.object = mk_mix_fit2,
.fun = ~ forcats::fct_recode(.x, "D" = "C")
), print_max = 6L, print_min = 6L)
```
## Time of computation
The core functions of `markovmix` are written with `Rcpp` to efficiently process sequences and probabilities. Here, time of computation (ToC) is profiled with the number of sequences, where dots correspond to ToC medians and error bars to ToC quantiles.
```{r toc, cache=TRUE}
# Define sequence lengths for ToC profiling
mk_toc_seq_len <- as.integer(outer(c(1L, 2L, 5L), 10^(seq_len(5L) - 1L)))
set.seed(4444L)
mk_toc_seq_list <- purrr::map(
mk_toc_seq_len,
~ gen_seq_list(size = .x, len_range = mk_len_range, states = mk_states)
)
mk_toc_clusters <- purrr::map(
mk_toc_seq_len,
~ matrix(runif(.x * mk_n_comp), nrow = .x, ncol = mk_n_comp)
)
mk_toc_exprs <- purrr::map(
purrr::set_names(seq_along(mk_toc_seq_len), mk_toc_seq_len),
~ rlang::expr(fit_markov_mix(
seq_list = mk_toc_seq_list[[!!.x]],
order. = 2L,
states = mk_states,
clusters = mk_toc_clusters[[!!.x]],
verbose = FALSE
))
)
# Profile time of computation
mk_toc <- bench::mark(
exprs = mk_toc_exprs,
min_time = Inf,
iterations = 5L,
filter_gc = FALSE,
check = FALSE
)
```
```{r toc-plots}
mk_toc_plot_data <- mk_toc %>%
tibble::as_tibble() %>%
dplyr::select(expression, time) %>%
dplyr::rename(seq_len = expression, toc = time) %>%
dplyr::mutate_at("seq_len", ~ as.integer(names(.x))) %>%
dplyr::mutate_at("toc", ~ purrr::map(.x, unclass)) %>%
tidyr::unnest("toc")
ggplot2::ggplot(mk_toc_plot_data %>%
dplyr::group_by(seq_len) %>%
dplyr::summarize(toc_median = median(toc, na.rm = TRUE),
toc_min = quantile(toc, 0.25, na.rm = TRUE),
toc_max = quantile(toc, 0.75, na.rm = TRUE),
.groups = "drop"),
ggplot2::aes(x = seq_len, y = toc_median)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::geom_errorbar(ggplot2::aes(ymin = toc_min, ymax = toc_max)) +
ggplot2::scale_x_log10(breaks = mk_toc_seq_len) +
ggplot2::scale_y_log10() +
ggplot2::theme_bw() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1, vjust = 1)) +
ggplot2::labs(x = "Number of sequences", y = "Time of computation (s)")
```