\documentclass[9pt,twocolumn,twoside]{gsajnl}
\articletype{inv} % article type
% \usepackage[round]{natbib}
\usepackage{amsmath,amssymb,amsfonts}%
\usepackage{graphicx}
\usepackage{authblk}
\usepackage{tikz}
\usetikzlibrary{math,calc,positioning}
\usepackage{url}
% Uncomment to extract tikz pictures
% \usetikzlibrary{external}
% \tikzexternalize
% Packages from msprime 1.0 paper
% \usepackage{amsmath,amssymb}%
% \usepackage{graphicx}
% \usepackage{authblk}
% \usepackage{dsfont}
% \usepackage{environ}%
% \usepackage{caption}
% \usepackage{subcaption}
% \usepackage{tikz}
% \usetikzlibrary{calc,positioning}
% \usepackage{booktabs}
\newcommand{\noderef}[1]{\textit{#1}}
\newcommand{\tsinfer}[0]{\texttt{tsinfer}}
\newcommand{\kwarg}[0]{\texttt{KwARG}}
\newcommand{\argweaver}[0]{\texttt{ARGweaver}}
\newcommand{\argweaverD}[0]{\texttt{ARGweaver-D}}
\newcommand{\relate}[0]{\texttt{Relate}}
\newcommand{\espalier}[0]{\texttt{Espalier}}
\newcommand{\arbores}[0]{\texttt{Arbores}}
\title{A general and efficient representation of ancestral recombination graphs}
\author[1]{Yan~Wong}
\author[2,3$\star$]{Anastasia~Ignatieva}
\author[4,5$\star$]{Jere~Koskela}
\author[6]{Gregor~Gorjanc}
\author[7,8]{Anthony~W.~Wohns}
\author[1$\dagger$]{Jerome~Kelleher}
% \affil[ ]{\mbox{}\vspace{-2.5em}}
\affil[1]{Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK}
\affil[2]{School of Mathematics and Statistics, University of Glasgow, G12 8TA, UK}
\affil[3]{Department of Statistics, University of Oxford, OX1 3LB, UK}
\affil[4]{School of Mathematics, Statistics and Physics, Newcastle University, NE1 7RU, UK}
\affil[5]{Department of Statistics, University of Warwick, CV4 7AL, UK}
\affil[6]{The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, EH25 9RG, UK}
\affil[7]{Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA}
\affil[8]{Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA}
\keywords{Ancestral Recombination Graphs}
\runningtitle{Ancestral Recombination Graphs}
\runningauthor{Wong \textit{et al.}}
\begin{abstract}
As a result of recombination, adjacent nucleotides can have
different paths of genetic inheritance and therefore
the genealogical trees for
a sample of DNA sequences vary along the genome.
The structure capturing the details of these intricately interwoven paths of
inheritance is referred to as an ancestral recombination graph (ARG).
% New developments have made it possible to infer ARGs at scale,
% enabling many new applications in population and statistical genetics.
% This rapid progress has led to a diversity of ARG definitions and representations.
Classical formalisms have focused on mapping
coalescence and recombination events to the nodes in an ARG.
This approach is out of step with some modern developments, however,
which do not represent genetic inheritance in terms of these events
or explicitly infer them.
We present a simple formalism that defines an ARG in terms
of specific genomes and their intervals of genetic inheritance,
and show how it generalises these classical treatments and
encompasses the outputs of recent methods. We discuss
nuances arising from this more general structure, and argue that it
forms an appropriate basis for a software standard in this rapidly
growing field.
\end{abstract}
\begin{document}
\maketitle
\thispagestyle{firststyle}
\marginmark
\firstpagefootnote
% Use the \equalcontrib command to mark authors with equal
% contributions, using the relevant superscript numbers
% \equalcontrib{1}
% \equalcontrib{2}
\makeatletter
\newcommand{\equalcontribfirst}[1]{\@authfootnote{#1}{Joint second author,
listed alphabetically.}}
\equalcontribfirst{$\star$} % {Denotes shared first authorship, listed alphabetically}
% \equalcontriblast{$\dagger$} % {Denotes shared senior authorship, listed alphabetically}
% % \equalcontrib{$\ddagger$}{Denotes corresponding author}
% \footnotetext[1]{}
\correspondingauthoraffiliation{$\dagger$}{\url{[email protected]}}
\vspace{-33pt}% Only used for adjusting extra space in the left column of the first page
% \textbf{Keywords:} Ancestral recombination graphs
% \linenumbers
\section{Introduction}
\label{sec-intro}
% Para 1: ARGs are cool!
% NOTE: slight inaccuracy of "DNA" here offset by making the problem concrete.
Estimating the genetic genealogy of a set of DNA sequences
under the influence of recombination,
usually known as an Ancestral Recombination Graph (ARG), is a long-standing
goal in genetics.
Broadly speaking, an ARG describes the different paths of genetic inheritance
caused by recombination, encapsulating the resulting complex web of genetic
ancestors of a set of sampled genomes.
Recent breakthroughs
in large-scale inference
methods~\citep{rasmussen2014genome,kelleher2019inferring,speidel2019method,
schaefer2021ancestral,wohns2022unified,zhang2023biobank,zhan2023towards,deng2024robust}
have raised the realistic prospect of ARG-based analysis becoming a standard part
of the population and statistical genetics toolkit~\citep{hejase2020summary}.
Applications using inferred ARGs as input have begun to
appear~\citep{osmond2021estimating,
fan2022genealogical,
hejase2022deep,
guo2022recombination,
zhang2023biobank,
nowbandegani2023extremely,
ignatieva2023distribution,
fan2023likelihood,
link2023tree,
grundler2024geographic,
huang2024estimating,
korfmann2024simultaneous,
deraje2024inferring}
and many more are sure to
follow~\citep{harris2019database,harris2023using}.
% Para 2: What are they, though?
Although it is widely accepted that ARGs are important, there is some
confusion about what, precisely, an ARG \emph{is}.
% The grammar is a nightmare here - the stochastic process is singular, right?
In its original form,
developed by Griffiths and colleagues,
the ARG is an alternative
formulation of the coalescent with recombination~\citep{hudson1983properties},
where the stochastic process of coalescence and recombination
among ancestral lineages is formalised
as a graph~\citep{griffiths1991two,
ethier1990two,griffiths1996ancestral,griffiths1997ancestral}.
% Move into plural, an ARG is an instance of the data structure in this view
Subsequently, an ARG has come to be thought of as a data
structure~\citep{minichiello2006mapping}, i.e.\ describing
a \emph{realisation} of such a random process,
or an inferred ancestry of a sample of genomes.
The distinction between stochastic process
and data structure is not clear cut, however, and subfields use the term
differently (Appendices~\ref{sec-arg-history} and~\ref{sec-big-and-little-arg}).
% % TODO rework these two sentences. Point about models vs structures
% % is an important one, we need to make that.
% Recent large-scale methods infer approximate \emph{structures}
% instead of a complete and fully detailed history
% (Appendix~\ref{sec-survey-arg-infer}), and it is
% therefore important to distinguish between
% approximate structures and approximate models.
% Whether an inference method is based
% on heuristics or a rigorous mathematical model is orthogonal
% to the level of detail provided in its estimate.
The term ``ARG'' therefore has many different meanings, varying
over time and depending on context.
There is, however, an emerging consensus to use the term
in quite a general sense~\citep[e.g.][]{mathieson2020ancestry,hejase2020summary,
schaefer2021ancestral,harris2023using,zhang2023biobank,fan2023likelihood},
informally
encompassing the varied structures output by modern simulation and
inference methods~\citep{rasmussen2014genome, palamara2016argon, haller2018tree,
kelleher2019inferring, speidel2019method, baumdicker2021efficient,
zhang2023biobank}.
However, there is currently no formal definition or systematic
discussion that unifies these different structures,
which stifles progress in this vibrant research area.
In this perspective we provide a simple formal definition
of an ARG data structure which generalises classical
definitions and encompasses the output of modern simulation and inference
methods. We show that different levels of approximation
are possible using this structure, illustrated via examples.
% This is a bit crap, but do want to make this point about tskit for the
% "only read the introduction" crowd.
The proposed ARG definition is the basis of the widely used \texttt{tskit}
library, which provides a powerful software platform
for ARG-based analysis and, we argue, would be a useful community standard.
This perspective is intended for ``ARG practitioners'', who we hope
will find the detailed examples, technical appendices, and
comprehensive bibliography useful.
Readers seeking an introduction to ARGs and their applications are
directed to \citet{lewanski2024era}
and \citet{brandt2024promise}.
\section{Genome ARGs}
\label{sec-gARG}
We define a genome as the complete set of genetic material that a child
inherits from one parent. A diploid individual
% Q: will someone nitpick that we're assuming diploid obligate sexual species?
therefore carries two genomes, one inherited from each parent (we assume diploids and consider nuclear autosomal DNA here
for clarity, but the definitions apply to organisms of arbitrary ploidy).
We will also use the term ``genome'' in its
more common sense of ``the genome'' of a species,
and hope that the distinction will be clear from the context.
We are not concerned here with mutational processes or observed sequences,
but consider only processes of inheritance,
following the standard practice in coalescent theory.
We also do not consider structural variation, and assume that all
samples and ancestors share the same genome coordinate space.
A genome ARG (gARG) is a directed acyclic graph in which nodes represent
haploid genomes and edges represent
genetic inheritance between an ancestor and a descendant.
The topology of a gARG specifies that genetic inheritance
occurred between
ancestors and descendants, but the graph connectivity
does not tell us which \emph{parts} of their genomes were inherited.
In order to capture the effects of recombination
we ``annotate'' the edges with the genome
coordinates over which inheritance occurred.
This is sufficient to describe the effects of inheritance under
any form of homologous recombination (such as multiple crossovers during a single round of meiosis,
gene conversion events, and many forms of bacterial and viral recombination).
We can define a gARG formally as follows.
Let $N = \{1, \dots, n\}$ be the set of nodes representing
the genomes in the gARG,
and $S \subseteq N$ be the set of sampled genomes.
Then, $E$ is the set of edges, where each element
is a tuple $(c, p, I)$ such that $c, p \in N$ are the child and
parent nodes and $I$ is the set of disjoint genomic intervals
over which genome $c$ inherits from $p$.
Thus, each topological connection between
a parent and child node in the graph is annotated with a set of
inheritance intervals $I$.
Here, the terms parent and child are used in the graph sense;
these nodes respectively represent ancestor and descendant genomes,
which can be separated by multiple generations.
We will use these two sets of terms interchangeably.
How nodes are interpreted, exactly, is application dependent.
Following \citet{hudson1983properties}, we can view nodes
as representing gametes, or we can imagine them representing,
for example, the genomes present in cells immediately before or after
some instantaneous event (Appendix~\ref{sec-cell-lineages-and-args}).
A node can represent any genome along a chain of cell divisions
or can be interpreted as representing one of the genomes of a
potentially long-lived individual.
In many settings, nodes are dated, i.e.\ each
node $u\in N$ is associated with a time $\tau_u$,
and how we assign precise times will vary by application.
The topological ordering defined by the directed graph structure
and an arrow of time (telling us which direction is pastwards)
is sufficient for many applications, however,
and we assume node dates are not known here.
% This should be obvious, but seems like it needs to be said
In practical settings, we will wish to associate additional
metadata with nodes such as sample identifiers or quality-control metrics.
It is therefore best to think of the
integers used here in the definition of a node as an \emph{identifier},
with which arbitrary additional information can be associated.
\begin{figure*}
\begin{center}
\includegraphics[width=\textwidth]{illustrations/arg-in-pedigree}
\end{center}
\caption{\label{fig-arg-in-pedigree}
An example genome ARG (gARG) embedded in a pedigree.
\textbf{a.} Diploid individuals (shaded background squares / circles), visualised in a highly inbred pedigree and
labelled $d_1$ to $d_8$,
contain both paternal and maternal genomes
labelled \noderef{A} to \noderef{P}. Lines show inheritance paths connecting
genomes in the current generation (\noderef{A} to \noderef{D}) with their ancestors.
Genomes \noderef{A} and \noderef{C} are the product of two independent
meioses (recombination events, with italicised breakpoint positions) between
the paternal genomes \noderef{E}
and \noderef{F}; the regions of inherited genome are shaded.
Further shading highlights genomic regions that merge into a common ancestor:
merged regions gradually darken until fully coalesced.
\textbf{b.} The corresponding gARG along with inheritance annotations on all edges
(partial genomic inheritance in bold).
\textbf{c.} The corresponding local trees.
}
\end{figure*}
As illustrated in Fig.~\ref{fig-arg-in-pedigree},
the gARG for a given set of individuals is embedded in their pedigree.
The figure shows the pedigree of eight diploid individuals
and their sixteen constituent genomes (each consisting of a single chromosome),
along with paths of genetic inheritance.
Here, and throughout,
nodes are labelled with uppercase alphabetical letters
rather than integer identifiers to avoid confusion with genomic intervals.
Thus individual $d_1$ is composed
of genomes \noderef{A} and \noderef{B}, which are inherited from its
two parents $d_3$ and $d_4$. Each inherited genome may be the recombined product
of the two genomes belonging to an individual parent.
In this example,
genome \noderef{B} was inherited directly from $d_4$'s genome \noderef{G} without
recombination, whereas
genome \noderef{A} is the recombinant product of
$d_3$'s genomes \noderef{E} and \noderef{F} crossing over at position 2.
Specifically, genome \noderef{A} inherited the (half-closed)
interval $[0, 2)$ from genome \noderef{E} and $[2, 10)$ from genome \noderef{F}.
These intervals are shown attached to the corresponding graph edges.
The figure shows the annotated pedigree with realised inheritance of genomes
between generations (a), the corresponding gARG (b), and finally the corresponding
sequence of local trees along the
genome (c).
The local trees span the three genome regions delineated
by the two recombination breakpoints that gave rise to these genomes;
see Appendix~\ref{sec-ARG-and-local-trees} for details
on how local trees are embedded in an ARG.
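To connect this example back to the formal definition above, the inheritance
described for genomes \noderef{A} and \noderef{B} can be written as elements
of the edge set. This is an illustrative sketch only: it assumes a genome of
length 10 (read off the upper bound of the interval $[2, 10)$) and uses the
letter labels of Fig.~\ref{fig-arg-in-pedigree} in place of integer identifiers.
\begin{align*}
&(\noderef{A}, \noderef{E}, \{[0, 2)\}), \\
&(\noderef{A}, \noderef{F}, \{[2, 10)\}), \\
&(\noderef{B}, \noderef{G}, \{[0, 10)\}).
\end{align*}
The two intervals on the edges joining \noderef{A} to its parents are disjoint
and together cover $[0, 10)$, which is how the single crossover at position 2
is encoded; the single full-length interval on the edge joining \noderef{B}
to \noderef{G} encodes inheritance without recombination.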
% % % I feel like sophisticated readers would at this point be saying,
% % % ah, yeah, that's basically what an ARG is just in different words;
% % % what's the point of this paper?
% The genome ARG framework defined here is
% in many ways simply a clarification of existing treatments
% \cite[e.g.][]{mathieson2020ancestry,shipilina2023origin},
% adding concrete details to describe the
% differential inheritance of genetic material between genomes.
% % % This is definitely needed for the Nicks, who are always thinking
% % % about the deeper processes and are quite happy being loosey-goosey
% % % with representation details
% In the next section we contrast this gARG representation with a classical event-based (eARG) representation of the same underlying process. The choice between using a gARG or an eARG to represent the general process of ancestral inheritance has
% % Again, need to remind the reader that there's an actual point
% % to all this getting stuck into the details.
% important
% consequences not just when addressing the practical need to store and compare ARGs, but more fundamentally when establishing goals for inferring ARGs and evaluating the success of such inferences.
\section{Event ARGs}
\label{sec-eARG}
A classical view of an ARG data structure,
described explicitly in several publications~\citep[e.g.][]{
wiuf1999recombination,gusfield2014recombinatorics,hayman2023recoverability},
interprets nodes not as genomes but as historical \emph{events}
(but see~\citet{parida2011minimal} and \citet{zhang2023biobank}
for notable exceptions).
This Event ARG (eARG) encoding is the basis of the output formats
created by multiple ARG inference tools
\citep[e.g.][]{song2004minimum,song2005efficient,rasmussen2014genome,
heine2018bridging,ignatieva2021kwarg}.
% Griffiths never mentions "sampling events", so let's not get
% into that. It's not an important detail, and we don't want to be
% overly precise about something that people haven't actually written
% down.
In this encoding there are two types of internal node in the graph,
representing the most recent common ancestor and recombination events
in the history of a sample.
At common ancestor nodes, the inbound lineages merge into a
single ancestral lineage with one parent, and at recombination
nodes a single lineage is split into two independent
ancestral lineages. Recombination nodes are annotated with
the corresponding crossover breakpoints, and these breakpoints
are used to construct the local trees.
This is done by tracing pastwards through the graph from the samples,
making decisions about which outbound edge to follow through
recombination nodes based on the breakpoint
position~\citep{griffiths1996ancestral}. Fig.~\ref{fig-event-arg} shows an example of an
eARG with three sample genomes (\noderef{A}, \noderef{B}, and \noderef{C}),
three common ancestor events (\noderef{E}, \noderef{F}, and \noderef{G})
and a single recombination event (node \noderef{D}) with a breakpoint
at position $x$.
Assigning a breakpoint to a recombination node is
not sufficient to uniquely define the local trees, and either
some additional ordering rules~\citep[e.g.][]{griffiths1996ancestral} or
explicit information~\citep[e.g.][]{gusfield2014recombinatorics,ignatieva2021kwarg}
is required to distinguish the left and right parents.
We assume in Fig.~\ref{fig-event-arg} that \noderef{D} inherits genetic material to the
left of $x$ from \noderef{E} and to the right of $x$ from \noderef{F}.
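The traversal convention can be written out for this example. As a sketch,
write $\pi_y(u)$ for the parent of node $u$ that is followed at genome
position $y$ (this notation is introduced here for illustration only), and
assign position $x$ itself to the right-hand side, matching the half-closed
intervals used elsewhere in this paper. Then
\[
\pi_y(\noderef{D}) =
\begin{cases}
\noderef{E} & \text{if } y < x, \\
\noderef{F} & \text{if } y \geq x,
\end{cases}
\]
while every other non-root node has a single parent at all positions.
Tracing pastwards from sample \noderef{B} therefore visits
\noderef{D}, \noderef{E}, and \noderef{G} when $y < x$, and
\noderef{D}, \noderef{F}, and \noderef{G} when $y \geq x$, so that
\noderef{B} joins \noderef{A} below \noderef{E} in the left-hand local tree
and joins \noderef{C} below \noderef{F} in the right-hand one.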
\begin{figure}
\centering
\tikzmath{\x1 =0; \x2=8.5;\xx=12.5; \x3=13.5; \xt=18;}
\begin{tikzpicture}[x=5mm, y=5mm, node distance=2mm and 20mm]
\tikzset{greynode/.style={circle,fill,inner sep=1},
nodelabel/.style={font=\footnotesize}}
\node [anchor=north west] at (\x1-.5,5.5) {\textsf{\textbf{\LARGE a}}};
\node [anchor=north west] at (\x2-.5,5.5) {\textsf{\textbf{\LARGE b}}};
% \node [anchor=north west] at (\xt,6) {C};
%%% (a) ARG
\node (s0) [greynode] at (\x1 + 0, 0) {};
\node (s1) [greynode] at (\x1 + 3, 0) {};
\node (s2) [greynode] at (\x1 + 6, 0) {};
\node (s3) [greynode] at (\x1 + 3, 1) {};
\node (s4) [greynode] at (\x1 + 1, 2) {};
\node (s5) [greynode] at (\x1 + 5, 3) {};
\node (s6) [greynode] at (\x1 + 3, 4) {};
\draw (s1) -- (s3);
\draw (s0) |- (s4);
\draw (s4) -- (\x1 + 2,2) |- (s3);
\draw (s4) |- (s6);
\draw (s3) -- (\x1 + 4,1) |- (s5);
\draw (s2) |- (s5);
\draw (s5) |- (s6);
%%% (b) Trees
\node (l0) [greynode] at (\x2 + 0, 0) {};
\node (l1) [greynode] at (\x2 + 2, 0) {};
\node (l2) [greynode] at (\x2 + 3, 0) {};
\node (l3) [greynode] at (\x2 + 1, 2) {};
\node (l4) [greynode] at (\x2 + 2, 4) {};
\draw (l0) |- (l3);
\draw (l1) |- (l3);
\draw (l2) |- (l4);
\draw (l3) |- (l4);
\node (r0) [greynode] at (\x3 + 0, 0) {};
\node (r1) [greynode] at (\x3 + 1, 0) {};
\node (r2) [greynode] at (\x3 + 3, 0) {};
\node (r3) [greynode] at (\x3 + 2, 3) {};
\node (r4) [greynode] at (\x3 + 1, 4) {};
\draw (r0) |- (r4);
\draw (r1) |- (r3);
\draw (r2) |- (r3);
\draw (r3) |- (r4);
\foreach \u/\lab in {
s0/$\noderef{A}$, s1/$\noderef{B}$, s2/$\noderef{C}$,
r0/$\noderef{A}$, r1/$\noderef{B}$, r2/$\noderef{C}$,
l0/$\noderef{A}$, l1/$\noderef{B}$, l2/$\noderef{C}$}
\node[nodelabel,anchor=south] at ([yshift=-12pt]\u) {\lab};
\foreach \u/\lab in {
s6/$\noderef{G}$,
l4/$\noderef{G}$,
r4/$\noderef{G}$}
\node[nodelabel,anchor=south] at (\u) {\lab};
\node [nodelabel,anchor=north west] at ($(s3) + (\x1 + 0,0)$) {$x$};
\foreach \u/\lab in {
s4/$\noderef{E}$,
l3/$\noderef{E}$}
\node[nodelabel,anchor=south west] at (\u) {\lab};
\foreach \u/\lab in {
s5/$\noderef{F}$,
r3/$\noderef{F}$}
\node[nodelabel,anchor=south east] at (\u) {\lab};
\foreach \u/\lab in {s3/$\noderef{D}$}
\node[nodelabel,anchor=south] at (\u) {\lab};
\draw[dashed] (\xx,0) -- (\xx, 4);
\node[nodelabel,anchor=north] at (\xx,0) {$x$};
% %%% (c) Encoding
% \node [nodelabel,anchor=north west] at ($(\xt,5)$) {
% \begin{tabular}{c|c|l}
% % \multicolumn{2}{c}{Breakpoints}\\
% Node & Breakpoint & Parents\\
% \hline
% $\noderef{A}$ & $\varnothing$ & [\noderef{E}]\\
% $\noderef{B}$ & $\varnothing$ & [\noderef{D}] \\
% $\noderef{C}$ & $\varnothing$ & [\noderef{F}]\\
% $\noderef{D}$ & $x$ & [\noderef{E}, \noderef{F}] \\
% $\noderef{E}$ & $\varnothing$ & [\noderef{G}]\\
% $\noderef{F}$ & $\varnothing$ & [\noderef{G}]\\
% $\noderef{G}$ & $\varnothing$ & []\\
% \end{tabular}};
\end{tikzpicture}
\caption{\label{fig-event-arg}
A classical event ARG (eARG). \textbf{a.} Standard graph depiction with
breakpoint $x$ associated with the recombination node \noderef{D}.
Nodes \noderef{E}, \noderef{F} and \noderef{G} are common ancestor events.
\textbf{b.} Corresponding local trees to the left and right of breakpoint $x$
(note these are shown in the conventional form in which only coalescences
within the local tree are included, hence \noderef{D} is omitted; see
Appendix~\ref{sec-ARG-and-local-trees} for a discussion of this important point).
}
\end{figure}
% eARGs are representationally for limited interchange
While this approach of annotating recombination nodes with a
breakpoint in an eARG is a concise and elegant way of describing realisations
of the coalescent, it has limitations.
The eARG encoding explicitly models only
two different types of event; thus anything that is not a single crossover
recombination or common ancestor event must be incorporated
either in a roundabout way using these
events, or by adding new types of event to the encoding. For example, gene
% Graham asked for coalescences with GC papers, but this is the only one
% I know of?
conversion~\citep{wiuf2000coalescent} could
be accommodated either by stipulating a third type of event
(annotated by two breakpoints and corresponding traversal conventions for
recovering the local trees) or by two recombination nodes joined by a
zero-length edge. The gARG encoding described in the previous
section offers a simpler and more direct solution.
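To make the comparison concrete, consider a hypothetical gene conversion in
which a child genome $c$ inherits a tract $[a, b)$, with $0 < a < b < L$,
from one parent genome $q$ and the remainder of a genome of length $L$ from
the other parent genome $p$ (the symbols $a$, $b$, $p$, and $q$ are introduced
here purely for illustration). Under the gARG definition this requires only
two edges,
\[
(c, p, \{[0, a), [b, L)\})
\quad\text{and}\quad
(c, q, \{[a, b)\}),
\]
with no additional node or event type, because the annotation $I$ on an edge
is a \emph{set} of disjoint intervals rather than a single breakpoint.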
Aside from these practical challenges, there is also a deeper
issue with the implicit strategy of basing an ARG data structure on
recording events and their properties (e.g.\ the crossover breakpoint
for a recombination event).
This approach
requires all events to be recorded explicitly, and does not
provide an obvious mechanism for aggregating multiple, potentially
unresolvable, events.
As datasets approach the population scale~\citep[e.g.][]{
turnbull2018hundred, bycroft2018genome,hayes20191000,
Ros-Freixedes2020,karczewski2020mutational,tanjo2021practical,
halldorsson2022sequences} representing such uncertainty
directly through the data structure is a useful alternative to
classical methods based on probabilistic sampling.
% There is also a certain clarity gained by explicitly modelling nodes
% in the inheritance graph as genomes.
% Outside of the context of a
% % The point being, events are perfectly well defined in the models
% % but not in reality
% mathematical model, an ``event'' is a slippery concept.
% For example, \emph{which} genome along a chain of cell divisions should be
% regarded as the one where an event occurred,
% or whether multiple coalescences
% within a single individual should be regarded as one or multiple events are
% debatable points (Appendix~\ref{sec-cell-lineages-and-args}).
% % If we want to have an ARG software ecosystem then such fireside
% % discussions won't help.
% % The gARG formulation avoids the need for such considerations,
% % so is more naturally compatible with forming an ecosystem of interoperable
% % inference and analysis methods.
\section{Ancestral material and sample resolution}
\label{sec-ancestral-material}
Ancestral material~\citep{wiuf1999ancestry,wiuf1999recombination}
is a key concept in understanding the overall inheritance structure
of an ARG.
It denotes the genomic intervals ancestral to a set of samples
on the edges of an ARG.
For example, in Fig.~\ref{fig-arg-in-pedigree} we have
four sample genomes, \noderef{A}--\noderef{D}. As we
trace their genetic ancestry into the previous generation
(\noderef{E}--\noderef{H}), we can think of their ancestral
material propagating through the graph
pastwards. In the region $[2, 7)$, there is a
local coalescence where nodes \noderef{A} and \noderef{C}
find a common ancestor in \noderef{F}. Thus, in this region,
we have three genome segments that are ancestral to the
four samples. Fig.~\ref{fig-arg-in-pedigree}a
illustrates this by (shaded) ancestral material being present
in only three nodes (\noderef{F}, \noderef{G}, and \noderef{H}) in this region,
while node \noderef{E} is blank
as it carries \emph{non-ancestral} material.
This process of local coalescence continues through the
graph, until all samples reach their most recent common
ancestor in node \noderef{N}.
The process of tracking local coalescences and updating
segments of ancestral material is a core element of
Hudson's seminal simulation
algorithm~\citep{hudson1983testing,kelleher2016efficient}.
The ability to store resolved ancestral material
is also a key distinction between the eARG and gARG
encodings. Because an eARG stores only the graph topology and
recombination breakpoints, there is no way to locally
ascertain ancestral material without traversing the graph
pastwards from the sample nodes,
resolving the effects of recombination and common ancestor events.
Efficiently propagating and resolving ancestral material for
a sample through a pre-existing graph is a well-studied problem,
and central to recent advances in individual-based forward-time
simulations~\citep{kelleher2018efficient,haller2018tree}.
In contrast to the usual ``retrospective'' view of ARGs
discussed so far, these methods record an ARG forwards in
time in a ``prospective'' manner. Genetic inheritance relationships
and mutations are recorded exhaustively, generation-by-generation,
leading to a rapid build-up of information, much of which
will not be relevant to the genetic ancestry of a future population.
This redundancy is periodically removed using the ``simplify''
algorithm~\citep{kelleher2018efficient}, which propagates and
resolves ancestral material.
Efficient simplification is the key enabling factor for
this prospective-ARG based approach to forward-time simulation,
which can be orders of magnitude faster than standard
sequence-based methods
(see Appendix~\ref{sec-ARG-simplification} for
other applications of ARG simplification).
% Define "sample-resolved" here for later use
We refer to a gARG that has been simplified with respect to a set of
samples, such that the inheritance annotations on its edges contain
no non-ancestral material, as sample-resolved.
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{illustrations/ancestry-resolution}
\caption{\label{fig-ancestry-resolution}
Converting the \citet[][Fig.~1]{wiuf1999recombination} example
to a sample-resolved gARG. \textbf{a.} The original eARG; nodes
represent sampling, common ancestor, and recombination events (small shaded, small blue, and large red rectangles respectively); the latter contain breakpoint positions.
\textbf{b.} The corresponding gARG with breakpoints directly converted to
edges annotated with inheritance intervals.
\textbf{c.} The sample-resolved gARG resulting from simplifying with respect
to the sample genomes, \noderef{A}, \noderef{B}, and \noderef{C}.
Dashed lines show edges that are
no longer present (in practice, nodes \noderef{G}, \noderef{J}, and \noderef{Q} would also be removed).
Coalescence with respect to the sample is indicated by shaded bars, as
in Fig.~\ref{fig-arg-in-pedigree}a; nodes \noderef{N}, \noderef{O}, \noderef{P}, \noderef{Q} have truncated
bars showing that local ancestry of entirely coalesced regions is omitted.
Line thickness is proportional to the genomic span of each edge.
Nodes representing recombination events are retained
for clarity, but could be removed by simplification if
desired.
}
\end{figure*}
Any eARG can be converted to a sample-resolved gARG
via a two-step process illustrated in Fig.~\ref{fig-ancestry-resolution}.
The first step is to take the input eARG (Fig.~\ref{fig-ancestry-resolution}a),
duplicate its graph topology, and then add inheritance annotations
to each of the gARG's edges (Fig.~\ref{fig-ancestry-resolution}b) as follows.
If a given node is a common ancestor event, we annotate the single
outbound edge with the interval $[0,L)$, for a genome of length $L$. If the
node is a recombination event with a breakpoint $x$, we annotate the two
outbound edges respectively with the intervals $[0, x)$ and $[x, L)$. These
inheritance interval annotations are clearly in one-to-one correspondence with
the information in the input eARG. They are also analogous to the
inheritance intervals we get on the edges in a prospective gARG
produced by a forward-time simulation, which are concerned with recording
the direct genetic relationship between a parent and child genome and are not
necessarily minimal in terms of the ancestral material of a sample.
Thus, the final step is to use the ``simplify'' algorithm to resolve the
ancestral material of the samples (Fig.~\ref{fig-ancestry-resolution}c).
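
A minimal sketch of this two-step procedure, in notation that we introduce
only for illustration ($I_{uv}$, $I'_{uv}$ and $A_u$ are not used elsewhere
in the text), is as follows. Let $I_{uv} \subseteq [0, L)$ be the inheritance
interval annotated on the edge from child $u$ to parent $v$ in the first step,
and let $A_u \subseteq [0, L)$ be the ancestral material carried by node $u$,
with $A_s = [0, L)$ for each sample $s$. Working pastwards from the samples,
each edge retains only the ancestral material passing through it, and
each parent accumulates the material of its children:
\[
  I'_{uv} = A_u \cap I_{uv},
  \qquad
  A_v = \bigcup_{u} I'_{uv},
\]
where the union is over the children $u$ of $v$. Edges with
$I'_{uv} = \emptyset$ are removed. For example, taking the hypothetical
values $L = 10$ and $A_u = [0, 3)$ for a recombination node $u$ with
breakpoint $x = 5$, the two outbound edges retain
$[0, 3) \cap [0, 5) = [0, 3)$ and $[0, 3) \cap [5, 10) = \emptyset$,
so the second edge is dropped. This sketch deliberately ignores
a further step performed by simplification, in which regions that
have fully coalesced are also removed from the ancestral material
passed further back in time.
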
The sample-resolved gARG of Fig.~\ref{fig-ancestry-resolution}c
differs in some important ways from the
original eARG (Fig.~\ref{fig-ancestry-resolution}a).
Firstly, we can see that some nodes and edges have been removed entirely
from the graph.
The ``grand MRCA'' \noderef{Q} is omitted from the
sample-resolved gARG because all segments of the genome have
fully coalesced in \noderef{K} and \noderef{P} before \noderef{Q} is reached.
Likewise, the edge
between \noderef{G} and \noderef{J} is omitted because the recombination
event at position $5$ (represented by node \noderef{G})
fell in non-ancestral material.
More generally, we can see that the sample-resolved
gARG of Fig.~\ref{fig-ancestry-resolution}c
allows for ``local'' inspection
of an ARG in a way that is not possible in an eARG.
Because the ancestral material is stored with each edge of a gARG, the
cumulative effects of events over time can be reasoned
about, without first ``replaying'' those events.
Many computations
that we wish to perform on an ARG will require resolving
the ancestral material with respect to a set of samples.
% this is weak, but would be good to emphasise this point.
% you just end up running simplify lots of times, in your
% downstream algs
The gARG encoding
allows us to perform this once
and to store the result,
whereas the eARG encoding requires us to repeat the process
each time.
% Note that the \citet{wiuf1999recombination} eARG
% in Fig.~\ref{fig-ancestry-resolution} is not particularly
% representative, because inference or simulation methods usually
% only generate ARGs containing nodes and edges ancestral to the sample
% (but see the discussion of the ``Big ARG'' stochastic process in
% Appendix~\ref{sec-big-and-little-arg}).
% Nonetheless, it is an instructive example from the literature which highlights several
% important properties of ARGs, and the general point about
% the need to resolve ancestral material ``on the fly'' for eARG traversals
% holds.
\section{A diversity of structures}
\begin{figure*} \begin{center}
\includegraphics[width=\textwidth]{illustrations/inference.pdf} \end{center}
\caption{\label{fig-inferred-args} Inference of sample-resolved ARGs for 11
\textit{Drosophila melanogaster} DNA sequences over a 2.4\,kb
region of the ADH locus~\citep{kreitman1983nucleotide}.
Results for four different methods:
\textbf{a.} \kwarg; \textbf{b.} \argweaver; \textbf{c.} \tsinfer; and \textbf{d.} \relate, converted
to the standard \texttt{tskit} gARG encoding. See
Appendix \ref{sec-example-inferred-args} for details of these methods.
Edge colours indicate time of the edge's child node
(lighter: older; darker: younger).
Vertical and horizontal positions of graph nodes are arbitrary.
Line width and node colour are as described in Fig.~\ref{fig-simplification}.
Bottom row graphics show the genome positions, relative to the start of
the ADH gene, for each graph edge from the corresponding ARG. Edge intervals
are drawn as horizontal lines, stacked in time order (edges with youngest
children at the bottom); vertical dashed lines denote breakpoints between
local trees.
}
\end{figure*}
A key goal of this perspective is to highlight the heterogeneity of
the graph structures inferred by modern ARG inference methods.
To illustrate this point, Fig.~\ref{fig-inferred-args} shows the
output of
\kwarg~\citep{ignatieva2021kwarg},
\argweaver~\citep{rasmussen2014genome},
\tsinfer~\citep{kelleher2019inferring},
and \relate~\citep{speidel2019method}
on the classical \cite{kreitman1983nucleotide} dataset.
The ARGs in Fig.~\ref{fig-inferred-args}a, b are precise
estimates (Appendix~\ref{sec-precision}),
with each node corresponding to a common ancestor
or recombination event, or equivalently, either having two children
or two parents.
In contrast, the ARGs in Fig.~\ref{fig-inferred-args}c, d do not
have this clear-cut interpretation, and the nodes
can simultaneously have more than two children and more
than two parents. Another dimension of variability among the ARGs
is that the first three methods infer nodes that have
a ``coalescence span'' greater than 0 and less than 100\%, indicating that there
are nodes that are ``locally unary'' (Appendix~\ref{sec-locally-unary-edges}),
but mark a coalescence between lineages elsewhere along the sequence.
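
One way to make the ``coalescence span'' precise, in notation that we
assume purely for illustration, is to let $S_u \subseteq [0, L)$ be the set of
genome positions at which node $u$ appears in the local tree, and
$C_u \subseteq S_u$ the subset of those positions at which $u$ has two or more
children in the local tree. The coalescence span of $u$ is then the proportion
\[
  c_u = \frac{|C_u|}{|S_u|},
\]
where $|\cdot|$ denotes total length along the genome. Under this reading,
a node with $c_u = 1$ is coalescent wherever it is present, a node with
$c_u = 0$ is unary everywhere, and a node with $0 < c_u < 1$ is locally
unary over part of its span. For instance, a hypothetical node with
$S_u = [0, 100)$ and $C_u = [0, 30)$ has $c_u = 0.3$.
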
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{illustrations/simplification}
\caption{\label{fig-simplification}
Levels of ARG simplification.
\textbf{a.} An example gARG simulated from a diploid Wright-Fisher model.
\textbf{b.} Remove all
singly-connected graph components (e.g., diamonds such as \noderef{JLMN}).
\textbf{c.} Remove nodes that never represent coalescences,
i.e.\ are unary everywhere (e.g.\ \noderef{N} and \noderef{R}).
\textbf{d.} Rewrite edges to bypass nodes in local trees in which they are unary
(often described as ``fully simplified'').
In each case, the graph is shown on the left
and corresponding local trees on the right.
% Repetition here, but I'm thinking about the casual "I'm just reading
% the captions reader" who might not realise that there's an essential
% ingredient missing from the graph version
In the interest of visual clarity, inheritance intervals are not shown
on the graph edges; Supplementary Fig.~S1
shows the graphs with these inheritance intervals included.
Graph nodes are coloured by the number of parents and shaded
according to the proportion of their span over which they are coalescent;
see the text for more details.
}
\end{figure*}
A key feature of the gARG encoding is that it enables these varying levels
of precision to be represented.
These ideas are illustrated in Fig.~\ref{fig-simplification}, which shows
different levels of ``simplification'' (Appendix~\ref{sec-ARG-simplification})
of the same underlying simulated ARG. The full ARG, with all coalescent and
recombination events represented by separate genomes, is shown in
Fig.~\ref{fig-simplification}a. Simpler representations can be formed by
removing ``unknowable'' nodes such as those in singly-connected graph components (Fig.~\ref{fig-simplification}b) and collapsing
multiple recombinations into a single child or multiple coalescences into a
single parent (Fig.~\ref{fig-simplification}c).
Finally, Fig.~\ref{fig-simplification}d is a ``fully simplified''
ARG, in which only coalescences in local trees are retained.
Note that while ARGs of this type (produced by default by the \texttt{msprime}
simulator, for example) lack a significant level of detail, they still
retain the key feature of shared node identity across local trees.
This ability to represent an ARG to differing degrees of precision is a powerful
feature. In particular, when inferring ARGs from genome
sequencing data, the timing, positions, and even the number of recombination
events generally cannot be inferred precisely. For example, under
coalescent-based models, the proportion of recombination events that change the
ARG topology grows very slowly with sample size \citep{hein2004gene}, and of those
events only a small proportion are actually detectable from the data, assuming
human-like mutation and recombination
rates \citep{myers2002detection,hayman2023recoverability}.
Even when a recombination event \emph{is} detectable, its timing and breakpoint
position can only be inferred approximately, depending on how much information
can be elucidated from mutations in the surrounding genomic region.
A gARG can encode a diversity of ARG structures, including
those where events \emph{are} recorded explicitly, and those where
they are treated as fundamentally uncertain and thus not explicitly inferred (Appendix~\ref{sec-precision}).
% Because a gARG
% can encode a diversity of ARG structures, it
% allows this fundamental uncertainty in inference to be appropriately represented
% (Appendix~\ref{sec-precision})
% For general
% ARGs, for instance those representing pedigrees, or genealogies of organisms with
% low mutation rates and complex patterns of recombination, such inference can be
% even more challenging.
%The fact that the eARG encoding \emph{requires}
%precise information about recombination is therefore a fundamental limitation.
\section{Implementation and efficiency}
\label{sec-efficiency}
The gARG encoding can lead to highly efficient storage and processing of ARG data,
and has been in use for several years.
The succinct tree sequence data structure
(usually known as a ``tree sequence'' for brevity)
is a practical gARG implementation focused on efficiency.
It was originally developed as part of the \texttt{msprime}
simulator~\citep{kelleher2016efficient} and has subsequently been
extended and applied to forward-time
simulations~\citep{kelleher2018efficient,haller2018tree},
inference from
data~\citep{kelleher2019inferring,wohns2022unified,zhan2023towards},
and calculation of population genetics statistics~\citep{ralph2020efficiently}.
The succinct tree sequence encoding extends the basic definition
of a gARG provided here by stipulating a
simple tabular representation of nodes and edges,
and also defining a concise representation of
sequence variation using the ``site'' and ``mutation'' tables.
The key property of the succinct tree sequence encoding
that makes it an efficient substrate for defining analysis
algorithms is that it allows us to sequentially
recover the local trees along the genome very efficiently,
and in a way that allows us to reason about the \emph{differences}
between those trees~\citep{kelleher2016efficient,ralph2020efficiently}.
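
As a small illustration of this tabular representation, consider the
following hypothetical example, constructed here for exposition and not taken
from any dataset or figure. Three sample nodes (0, 1 and 2) and two
ancestral nodes (3 and 4, with 4 older than 3) are related over a genome of
length $L = 10$. The column names follow the \texttt{tskit} edge table, with
each row describing inheritance over the half-open interval
$[\mathrm{left}, \mathrm{right})$:
\begin{center}
\begin{tabular}{rrrr}
left & right & parent & child \\
\hline
0 & 5  & 3 & 0 \\
0 & 10 & 3 & 1 \\
5 & 10 & 3 & 2 \\
0 & 5  & 4 & 2 \\
5 & 10 & 4 & 0 \\
0 & 10 & 4 & 3 \\
\end{tabular}
\end{center}
These six rows encode two local trees. Over $[0, 5)$, node 3 is the parent of
0 and 1, and node 4 is the parent of 3 and 2; over $[5, 10)$, node 3 is the
parent of 1 and 2, and node 4 is the parent of 3 and 0. Moving from the first
tree to the second requires only removing the two edges that end at position 5
and inserting the two that start there, while the two edges spanning
$[0, 10)$ are shared between the trees.
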
The \texttt{tskit} library is a liberally
licensed open source toolkit that provides a comprehensive suite
of tools for working with gARGs (encoded as a succinct tree sequence).
Based on core functionality written
in C, it provides interfaces in C, Python and Rust.
Tskit is mature software, widely used in population genetics, and
has been incorporated into numerous downstream
applications~\citep[e.g.,][]{haller2019slim,speidel2019method,
adrion2020community,
terasaki2021geonomics,
baumdicker2021efficient,
fan2022genealogical,
guo2022recombination,
korfmann2022weak,
mahmoudi2022bayesian,
petr2023slendr,
rasmussen2022espalier,
zhang2023biobank,
nowbandegani2023extremely,
ignatieva2023distribution,
fan2023likelihood,
tsambos2023link,
tagami2024tstrait,
korfmann2024simultaneous}.
The technical details of \texttt{tskit}, and how it provides an
efficient and portable platform for ARG-based analysis, are beyond
the scope of this manuscript.
% In the interest of avoiding confusion, however, we list a
% few minor details in which the formal details of gARGs
% provided in Section~\ref{sec-gARG} differ from their practical implementation in
% \texttt{tskit}.
% Firstly, ``edges'' in tree sequence terminology would perhaps be better
% described as ``edge-intervals'', as each describes a single contiguous
% interval of genome inheritance between a pair of nodes.
% This denormalisation of the gARG data model is for efficiency purposes.
% Secondly, zero- rather than one-based indexing is used for
% nodes in ARGs and oriented trees; consequently $-1$ is used to denote the presence of
% roots (rather than $0$ as used here for notational simplicity).
\section{Discussion}
\label{sec-discussion}
% Recent breakthroughs have finally made large-scale ARG inference
% feasible in practice, leading to a surge of interest
% in inference methods, their evaluation, and their application to biological questions.
% The prospect of ARGs being used routinely within population
% and statistical genetics is tantalising,
% but in reality there is substantial work to be done to
% enable this.
% A necessary first step is a degree of terminological clarity.
% As reviewed in Appendix~\ref{sec-arg-history}, the term
% ``ancestral recombination graph" has several
% subtly different interpretations, depending on context.
% The trend to decouple ARGs from their original definition
% within the context of stochastic
% processes and instead use the term as a more general representation of any
% recombinant genetic ancestry seems useful.
% Our intent here is to clarify and systematise this decoupling. Thus
% we can think of an ARG as any structure that encodes the
% reticulate genetic ancestry of a sample of colinear sequences under
% the influence of recombination. The ``genome'' ARG (gARG) encoding
% is one way we can concretely
% define such recombinant ancestry, which we have shown is both
% flexible and efficient.
% The flexibility of the gARG encoding contrasts with the classical
% ``event'' ARG (eARG) encoding, which is more limited in what can be described.
% Importantly, gARGs do not require fully precise estimates of
% ancestral recombination events,
% and allow us to directly express important forms of temporal uncertainty.
% Fully decoupling the general concept of an ARG from the coalescent
% with recombination (henceforth, ``coalescent'')
% is an important step.
% While the coalescent has proven to be a useful and
% robust
% model~\citep{wakeley2012gene,bhaskar2014distortion,nelson2020accounting},
% many modern datasets have properties that grossly
% violate its assumptions.
% One key assumption is that
% sample size $n$ is much less than the effective population size, $N_e$.
% Several human datasets now consist of hundreds of thousands of
% genomes~\citep{turnbull2018hundred, bycroft2018genome,
% karczewski2020mutational,tanjo2021practical,
% halldorsson2022sequences},
% and so sample size is an order of magnitude \emph{larger} than the
% usually assumed $N_e$ values.
% Agricultural datasets are an even more extreme departure from this
% assumption, with hundreds of thousands of samples embedded in
% multi-generational pedigrees~\citep{hayes20191000,Ros-Freixedes2020}
% and effective population sizes of 100 and even
% less~\citep{MacLeod2013,Makanjuola2020,Hall2016,Porcnic2016}.
% A model assuming a single $N_e$ would be a
% drastic over-simplification of course, but
% even if sufficiently complex demographic models~\citep{gower2022demes}
% encompassing hundreds of populations, explosive growth rates and myriad
% interconnections of migration, were somehow estimated and provided as input,
% ARGs sampled from the coalescent cannot capture the complexities
% of family structure in these
% datasets~\citep[e.g.][]{turnbull2018hundred,Ros-Freixedes2020}.
% Another core assumption of the coalescent model is that the genome
% (or at least the region under study) is short enough that the number of extant
% lineages remains much smaller than $N_e$ at all times.
% High-quality whole genome assemblies are now available
% for many species % Can we say something more concrete here?
% and projects are under way to obtain
% them for tens of thousands more~\citep{darwin2022sequence,lewin2022earth},
% and so we can expect inferred ARGs to routinely span
% large fractions of a chromosome.
Tremendous progress has been made in recent years on the long-standing
problem of ARG inference, and there is now a range of practically
applicable methods available.
Methods targeting large-scale datasets tend to simplify
the inference problem by
making a single, deterministic
best-guess~\citep{kelleher2019inferring,speidel2019method,zhang2023biobank,zhan2023towards}
(but see \citealt{deng2024robust} for recent
developments in capturing uncertainty using a Bayesian framework, for relatively small
sample sizes).
Even under strict parsimony conditions and for small sample sizes, the
% Citation? This is something people have figured out, right?
number of plausible ARGs compatible with a given dataset is vast,
and it is therefore not clear that generating many guesses
when sample sizes are large will achieve much in terms of capturing uncertainty.
An alternative approach
is to incorporate uncertainty encountered during inference into the returned ARG.
The gARG encoding described here enables particular kinds of uncertainty
to be incorporated directly into the topology:
nodes that have more than two children (polytomies)
represent uncertainty over the ordering of coalescence events
(Appendix~\ref{sec-cell-lineages-and-args}),
and those that have more than two parents
represent uncertainty over the ordering of multiple recombination events
(Appendix~\ref{sec-ARG-simplification}).
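
To give a rough sense of how much ordering uncertainty a single polytomy can
summarise, recall the standard combinatorial result that the number of
distinct rooted binary tree topologies on $k$ labelled leaves is
\[
  (2k - 3)!! = 1 \times 3 \times 5 \times \cdots \times (2k - 3),
  \qquad k \geq 2.
\]
Treating the $k$ children of a polytomy as the leaves (an assumption made
here for illustration, counting topologies only and ignoring the timing of
events), a node with $k = 3$ children is compatible with $3$ binary
resolutions, one with $k = 4$ children with $15$, and one with $k = 5$
children with $105$.
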
Development of other methods to capture, for example,
uncertainty about node ages and recombination breakpoint positions, is an important
aspect of future work.
How this uncertainty can be utilised in downstream applications
is an open question.
% % Do downstream applications actually make use of such full ARGs?
%Besides the inherent limitations that exist on inferring fully
%precise ARGs from data,
%we should also consider the value that such exact estimates provide
%for downstream applications.
%Many applications work by examining local trees independently,
%making detailed information about recombination events superfluous.
%For example, the \relate\ selection test~\citep{speidel2019method}
%obtains $p$-values by computing clade size probabilities conditional
%on the timing of coalescence events in a given local tree.
%In their method
%for estimating dispersal rates and the locations of genetic
%ancestors,
%\cite{osmond2021estimating} downsample trees along the genome
%so that they can be regarded as approximately independent.
%Similarly, \cite{fan2023likelihood} compute the likelihood
%of an ARG under a particular demographic model as the product
%over a sample of widely-separated local trees, assumed to be independent.
%The SIA method for detecting selection~\citep{hejase2022deep}
%encodes local trees as a set of lineage counts at discrete
%time intervals, and uses these as feature for a
%type of machine learning algorithm
%that takes ``temporal'' correlations into account.
%Thus, while SIA takes advantage of information about local tree correlation,
%it is in quite an indirect way, and
%clearly much of the detail about recombination events in an ARG is lost.
% Mentioning tsdate here doesn't advance the
% narrative I think.
% The \texttt{tsdate} algorithm and related naive estimator
% of ancestral location explicitly use the gARG encoding
% to reason about ancestral nodes~\citep{wohns2022unified}.
% The main application for fully precise ARGs thus far has been
% to compute a likelihood under the
% coalescent~\citep[e.g.][]{kuhner2000maximum,mahmoudi2022bayesian,