-
Notifications
You must be signed in to change notification settings - Fork 8
/
reviewed-paper.tex
2521 lines (2364 loc) · 126 KB
/
reviewed-paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass{article}
\usepackage[letterpaper,top=2cm,bottom=2.5cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}
\usepackage[round]{natbib}
\usepackage{amsmath,amssymb,amsfonts}%
\usepackage{geometry}%
\usepackage{color}
\usepackage{graphicx}
\usepackage{authblk}
\usepackage{nameref}
\usepackage[right]{lineno}
\usepackage{subcaption}
\usepackage{tikz}
\usetikzlibrary{math,calc,positioning}
\usepackage{url}
\usepackage[symbol]{footmisc}
\usepackage[hidelinks]{hyperref}
\newcommand{\noderef}[1]{\textsf{#1}}
\newcommand{\tsinfer}[0]{\texttt{tsinfer}}
\newcommand{\kwarg}[0]{\texttt{KwARG}}
\newcommand{\argweaver}[0]{\texttt{ARGweaver}}
\newcommand{\argweaverD}[0]{\texttt{ARGweaver-D}}
\newcommand{\relate}[0]{\texttt{Relate}}
\newcommand{\espalier}[0]{\texttt{Espalier}}
\newcommand{\arbores}[0]{\texttt{Arbores}}
\begin{document}
\title{\vspace{-1.5em} \bf
A general and efficient representation of ancestral recombination graphs}
\author[1]{Yan~Wong}
\author[2,3$\star$]{Anastasia~Ignatieva}
\author[4,5$\star$]{Jere~Koskela}
\author[6]{Gregor~Gorjanc}
\author[7,8]{Anthony~W.~Wohns}
\author[1$\dagger$]{Jerome~Kelleher}
\affil[ ]{\mbox{}\vspace{-2.5em}}
\maketitle
\setlength{\skip\footins}{1em}
\setlength{\footnotemargin}{0.5em}
\renewcommand{\thefootnote}{\arabic{footnote}}
\footnotetext[1]{Big Data Institute, Li Ka Shing Centre for Health
Information and Discovery, University of Oxford, UK}
\footnotetext[2]{School of Mathematics and Statistics, University of Glasgow, UK}
\footnotetext[3]{Department of Statistics, University of Oxford, UK}
\footnotetext[4]{School of Mathematics, Statistics and Physics, Newcastle University, UK}
\footnotetext[5]{Department of Statistics, University of Warwick, UK}
\footnotetext[6]{The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK}
\footnotetext[7]{Broad Institute of MIT and Harvard, Cambridge, USA}
\footnotetext[8]{Department of Genetics, Stanford University School of Medicine, Stanford, USA}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\footnotetext[1]{Joint second author, listed alphabetically}
\footnotetext[2]{Correspondence: [email protected]}
\vspace{-1em}
\begin{abstract}
As a result of recombination, adjacent nucleotides can have
different paths of genetic inheritance and therefore
the genealogical trees for
% surely saying nucleotides and "sample of sequences" is concrete enough
% here? This paper isn't for a general audience, let's just say what
% we're talking about in direct language understandable by population
% geneticists.
a sample of DNA sequences vary along the genome.
The structure capturing the details of these intricately interwoven paths of
inheritance is referred to as an ancestral recombination graph (ARG).
New developments have made it possible to infer ARGs at scale,
enabling many new applications in population and statistical genetics.
This rapid progress, however, has led to a
% Controversial, perhaps, but true?
substantial gap opening between theory and practice.
Standard mathematical formalisms,
% "detailing" here is intended to convey the we keep details about these
% events like recomb breakpoint.
based on exhaustively detailing the ``events'' that occur
in the history of a sample,
are insufficient to describe the outputs of current methods.
Moreover, we argue that the underlying assumption
that all events can be known and precisely estimated
is fundamentally unsuited to the realities
of modern, population-scale datasets.
We propose an alternative mathematical formulation that
encompasses the outputs of recent methods and
can capture the full richness of modern large-scale datasets.
By defining this ARG encoding in terms of
% we haven't define a "genome" yet, so need to give a bit of context
specific genomes and their intervals of genetic inheritance,
we avoid the need to exhaustively list (and estimate)
\emph{all} events. The effects of multiple events can be
aggregated in different ways, providing a natural way to express many
forms of approximate and partial knowledge about the
recombinant ancestry of a sample.
\end{abstract}
\textbf{Keywords:} Ancestral recombination graphs
\linenumbers
\section{Introduction}
\label{sec-intro}
% Para 1: ARGs are cool!
% NOTE: slight inaccuracy of "DNA" here offset by making the problem concrete.
Estimating the genetic genealogy of a set of DNA sequences
under the influence of recombination,
usually known as an Ancestral Recombination Graph (ARG), is a long-standing
goal in genetics.
Broadly speaking, an ARG describes the different paths of genetic inheritance
caused by recombination, encapsulating the resulting complex web of genetic
ancestry
(see \citet{lewanski2023era} for a biologically oriented introduction).
Recent breakthroughs
in large-scale inference
methods~\citep{rasmussen2014genome,kelleher2019inferring,speidel2019method,
schaefer2021ancestral,wohns2022unified,zhang2023biobank,zhan2023towards}
have raised the realistic prospect of ARG-based analysis becoming a standard part
of the population and statistical genetics toolkit~\citep{hejase2020summary}.
Applications using inferred ARGs as input have begun to
appear~\citep{osmond2021estimating,
fan2022genealogical,
hejase2022deep,
guo2022recombination,
zhang2023biobank,
nowbandegani2023extremely,
ignatieva2023distribution,
fan2023likelihood}
and many more are sure to
follow~\citep{harris2019database,harris2023using}.
% Para 2: What are they, though?
Although it is widely accepted that ARGs are important, there is some
confusion about what, precisely, an ARG \emph{is}.
% The grammar is a nightmare here - the stochastic process is singular, right?
In its original form,
developed by Griffiths and colleagues,
the ARG is an alternative
formulation of the coalescent with recombination~\citep{hudson1983properties},
where the stochastic process of coalescence and recombination
among ancestral lineages is formalised
as a graph~\citep{griffiths1991two,
ethier1990two,griffiths1996ancestral,griffiths1997ancestral}.
% Move into plural, an ARG is an instance of the data structure in this view
Subsequently, an ARG has come to be thought of as a data
structure~\citep{minichiello2006mapping}, i.e.\ describing
a \emph{realisation} of such a random process,
or an inferred ancestry of a sample of genomes.
The distinction between stochastic process
and data structure is not clear cut, however, and subfields use the term
differently (Appendix~\ref{sec-arg-history}).
Other subtly different concepts that we must be careful to distinguish
are the ``true'' ARG, % I don't see a better word for "true" here
describing the actual history of a sample of genomes,
and a ``population'' ARG which is
the true ARG of every individual in a population.
True ARGs are the underlying real history of a sample,
perfectly resolved into binary splits and mergers by
the cellular processes of meiosis and mitosis, regardless
of sampling density or population processes
(Appendix~\ref{sec-cell-lineages-and-args}).
Although population-scale true ARGs unquestionably exist,
they can also never be entirely known, in part because of a fundamental
lack of mutational information.
Even if mutation rate were high enough to uniquely identify every
recent branching point,
such a high rate would saturate the genome with mutations
and obscure deeper history.
% Population ARGs are here, now.
Population ARGs may seem fanciful, but the scale of
modern datasets makes it necessary for us to grapple with the idea.
The UK Biobank (UKB), for example, has
genotype data for around 500,000 humans~\citep{bycroft2018genome},
along with exome~\citep{backman2021exome} and
whole genome sequence~\citep{halldorsson2022sequences} data
for large subsets of the cohort.
UKB is just one of many such population-scale sequencing
projects~\citep[e.g.][]{turnbull2018hundred,
karczewski2020mutational,tanjo2021practical}.
Agricultural datasets are on a similar scale,
and also include dense multi-generational sampling
and near-perfect pedigree
information~\citep[e.g.][]{hayes20191000,Ros-Freixedes2020}.
Recent advances have made it possible to actually
\emph{estimate} ARGs at population scale:
ARGs have already been inferred for the 500,000
humans in UKB~\citep{kelleher2019inferring,zhang2023biobank}
and over a million SARS-CoV-2 genomes~\citep{zhan2023towards}.
While this new population-scale reality presents many
exciting opportunities, it also poses substantial challenges
to existing methodologies.
% What's the problem?
A major problem currently facing the field is that
classical mathematical formalisms and terminology
cannot adequately describe these vast inferred ARGs.
Fundamentally, these formalisms assume that an
ARG is known in complete detail
and are not suited to describing partial or approximate knowledge.
As we are actively inferring ARGs at the population scale,
and such ARGs can never be known in complete detail,
there is currently a substantial gap between our theoretical frameworks
and practical application.
The breakthroughs in scale achieved by recent
methods~\citep[e.g.][]{kelleher2019inferring,speidel2019method,zhang2023biobank}
are all based, in different ways, on inferring
approximate \emph{structures}
instead of a complete and fully detailed history.
% This is a digression from the maing point of the paragraph, so
% making it a parenthetical aside.
(Note that it is important to distinguish here between
structures and models: whether an inference method is based
on heuristics or a rigorous mathematical model is orthogonal
to the level of detail provided in its estimate.
One could heuristically estimate a fully precise ARG,
% this would be a cool thing to do!
or statistically sample a partial,
approximate ARG under a model
such as the coalescent.)
Although the term ``ARG'' is now often used in a general
sense~\citep[e.g.][]{mathieson2020ancestry,hejase2020summary,
schaefer2021ancestral,harris2023using,zhang2023biobank,fan2023likelihood},
informally
encompassing the varied approximate structures output by modern simulation and
inference methods~\citep{rasmussen2014genome, palamara2016argon, haller2018tree,
kelleher2019inferring, speidel2019method, baumdicker2021efficient,
zhang2023biobank},
there is no corresponding mathematical definition
that is sufficiently general.
We address this problem by providing a simple formal definition of an
ARG data structure, based on recording the intervals of
genetic inheritance between specific genomes.
We call this the ``genome ARG'', or gARG encoding. We
contrast this with the classical formal definition of an ARG, based on
recording common ancestor and recombination events, which we refer to as the
``event ARG'' or eARG encoding. We show that the new gARG encoding is a
substantial generalisation of the classical eARG approach, providing much more
flexibility in how genetic inheritance can be represented, and encompasses the
outputs of modern methods. We show that the gARG approach can represent many
different types of approximation, in particular allowing us to systematically
describe uncertainty about the temporal ordering of multiple recombinations. It
is important to note that throughout we are interested in the details of these
competing mathematical formulations and their practical consequences.
% True ARGs, being perfectly resolved, can ultimately be described in
% either formalism
% This is nice to have, but probably won't work in the final version
% without section numbers. Let's deal with that when we get there I guess.
We begin in Section~\ref{sec-gARG} by providing a precise formal definition
of a gARG, illustrated by an example ARG embedded in pedigree.
We then provide a similar definition of the classical eARG
approach in Section~\ref{sec-eARG}, and consider its limitations
in the context of current datasets and research questions.
Following this, we discuss the important concept of ancestral material
in Section~\ref{sec-ancestral-material}, and how it relates
to the process of converting an eARG to a gARG.
We continue
in Section~\ref{sec-ARG-and-local-trees} by considering the relationship
between an ARG and its local trees. Contrary to the
prevailing view, we show that a suitably encoded sequence of local trees contains
precisely as much information as the corresponding ARG.
The gARG encoding opens a rich new set of details about ARGs,
including the ideas of locally unary nodes
(Section~\ref{sec-locally-unary-edges}),
the levels of detail that can be represented in an ARG
(Section~\ref{sec-ARG-simplification}),
and the degrees of precision about recombination
that can be stored and we may seek to infer (Section~\ref{sec-precision}).
These ideas have important practical considerations,
which we illustrate by examining the qualitative properties of
ARGs inferred by four recent methods for a classical benchmark
dataset in Section~\ref{sec-example-inferred-args}.
We then discuss how the gARG framework can be
efficiently implemented in Section~\ref{sec-efficiency},
and finish with an assessment of the key challenges facing
the field in the Discussion.
Finally, the literature on ARGs is large and
confusing, and we attempt to clarify some important aspects
in appendices, including
a brief history of ancestral graphs (Appendix~\ref{sec-arg-history}),
a description of the Big and Little ARG stochastic processes
(Appendix~\ref{sec-big-and-little-arg}),
a survey of ARG inference methods (Appendix~\ref{sec-survey-arg-infer}),
and a discussion of ARGs at an individual vs cell
lineage level (Appendix~\ref{sec-cell-lineages-and-args}).
\section{Genome ARGs}
\label{sec-gARG}
We define a genome as the complete set of genetic material that a child
inherits from one parent. A diploid individual
% Q: will someone nitpick that we're assuming diploid obligate sexual species?
therefore carries two genomes, one inherited from each parent (we assume diploids here
for clarity, but the definitions apply to organisms of arbitrary ploidy).
We will also use the term ``genome'' in its
more common sense of ``the genome'' of a species,
and hope that the distinction will be clear from the context.
We are not concerned here with mutational processes or observed sequences,
but consider only processes of inheritance,
following the standard practice in coalescent theory.
We also do not consider structural variation, and assume that all
samples and ancestors share the same genome coordinate space.
A genome ARG (gARG) is a directed acyclic graph in which nodes represent
haploid genomes and edges represent
genetic inheritance between an ancestor and a descendant.
The topology of a gARG specifies that genetic inheritance
occurred between particular
ancestors and descendants, but the graph connectivity
does not tell us which \emph{parts} of their genomes were inherited.
In order to capture the effects of recombination
we ``annotate'' the edges with the genome
coordinates over which inheritance occurred.
This is sufficient to describe the effects of inheritance under
any form of homologous recombination (such as multiple crossovers,
gene conversion events, and many forms of bacterial and viral recombination).
We can define a gARG formally as follows.
Let $N = \{1, \dots, n\}$ be the set of nodes representing
the genomes in the gARG,
and $S \subseteq N$ be the set of sampled genomes.
Then, $E$ is the set of edges, where each element
is a tuple $(c, p, I)$ such that $c, p \in N$ are the child and
parent nodes and $I$ is the set of disjoint genomic intervals
over which genome $c$ inherits from $p$.
Thus, each topological connection between
a parent and child node in the graph is annotated with a set of
inheritance intervals $I$.
Here, the terms parent and child are used in the graph sense;
these nodes respectively represent ancestor and descendant genomes,
which can be separated by multiple generations.
We will use these two sets of terms interchangeably.
How nodes are interpreted, exactly, is application dependent.
Following \citet{hudson1983properties}, we can view nodes
as representing gametes, or we can imagine them representing,
for example, the genomes present in cells immediately before or after
some instantaneous event (Appendix~\ref{sec-cell-lineages-and-args}).
A node can represent any genome along a chain of cell divisions
or can be interpreted as representing one of the genomes of a
potentially long-lived individual. All these interpretations
are potentially useful, and equally valid under the assumptions
of the gARG encoding.
In many settings, nodes are dated, i.e.\ each
node $u\in N$ is associated with a time $\tau_u$,
and how we assign precise times will vary by application.
The topological ordering defined by the directed graph structure
and an arrow of time (telling us which direction is pastwards)
is sufficient for many applications, however,
and we assume node dates are not known here.
% This should be obvious, but seems like it needs to be said
In practical settings, we will wish to associate additional
metadata with nodes such as sample identifiers or quality-control metrics.
It is therefore best to think of the
integers used here in the definition of a node as an \emph{identifier},
with which arbitrary additional information can be associated.
\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{illustrations/arg-in-pedigree}
\end{center}
\caption{\label{fig-arg-in-pedigree}
An example genome ARG (gARG) embedded in a pedigree.
(A) Diploid individuals (blue), visualised in a highly inbred pedigree and
labelled $D_1$ to $D_8$,
contain both paternal and maternal genomes
labelled \noderef{a} to \noderef{p}. Black lines show inheritance paths connecting
genomes in the current generation (\noderef{a} to \noderef{d}) with their ancestors.
Genomes \noderef{a} and \noderef{c} are the product of two independent
meioses (recombination events, in red) between
the paternal genomes \noderef{e}
and \noderef{f}, and regions of genome inherited are shown with shaded colour.
Genomes are shaded such that where, backwards in time,
they merge into a common ancestor, the merged region is darker.
(B) The corresponding gARG along with inheritance annotations on all edges
(partial inheritance in bold).
(C) The corresponding local trees.
}
\end{figure}
As illustrated in Fig.~\ref{fig-arg-in-pedigree},
the gARG for a given set of individuals is embedded in their pedigree.
The figure shows the pedigree of eight diploid individuals
and their sixteen constituent genomes (each consisting of a single chromosome),
along with paths of genetic inheritance.
Here, and throughout,
nodes are labelled with lowercase alphabetical letters
rather than integer identifiers to avoid confusion with genomic intervals.
Thus individual $D_1$ is composed
of genomes \noderef{a} and \noderef{b}, which are inherited from its
two parents $D_3$ and $D_4$. Each inherited genome may be the recombined product
of the two genomes belonging to an individual parent.
In this example,
genome \noderef{b} was inherited directly from $D_4$'s genome \noderef{g} without
recombination, whereas
genome \noderef{a} is the recombinant product of
$D_2$'s genomes \noderef{e} and \noderef{f} crossing over at position 2.
Specifically, genome \noderef{a} inherited the (half-closed)
interval $[0, 2)$ from genome \noderef{e} and $[2, 10)$ from genome \noderef{f}.
These intervals are shown attached to the corresponding graph edges.
The figure shows the annotated pedigree with realised inheritance of genomes
between generations (A), the corresponding gARG (B), and finally the corresponding
sequence of local trees along the
genome (C).
The local trees span the three genome regions delineated
by the two recombination breakpoints that gave rise to these genomes;
see Section~\ref{sec-ARG-and-local-trees} for details
on how local trees are embedded in an ARG.
% % I feel like sophisticated readers would at this point be saying,
% % ah, yeah, that's basically what an ARG is just in different words;
% % what's the point of this paper?
The genome ARG framework defined here is
in many ways simply a clarification of existing treatments
\cite[e.g.][]{mathieson2020ancestry,shipilina2023origin},
adding concrete details to describe the
differential inheritance of genetic material between genomes.
% % This is definitely needed for the Nicks, who are always thinking
% % about the deeper processes and are quite happy being loosey-goosey
% % with representation details
It is important to note that here, and throughout,
we are not questioning the form of the
actual ancestral processes that occur in nature, but rather how we
\emph{represent} the outcomes of such processes in a practical manner.
% Again, need to remind the reader that there's an actual point
% to all this getting stuck into the details.
These practical details,
as demonstrated in later sections, have important
consequences not only for how methods exchange information about
simulated and inferred ARGs, but more fundamentally in
how we set our goals for inference and evaluate the success of results.
\section{Event ARGs}
\label{sec-eARG}
% Trying to be totally, boringly, abundantly, blatantly clear here what we're
% talking about.
In this section we define the classical view of an ARG data structure,
and illustrate its limitations. We are
interested in the details of how ARGs are described mathematically,
and as a consequence, how they are represented in a practical sense
as the output of inference programs.
Where details of an ARG data structure (the encoding) are
% Three papers 10 years apart will do here, right?
provided~\citep[e.g.][]{
wiuf1999recombination,gusfield2014recombinatorics,hayman2023recoverability}
they follow the approach described by Griffiths and colleagues
(but see~\citet{parida2011minimal} and \citet{zhang2023biobank}
for notable exceptions), and a large number of
% Note the "an" here to avoid cases like ARGweaver
ARG inference methods use it as an output format
\citep[e.g.][]{song2004minimum,song2005efficient,rasmussen2014genome,
heine2018bridging,ignatieva2021kwarg}.
% Griffiths never mentions "sampling events", so let's not get
% into that. It's not an important detail, and we don't want to be
% overly precise about something that people haven't actually written
% down.
In this Griffiths encoding we have two types of internal node in the graph,
representing the common ancestor and recombination events
in the history of a sample.
At common ancestor nodes, the inbound lineages merge into a
single ancestral lineage with one parent, and at recombination
nodes a single lineage is split into two independent
ancestral lineages. Recombination nodes are annotated with
the corresponding crossover breakpoints, and these breakpoints
are used to construct the local trees.
This is done by tracing pastwards through the graph from the samples,
making decisions about which outbound edge to follow through
recombination nodes based on the breakpoint
position~\citep{griffiths1996ancestral}.
Because it is focused on recording events and their properties,
we will refer to this Griffiths encoding as the ``event ARG'' or eARG
encoding. Fig.~\ref{fig-event-arg} shows an example of a classical
eARG with three sample genomes (\noderef{a}, \noderef{b}, and \noderef{c}),
three common ancestor events (\noderef{e}, \noderef{f}, and \noderef{g})
and a single recombination event (node \noderef{d}) with a breakpoint
at position $x$.
Assigning a breakpoint to a recombination node is
not sufficient to uniquely define the local trees, and either
some additional ordering rules~\citep[e.g.][]{griffiths1996ancestral} or
explicit information~\citep[e.g.][]{gusfield2014recombinatorics,ignatieva2021kwarg}
is required to distinguish the left and right parents.
We assume in Fig.~\ref{fig-event-arg} that \noderef{d} inherits genetic material to the
left of $x$ from \noderef{e} and to the right of $x$ from \noderef{f}.
\begin{figure}
\centering
\tikzmath{\x1 =0; \x2=8;\xx=12.5; \x3=14; \xt=18;}
\begin{tikzpicture}[x=5mm, y=5mm, node distance=2mm and 20mm]
\tikzset{greynode/.style={circle,fill,inner sep=1},
nodelabel/.style={font=\footnotesize}}
\node [anchor=north west] at (\x1,6) {A};
\node [anchor=north west] at (\x2,6) {B};
% \node [anchor=north west] at (\xt,6) {C};
%%% (A) ARG
\node (s0) [greynode] at (\x1 + 0, 0) {};
\node (s1) [greynode] at (\x1 + 3, 0) {};
\node (s2) [greynode] at (\x1 + 6, 0) {};
\node (s3) [greynode] at (\x1 + 3, 1) {};
\node (s4) [greynode] at (\x1 + 1, 2) {};
\node (s5) [greynode] at (\x1 + 5, 3) {};
\node (s6) [greynode] at (\x1 + 3, 4) {};
\draw (s1) -- (s3);
\draw (s0) |- (s4);
\draw (s4) -- (\x1 + 2,2) |- (s3);
\draw (s4) |- (s6);
\draw (s3) -- (\x1 + 4,1) |- (s5);
\draw (s2) |- (s5);
\draw (s5) |- (s6);
%%% (B) Trees
\node (l0) [greynode] at (\x2 + 0, 0) {};
\node (l1) [greynode] at (\x2 + 2, 0) {};
\node (l2) [greynode] at (\x2 + 3, 0) {};
\node (l3) [greynode] at (\x2 + 1, 2) {};
\node (l4) [greynode] at (\x2 + 2, 4) {};
\draw (l0) |- (l3);
\draw (l1) |- (l3);
\draw (l2) |- (l4);
\draw (l3) |- (l4);
\node (r0) [greynode] at (\x3 + 0, 0) {};
\node (r1) [greynode] at (\x3 + 1, 0) {};
\node (r2) [greynode] at (\x3 + 3, 0) {};
\node (r3) [greynode] at (\x3 + 2, 3) {};
\node (r4) [greynode] at (\x3 + 1, 4) {};
\draw (r0) |- (r4);
\draw (r1) |- (r3);
\draw (r2) |- (r3);
\draw (r3) |- (r4);
\foreach \u/\lab in {
s0/$\noderef{a}$, s1/$\noderef{b}$, s2/$\noderef{c}$,
r0/$\noderef{a}$, r1/$\noderef{b}$, r2/$\noderef{c}$,
l0/$\noderef{a}$, l1/$\noderef{b}$, l2/$\noderef{c}$}
\node[nodelabel,anchor=south] at ([yshift=-12pt]\u) {\lab};
\foreach \u/\lab in {
s6/$\noderef{g}$,
l4/$\noderef{g}$,
r4/$\noderef{g}$}
\node[nodelabel,anchor=south] at (\u) {\lab};
\node [nodelabel,anchor=north west] at ($(s3) + (\x1 + 0,0)$) {$x$};
\foreach \u/\lab in {
s4/$\noderef{e}$,
l3/$\noderef{e}$}
\node[nodelabel,anchor=south west] at (\u) {\lab};
\foreach \u/\lab in {
s5/$\noderef{f}$,
r3/$\noderef{f}$}
\node[nodelabel,anchor=south east] at (\u) {\lab};
\foreach \u/\lab in {s3/$\noderef{d}$, s6/$\noderef{g}$}
\node[nodelabel,anchor=south] at (\u) {\lab};
\draw[dashed] (\xx,0) -- (\xx, 4);
\node[nodelabel,anchor=north] at (\xx,0) {$x$};
% %%% (C) Encoding
% \node [nodelabel,anchor=north west] at ($(\xt,5)$) {
% \begin{tabular}{c|c|l}
% % \multicolumn{2}{c}{Breakpoints}\\
% Node & Breakpoint & Parents\\
% \hline
% $\noderef{a}$ & $\varnothing$ & [\noderef{e}]\\
% $\noderef{b}$ & $\varnothing$ & [\noderef{d}] \\
% $\noderef{c}$ & $\varnothing$ & [\noderef{f}]\\
% $\noderef{d}$ & $x$ & [\noderef{e}, \noderef{f}] \\
% $\noderef{e}$ & $\varnothing$ & [\noderef{g}]\\
% $\noderef{f}$ & $\varnothing$ & [\noderef{g}]\\
% $\noderef{g}$ & $\varnothing$ & []\\
% \end{tabular}};
\end{tikzpicture}
\caption{\label{fig-event-arg}
A classical event ARG (eARG). (A) Standard graph depiction with
breakpoint $x$ associated with the recombination node \noderef{d}.
Nodes \noderef{e}, \noderef{f} and \noderef{g} are common ancestor events.
(B) Corresponding local trees to the left and right of breakpoint $x$
(note these are shown in the conventional form in which only coalescences
within the local tree are included; see Section~\ref{sec-ARG-and-local-trees}
for a discussion of this important point).
}
\end{figure}
% eARGs are representationally for limited interchange
While the Griffiths approach of annotating recombination nodes with a
breakpoint in an eARG is a concise and elegant way of describing realisations
of the coalescent, it is inherently limited when
implemented literally. The eARG encoding explicitly models only
two different types of event and thus anything that is not a single crossover
recombination or common ancestor event, must be incorporated
either in a roundabout way using these
events, or by adding new types of event to the encoding. For example, gene
conversion could be accommodated either by stipulating a third type of event
(annotated by two breakpoints and corresponding traversal conventions for
recovering the local trees) or by two recombination nodes joined by a
zero-length edge. From the perspective of practical interchange of data between
inference methods and downstream applications, both
workarounds are problematic, and the gARG encoding described in the previous
section offers a much simpler solution.
Aside from these obvious practical challenges arising from a
literal implementation of the Griffiths approach, there is a deeper
issue with the implicit strategy of basing an ARG data structure on
recording events and their properties (e.g.\ the crossover breakpoint
for a recombination event). The fundamental problem
is that this approach assumes all events are \emph{knowable}, and does not
provide any obvious mechanism for either aggregating multiple events
or expressing uncertainty about them. While this is not a
problem when describing the results of simulations (where all details
are perfectly known), it is a major issue when we wish to
formally describe the output of inference methods, particularly
as datasets approach the population scale.
As discussed in the introduction,
the precise details of all events in these vast ARGs can never be
known, and a data structure that \emph{enforces} complete
precision is therefore an impediment to progress.
There is also a certain clarity gained by explicitly modelling nodes
in the inheritance graph as genomes.
Outside of the context of a
% The point being, events are perfectly well defined in the models
% but not in reality
mathematical model, an ``event'' is a slippery concept.
For example, \emph{which} genome along a chain of cell divisions should be
regarded as the one where an event occurred,
or whether multiple coalescences
within a single individual should be regarded as one or multiple events are
debatable points (Appendix~\ref{sec-cell-lineages-and-args}).
% If we want to have an ARG software ecosystem then such fireside
% discussions won't help.
From the perspective of a concrete data structure,
ideally forming the basis of an ecosystem of interoperable
inference and analysis methods, such debates
are unproductive.
\section{Ancestral material and sample resolution}
\label{sec-ancestral-material}
Ancestral material~\citep{wiuf1999ancestry,wiuf1999recombination}
is a key concept in understanding the overall inheritance structure
of an ARG (here, and throughout, we use the general term ``ARG'' when
the details of the specific encoding are not important).
It denotes the genomic intervals ancestral to a set of samples
on the edges of an ARG.
For example, in Fig.~\ref{fig-arg-in-pedigree} we have
four sample genomes, \noderef{a}--\noderef{d}. As we
trace their genetic ancestry into the previous generation
(\noderef{e}--\noderef{h}), we can think of their ancestral
material propagating through the graph
backwards in time. In the region $[2, 7)$, there is a
local coalescence where nodes \noderef{a} and \noderef{c}
find a common ancestor in \noderef{f}. Thus, in this region,
the total number of genome segments that are ancestral to the
sample is reduced from four to three. Fig.~\ref{fig-arg-in-pedigree}A
illustrates this by (shaded) ancestral material being present
in only three nodes (\noderef{f}, \noderef{g}, and \noderef{h}) in this region,
while node \noderef{e} is blank
as it carries \emph{non-ancestral} material.
This process of local coalescence continues through the
graph, until all samples reach their most recent common
ancestor in node \noderef{n}.
The process of tracking local coalescences and updating
segments of ancestral material is a core element of
Hudson's seminal simulation
algorithm~\citep{hudson1983testing,kelleher2016efficient},
and the key distinguishing feature between the
``Big'' and ``Little'' ARG stochastic processes
(see Appendix~\ref{sec-big-and-little-arg}).
The ability to \emph{store} resolved ancestral material
is also a key distinction between the eARG and gARG
encodings. Because an eARG stores only the graph topology and
recombination breakpoints, there is no way to locally
ascertain ancestral material without traversing the graph
pastwards from the sample nodes,
resolving the effects of recombination and common ancestor events.
Efficiently propagating and resolving ancestral material for
a sample through a pre-existing graph is a well-studied problem,
and central to recent advances in individual-based forward-time
simulations~\citep{kelleher2018efficient,haller2018tree}.
In contrast to the usual ``retrospective'' view of ARGs
discussed so far, these methods record an ARG forwards in
time in a ``prospective'' manner. Genetic inheritance relationships
and mutations are recorded exhaustively, generation-by-generation,
leading to a rapid build-up of information, much of which
will not be relevant to the genetic ancestry of the current population.
This redundancy is periodically removed using the ``simplify''
algorithm~\citep{kelleher2018efficient}, which propagates and
resolves ancestral material.
Efficient simplification is the key enabling factor for
this prospective-ARG based approach to forward-time simulation,
which can be orders of magnitude faster than standard
sequence-based methods
(see Section~\ref{sec-ARG-simplification} for
other applications of ARG simplification).
% Define "sample-resolved" here for later use
We refer to a gARG that has been simplified with respect to a set of
samples, such that the inheritance annotations on its edges contain
no non-ancestral material, as sample-resolved.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{illustrations/ancestry-resolution}
\caption{\label{fig-ancestry-resolution}
Converting the \citet[][Fig.~1]{wiuf1999recombination} example
to a sample-resolved gARG. (A) The original eARG; square nodes
represent sampling (black), common ancestor (blue), and recombination (red) events;
the latter contain breakpoint positions.
(B) The corresponding gARG with breakpoints directly converted to
edges annotated with inheritance intervals.
(C) The sample-resolved gARG resulting from simplifying with respect
to the sample genomes, \noderef{a}, \noderef{b}, and \noderef{c}.
Dashed lines show edges that are
no longer present (in practice, nodes \noderef{g}, \noderef{j}, and \noderef{q} would also be removed).
Coalescence with respect to the sample is indicated by shaded bars, as
in Fig.~\ref{fig-arg-in-pedigree}A; nodes \noderef{n}, \noderef{o}, \noderef{p}, \noderef{q} have truncated
bars showing that local ancestry of entirely coalesced regions is omitted.
Line thickness is proportional to the genomic span of each edge.
Nodes representing recombination events are retained
for clarity, but could be removed by simplification if
desired.
}
\end{figure}
Any eARG can be converted to a sample-resolved gARG
via a two-step process illustrated in Fig.~\ref{fig-ancestry-resolution}.
The first step is to take the input eARG (Fig.~\ref{fig-ancestry-resolution}A),
duplicate its graph topology, and then add inheritance annotations
to each of the gARG's edges (Fig.~\ref{fig-ancestry-resolution}B) as follows.
If a given node is a common ancestor event, we annotate the single
outbound edge with the interval $[0,L)$, for a genome of length $L$. If the
node is a recombination event with a breakpoint $x$, we annotate the two
outbound edges respectively with the intervals $[0, x)$ and $[x, L)$. These
inheritance interval annotations are clearly in one-to-one correspondence with
the information in the input eARG. They are also analogous to the
inheritance intervals we get on the edges in a prospective gARG
produced by a forward-time simulation, which are concerned with recording
the direct genetic relationship between a parent and child genome and are not
necessarily minimal in terms of the resolved ancestral material of a sample.
Thus, the final step is to use the ``simplify'' algorithm to perform the required
sample resolution (Fig.~\ref{fig-ancestry-resolution}C).
The sample-resolved gARG of Fig.~\ref{fig-ancestry-resolution}C
differs in some important ways to the
original eARG (Fig.~\ref{fig-ancestry-resolution}A).
Firstly, we can see that some nodes and edges have been removed entirely
from the graph.
The ``grand MRCA'' \noderef{q} is omitted from the
sample-resolved gARG because all segments of the genome have
fully coalesced in \noderef{k} and \noderef{p} before \noderef{q} is reached.
Likewise, the edge
between \noderef{g} and \noderef{j} is omitted because the recombination
event at position $5$ (represented by node \noderef{g})
fell in non-ancestral material.
More generally, we can see that the sample resolved
gARG of Fig.~\ref{fig-ancestry-resolution}C
allows for ``local'' inspection
of an ARG in a way that is not possible in an eARG.
Because the ancestral material is stored with each edge of a gARG, the
cumulative effects of events over time can be reasoned
about, without first ``replaying'' those events.
Many computations
that we wish to perform on an ARG will require resolving
the ancestral material with respect to a sample.
% this is weak, but would be good to emphasise this point.
% you just end up running simplify lots of times, in your
% downstream algs
The gARG encoding
allows us to perform this once
and to store the result,
whereas the eARG encoding requires us to repeat the process
each time.
Note that the \citet{wiuf1999recombination} eARG
in Fig.~\ref{fig-ancestry-resolution} is not particularly
representative, because inference or simulation methods usually
only generate ARGs containing nodes and edges ancestral to the sample
(but see the discussion of the ``Big ARG'' stochastic process in
Appendix~\ref{sec-big-and-little-arg}).
Nonetheless, it is an instructive example from the literature which highlights several
important properties of ARGs, and the general point about
the need to resolve ancestral material ``on the fly'' for eARG traversals
holds.
\section{ARGs and local trees}
\label{sec-ARG-and-local-trees}
% (We use ``local'' rather than
% ``marginal''
% % could list loads here, but not much point.
% trees~\citep[e.g.][]{griffiths1996ancestral,minichiello2006mapping,kelleher2016efficient}
% to avoid the potential confusion with the
% statistical concept of marginal distributions.)
The relationship between an ARG and its corresponding
local trees is subtle and important.
A fundamental property of genetics is that a
given DNA nucleotide is inherited from exactly one parent genome,
both at an organismal and cell-by-cell level
(Appendix~\ref{sec-cell-lineages-and-args}).
These paths of single-parent inheritance give rise,
by definition, to a tree structure.
As a result of recombination, adjacent nucleotides can have
different paths of inheritance, and
an ARG encodes the entire ensemble of local trees along the
% Said "sample of genomes" originally, but using term "genome" in two
% different ways in the same sentence then.
genome for a given set of sample nodes.
% This is for Yan - to me this is a given, but maybe some people feel
% the following is overly formal?
Precisely defining the process by which local trees are extracted
from an ARG is essential to our understanding of how ARGs and local trees
are related, and we require a concrete mathematical structure
to describe the local trees. It is important to note that
although the following discussion is phrased in terms of the
gARG encoding, the arguments apply equally to eARGs
because any eARG can be converted to a
% Note, being sample-resolve is not necessary for following argument so omitting
gARG without loss of information (Section~\ref{sec-ancestral-material}).
Oriented trees provide a
convenient formalism to capture these parent-child relationships
in a well-defined combinatorial object.
% There's no need for a comma
Let $\pi_1\dots\pi_n$ be a sequence of integers, such that $\pi_u$
denotes the parent of node $u$, and $\pi_u = 0$ if $u$ is a
root~\cite[p.\ 461]{knuth11combinatorial}.
This encoding is particularly useful to describe
evolutionary trees because parent-child relationships are
important but the ordering of children at a node is
not~\citep{kelleher2013coalescent,kelleher2014coalescent,
kelleher2016efficient}.
Thus, for a given gARG with nodes $\{1, \dots, n\}$ and
edges $E$ (Section~\ref{sec-gARG}), we recover the local tree
at position $x$ as follows.
We begin by setting $\pi_u = 0$ for each $1 \leq u \leq n$.
Then, for each sample node in $S$ we trace its path pastwards through the
ARG for position $x$, and record this path in $\pi$.
Specifically, at a given node $u$,
we find an edge $(c, p, I) \in E$ such that $u = c$ and $x \in I$, and set
$\pi_c \leftarrow p$. We then set $u \leftarrow p$, and repeat
until either $\pi_u \neq 0$ (indicating we have traversed this section
of the ARG already on the path from another sample) or there
is no matching outbound edge (indicating we are at a root).
Note that the local trees for an ARG are ``sparse''~\citep{kelleher2016efficient},
because many ancestral nodes will not be reachable from the
samples at a given position (so their corresponding entries in $\pi$ will be zero).
This combinatorial approach provides at least
one novel insight, clarifying the fundamental relationship between
ARGs and local trees.
Suppose we are given a gARG defined by a set of nodes and edges.
There is no requirement on the structure of this ARG beyond
the basic definitions: it could correspond to an ARG in which
every recombination event is exactly specified
(e.g.\ Fig.~\ref{fig-ancestry-resolution})
or one in which local trees are entirely disjoint
(i.e.\ only the sample nodes are shared between them).
If we are given the sequence of local trees for this gARG
encoded as an oriented tree, along with the genome interval
covered by each tree, we can recover the original gARG exactly.
% We do this site-by-site for simplicity, but hopefully it's obvious
% that this isn't necessary?
More formally, suppose we are given the local tree $\pi^x_1\dots\pi^x_n$
for each nucleotide position $1 \leq x \leq L$ on a genome of length $L$.
Then, the edges of the ``local ARG'' for this tree is given by
$E^x = \{(u, \pi_u^x, \{x\}) \mid \pi_u^x \neq 0\}$. Because the ARG
edges are defined by $(c, p, I)$ tuples, where the set $I$ defines
the positions over which node $c$ inherits from parent $p$, we can
then simply combine the ``local ARGs'' for each position $x$
% YW: The above "combining" operation isn't obviously lossless to me:
% we could potentially lose information in the different combinations
% of intervals (e.g. if we use 2 non-combined intervals to represent a
% diamond in a 1-RE-node ARG). It seems like we might need an additional
% gARG spec to define what combinations of intervals are equivalent.
% But no-one else has commented on this, so I guess we just let it lie.
to recover precisely the same set of edges as the original ARG.
Thus, under this definition, there
is a one-to-one correspondence between an ARG and
the sequence of local trees that it encodes.
This is not the prevailing view, however.
\cite{kuhner2017consensus} argue that the
``interval-tree'' representation of
an ARG (the local trees and the genome intervals they cover)
``does not contain all of the information in the underlying ARG: it lacks the
number of recombinations occurring at each site, the times at which
recombinations occurred, and the specific sequences involved as recombination
partners.''
\cite{shipilina2023origin} discuss the same ideas,
and note that the
``full ARG...~contains more information than the series of tree
sequences along the genome''.
These
% % Could cite some more? E.g. KwARG paper says
% % "We note that ARGs contain more information than local trees"
% % Would need a few more examples though, which would require some digging.
%
% and similar, \citep[e.g.][]{}
statements that an ARG
contains more information than its local trees are true
if we represent local trees in their conventional forms,
but these forms discard important information that is
available in an ARG.
There are two properties of how evolutionary trees are conventionally
represented that lead to this
disagreement about the relationship between local trees and an ARG.
Firstly, the internal nodes of evolutionary trees are usually considered to be
\emph{unlabelled}, or equivalently, labelled by the leaves which they subtend.
The same canonical labelling cannot be used for internal ARG nodes because the
leaves they subtend will typically vary by genomic position. If we do not label
the tree nodes in a way which is persistent across the sequence of local trees
in the ARG, we lose the fact that the \emph{same} ancestors sometimes persist
across multiple trees.
Defining ARG nodes as integers and using the oriented
tree encoding explicitly labels internal nodes, and makes the relationship
between tree and ARG nodes clear and precise.
% Note: good to keep to paras here to logically separate the two important
% points.
The second property of how evolutionary trees
are conventionally represented that is unhelpful in the context of ARGs is their
focus on branching points (coalescences), i.e.\ nodes that have two or more children.
As the introductory paragraph of this section emphasised,
parent-child relationships are what fundamentally define a tree,
and branching points can be seen as incidental. This is reflected
by the oriented tree encoding which simply stores the local
parent-child relationships, and does not, for example,
directly tell us how many children a particular node has.
The local tree at a given position records the \emph{path} through
the ARG; if this path omits nodes that are not
branching points (such as \noderef{e} in Fig.~\ref{fig-arg-in-pedigree}),
we lose information about the ARG.
We return to this point in the following two sections,
when we discuss ``locally unary'' nodes and the simplification process.
It is important that we make the distinction here between the local
trees that we can derive from a known ARG (as just discussed),
and an ARG that we can derive from a sequence of \emph{estimated}
local trees.
The ARG inference method
\espalier~\citep{rasmussen2022espalier} is illustrative in this context.
It begins by splitting an input sequence alignment into
segments that are assumed to be non-recombining. Within
each segment, an initial local tree is estimated using
standard phylogenetic methods. By necessity, these local
trees will contain internal nodes that are unlabelled and
consist only of branching points: there is no information
shared between the independent tree estimation steps
across segments. Part of the task of stitching
these trees together
into an ARG is then, essentially, to generate labels for
the internal nodes, and decide which nodes persist
across multiple local trees.
\espalier\ approaches this task
by identifying maximal subtrees
that do not change between pairs of adjacent
local trees and then
heuristically exploring the space of possible
rearrangements of these subtrees.
To derive details about recombination events,
\espalier\ then attempts to infer
the precise subtree prune-and-regraft (SPR)
operations~\citep{hein1990reconstructing,song2003on,song2006properties}
induced by recombination between these partially reconciled local trees.
% NOTE: dropping this because it doesn't clarify the actual point.
% A similar problem is tackled by \arbores~\citep{heine2018bridging}, which
% explores a space of ARGs by fixing a starting ARG, and then repeatedly erasing
% and resimulating candidate SPR moves connecting two fixed, leaf-labelled local
% trees.
Inferring the SPRs between leaf-labelled trees is
NP-hard~\citep{hein1996complexity,allen2001subtree,bordewich2005computational},
but it is unclear what the complexity is when there
is a degree of internal node sharing between trees.
% I don't really know if this is true, it's just a way of wrapping the
% section up by saying that other things we thought were settled might
% be worth thinking about again.
The combinatorial formulation of ARGs and local trees provided here
may help clarify these fundamental questions.
\section{Locally unary nodes}
\label{sec-locally-unary-edges}
As discussed in the previous section, the local tree at a given position
$x$ is best seen as the path through the ARG at that position, defined
by the oriented tree $\pi^x_1\dots\pi^x_n$. This path does not directly
contain information about branching points, and defining a
node's arity (number of child nodes) is therefore useful.
The ``local arity'' of a node is the number of children it has
in the local tree at position $x$, i.e., $a^x_u = |\{v : \pi^x_v = u\}|$
for each $1 \leq u \leq n$. The ``ARG arity'' of a node $u$ is the