-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathrehh.Rmd
1154 lines (875 loc) · 84.3 KB
/
rehh.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Vignette for package *rehh*"
author: "Alexander Klassmann, Renaud Vitalis and Mathieu Gautier"
date: "`r Sys.Date()`"
output:
bookdown::html_document2:
base_format: rmarkdown::html_vignette
toc: yes
bookdown::pdf_document2:
toc: yes
fig_caption: yes
number_sections: yes
fontsize: 12 pt
urlcolor: blue
bibliography: vignette.bib
csl: genetics.csl
header-includes:
- \numberwithin{equation}{section}
vignette: >
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
%\VignetteIndexEntry{Vignette for package *rehh*}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = ">",fig.height = 4.5, fig.width = 4.5, fig.show = "hold")
```
\clearpage
# About the package
This vignette describes comprehensively how the R package *rehh* can be applied to perform whole genome scans for footprints of selection using statistics related to *Extended Haplotype Homozygosity (EHH)* [@Sabeti2002]. The vignette *Examples in detail* explains basic usage and methodology with the help of two tiny artificial data sets.
The package accepts multi-allelic genetic markers as input. Typically, albeit not necessarily, these will be bi-allelic SNPs.
The package is available for Linux, Windows and MacOS X from the CRAN repository https://cran.r-project.org/package=rehh and may be installed using a standard procedure. Once the package has been successfully installed, it can be loaded by:
```{r library, message = FALSE}
library(rehh)
```
## Background
The analysis of molecular population genetic data often comprises the search for genomic regions that might have experienced recent selection. Diverse approaches have been developed; for reviews on methodology see [@Sabeti2006], [@Oleksyk2010] and [@Vitti2013] and for
practical advice [@Cadzow2014], [@Utsunomiya2015] and [@Weigand2018]. However only a few have found wide-spread application [@Haasl2016]. To the latter belong *iHS* [@Voight2006], *Rsb* [@Tang2007] and *XP-EHH* [@Sabeti2007], all of which are *summary statistics* aimed to distill a certain aspect of the genetic data into a single score and constructed in a way that extreme values are indicative of positive or "Darwinian" selection. *iHS* is intended for application on a single (presumably homogeneous) population, while *XP-EHH* and *Rsb* are targeted to differential selection between two populations. All three statistics are based on the concept of *Extended Haplotype Homozygosity (EHH)* as formulated by [@Sabeti2002].
*iHS* and *XP-EHH* can be calculated by the independent C++ command line tool *Hapbin* [@Maclean2015], which has been optimized for speed by exploiting bit-wise machine-level operations. The package *rehh* cannot compete on performance, but has the advantage of being able to work with multi-allelic markers and missing values. Moreover, it possesses a broader range of input and output options, including several graphical representations.
## Changes between versions 2.X and 3.X
Due to changes in the API, although mostly small, the versions 3.X are not compatible with versions 2.X.
Data objects of class `haplohh` (see below) created by versions up to 2.0.4 must be updated via the command `update_haplohh()` (see its documentation by typing `?update_haplohh`) in order to be accepted by the functions of the current version.
## Example files
For illustration purposes, several example input files as well as R objects are provided.
### Input files
- Two tiny invented examples, each in our "standard" haplotype format (see section \@ref(input)) and in *variant call format* (*vcf*). The first example was used in [@Gautier2017] for the explanation of the various statistics calculated by this package. The second example is an extension of the former including multi-allelic markers and missing values. Both sets are discussed in depth in a second vignette.
- Further three tiny examples used for the supplement on Site Frequency Spectrum-based methods of [@Klassmann2020].
- An output file of the program *ms* containing two small simulated haplotype samples.
- A data set in various formats that originated from a study on the "Creole cattle breed from Guadeloupe" (CGU) [@Gautier2011]. All files contain the same set of phased SNPs of *Bos taurus* chromosome 12 from 140 individuals.
All of the above files are copied into the current working directory via the command
```{r make_examples, results = 'hide'}
make.example.files()
```
**Throughout the vignette, this command is assumed to have been run!**
### R objects
- The data set for chromosome 12 of the cattle study mentioned above as object of the *rehh* data class `haplohh`. This object becomes available by the command `data(haplohh_cgu_bta12)`.
- The scores *iHH* and *iES* as obtained by the function `scan_hh()` applied on SNPs of the whole genome for the population CGU (defined above) and another population EUT ("European taurine"). They reside in the accompanying package `rehh.data` and are obtained by `library(rehh.data)` followed by `data(wgscan.cgu)` resp. `data(wgscan.eut)`.
## Terminology
Depending on the context, we refer as *chromosomes* either to the different chromosomes of a species' reference sequence (e.g. "chr1", "chr2", "chrX", etc.) or to the homologous chromosomes from different individuals (e.g. 280 times "chr12" from 140 diploid individuals).
The sequence of alleles on a chromosome is referred to as its *haplotype* and sometimes we use it interchangeably with *chromosome*. However, if we speak of a *shared haplotype*, the term is meant more abstractly for the unique sequence of alleles that a group of chromosomes share at each position within a certain region.
## Overview
The package calculates three statistics which can be used to perform whole-genome scans for selection: *iHS*, *XP-EHH* and *Rsb*. *iHS* compares alleles within a single population while the other two compare sites between populations. The calculation proceeds technically in five steps which are performed by running two commands and combining tables:
- each marker is taken in turn as a "focal marker" around which the extended haplotype homozygosity is computed at further markers in increasing distance to the focal marker up to some stop criterion
- for each focal marker these quantities are "integrated" over the surrounding markers
- these integrals have to be calculated for each chromosome separately and the resulting tables to be combined to yields whole-genome statistics
- at each focal position, two such integrals are compared (either from two alleles or from two populations) by taking their log ratio
- the distribution of these log-ratios is standardized
Here is a minimal code example for calculating *iHS* on a single chromosome:
```{r minimalcodeexample, fig.align = 'center', results = "hide"}
hh <- # data input
data2haplohh(
hap_file = "bta12_cgu.hap",
map_file = "map.inp",
chr.name = "12",
allele_coding = "map"
)
scan <- scan_hh(hh) # calculation of EHH and integration
# (combine results from different chromosomes)
ihs <- ihh2ihs(scan) # log ratio for alleles and standardization
manhattanplot(ihs) # plot of the statistics
```
# Data input {#input}
The package *rehh* requires as input:
- a haplotype data file (see section \@ref(hapfile)).
and, if the haplotype data file is neither in *variant call format* nor in the format of *ms* output,
- a marker information file (see section \@ref(mapfile)).
Since R can automatically recognize and read compressed files with standard name extensions like *.gz* or *.bzip2*, this holds as well for the input files that can be used with this package.
## Haplotype data file {#hapfile}
Five haplotype input file formats are supported:
- a *standard* haplotype format. Each row represents a haplotype with marker genotype in columns as in the example file `bta12_cgu.hap` containing 280 haplotypes with 1424 SNPs each (see section \@ref(LoadDataEx1)). The first element of each row is taken as a haplotype identifier.
- a *transposed* format with haplotypes in columns and markers in rows as in the example file `bta12_cgu.thap`. This format is similar to the one produced by the phasing program *SHAPEIT2* [@OConnell2014]. This format assumes neither row nor column names and hence no haplotype identifiers can be specified (see section \@ref(LoadDataEx2)).
- the output file format from the phasing program *fastPHASE* [@Scheet2006] as in the `bta12_hapguess_switch.out` example file. Note that this file format allows to include haplotypes from several populations (if the -u *fastPHASE* option was used) (see section \@ref(LoadDataEx3)).
- *variant call format (vcf)*, comprising both haplotype and marker information. In order to read files in this format, the package [vcfR](https://cran.r-project.org/package=vcfR) or the packages [data.table](https://cran.r-project.org/package=data.table) and [R.utils](https://cran.r-project.org/package=R.utils) (the latter is needed for compressed files) have to be installed (see section \@ref(LoadDataEx4)).
- The output of the simulation program *ms* [@Hudson2002] and its derivatives *msHOT* [@Hellenthal2007], *ms^2^* [@Ewing2010] and *ms'* [@Kelleher2016]. In order to read these files, the package [gap](https://cran.r-project.org/package=gap) has to be installed (see section \@ref(LoadDataEx5)).
Alleles in standard or transposed haplotype format can be provided either coded (by integer numbers) or without coding (e.g. as nucleotides) (see section \@ref(LoadData)).
If alleles are not polarized, i.e. their ancestral/derived status is not known, some arbitrary assignation can be done either beforehand by the user or during input by the package; appropriate parameters must be set in subsequent functions (see section \@ref(polarizing)).
## Marker information file {#mapfile}
For haplotype data files in standard, transposed or fastphase format a marker information file has to be provided. The file should contain columns without header corresponding to:
1. marker name/identifier
2. name of the chromosome (or scaffold)
3. position on the chromosome (or scaffold), either physical (in bases) or genetic (e.g. in centiMorgans).
**For a given chromosome, the markers must be ordered in the same way as in the haplotype data file.**
If alleles in the haplotype file are not already provided in coded form (i.e. as positive integers), the package can (re)code them using two additional columns in the marker file:
4. ancestral allele
5. derived allele(s).
For instance, the provided file `map.inp` has five columns:
```{r map_inp}
# show first 6 lines of file
cat(readLines("map.inp", n = 6), sep = "\n")
```
## Loading data files: the function `data2haplohh` {#LoadData}
The `data2haplohh()` function converts data files into an R object of class `haplohh`, forming the basis for subsequent analysis. The following options are available to recode alleles and/or to filter markers and haplotypes:
**1. Allele coding options**:
The parameter `allele_coding` has four options: three of them expect the user to provide a coding, while the fourth option overwrites any user defined coding.
The package accepts two different integer codings within the haplotype file which can be specified as `"12"` or `"01"`, respectively:
| | "12" | "01" |
|-----------------|:----:|:----------:|
|Missing value | 0 | `NA` or . |
|Ancestral allele | 1 | 0 |
|Derived allele 1 | 2 | 1 |
|Derived allele 2 | 3 | 2 |
|... | ... | ... |
The latter one is also the coding used internally by our package.
If the alleles are given in an un-coded form (e.g. as nucleotides), the alleles can be coded for each marker separately with help of the marker information file. This option is activated by `allele_coding = "map"`. If an allele of a marker is found in the fourth column of the marker information file, it is re-coded as 0 (ancestral) and if it is found in the fifth column it is re-coded as 1 (derived). The fifth column may contain several alleles, separated by commas, like in the *variant call format*. In this case, the internal coding extends to higher integers. If an allele present in the haplotype file is not found neither in the fourth nor the fifth column of the marker information file, it is counted as `NA` (missing value).
If the alleles are not polarized, i.e. it is unknown which allele is ancestral and which derived, the option `"none"` should be selected. The alleles will then be coded in alpha-numeric order for each marker; the first becomes 0, the second 1, etc. Missing values must be given as `NA` or '.' while any other number, character or string will be counted as an allele. Options of subsequent analysis functions have to be set accordingly (see section \@ref(polarizing)).
**2. Filtering options**
2.1 **Discard haplotypes with a high amount of missing data**. If haplotypes contain missing data (presumably a rare case since most phasing programs allow imputing missing genotypes), it is possible to discard those with a fraction of less than `min_perc_geno.hap` of markers genotyped. Its default is `NA`, hence no restriction.
2.2. **Discard markers with a high amount of missing data**. It is possible to discard individual markers genotyped on a fraction of less than `min_perc_geno.mrk` of haplotypes. By default `min_perc_geno.mrk = 100` meaning that only fully genotyped markers are retained. A value of `NA` or zero is not allowed since *rehh* cannot handle markers with no data.
2.3 **Discard markers with a low Minor Allele Frequency**. It is possible to discard markers with a Minor Allele Frequency (MAF) equal to or below `min_maf`. Its default is `NA` meaning no constraint. Setting this value to zero eliminates monomorphic markers. Note that since low-frequency alleles usually abound, any positive value may exclude a substantial fraction of the data.
The arguments `min_perc_geno.hap`, `min_perc_geno.mrk` and `min_maf` are evaluated in this order.
More details about input options can be found in the documentation using the command:
```{r eval = FALSE}
?data2haplohh
```
### Example 1: reading haplotype file in standard format {#LoadDataEx1}
The following command converts the haplotype input file `bta12_cgu.hap` (in "standard" haplotype format) and marker information input file `map.inp` to an `haplohh` object named `hh`. Because the map file contains information for markers mapping to multiple chromosomes we have to specify by `chr.name = 12` that the haplotype input file is about chromosome 12. Alleles are given as nucleotides in the haplotype file and coded with help of the map file by setting `allele_coding = "map"`.
```{r example1}
hh <- data2haplohh(hap_file = "bta12_cgu.hap",
map_file = "map.inp",
chr.name = 12,
allele_coding = "map")
```
If no value is specified for `chr.name` and more than one chromosome is detected in the map file, an error is produced:
```{r error = TRUE}
hh <- data2haplohh(hap_file = "bta12_cgu.hap",
map_file = "map.inp",
allele_coding = "map")
```
Furthermore, the following message is prompted if the number of markers for the chromosome in the info file does not correspond to the one in the haplotype file (for instance when a wrong chromosome is specified):
```{r error = TRUE}
hh <- data2haplohh(hap_file = "bta12_cgu.hap",
map_file = "map.inp",
chr.name = 18,
allele_coding = "map")
```
### Example 2: reading haplotype file in transposed format (*SHAPEIT2*--like) {#LoadDataEx2}
If the haplotype input file is in "transposed" format (like `bta12_cgu.thap`), the option `haplotype.in.columns` has to be set to `TRUE` while all other parameters remain unaffected with respect to example 1. This is the only format which is not recognized automatically, but has to be explicitly declared by the user.
```{r example2}
hh <- data2haplohh(hap_file = "bta12_cgu.thap",
map_file = "map.inp",
chr.name = 12,
allele_coding = "map",
haplotype.in.columns = TRUE)
```
### Example 3: reading haplotype file in *fastPHASE* output format {#LoadDataEx3}
In this example, we use the *fastPHASE* output `bta12_hapguess_switch.out` as haplotype file. This format is recognized automatically. As before, we need to set `chr.name = 12`.
Because haplotypes originated here from several populations (the *fastPHASE* option `-u` was used), it is necessary to select the population of interest (in our case "CGU")
by its corresponding number in the file, hence `popsel = 7`.
```{r example3}
hh <- data2haplohh(hap_file = "bta12_hapguess_switch.out",
map_file = "map.inp",
chr.name = 12,
popsel = 7,
allele_coding = "map")
```
If no value is specified for the `popsel` argument and more than one population is detected in the *fastPHASE* output file, an error is produced and the available population numbers printed:
```{r error = TRUE}
hh <- data2haplohh(hap_file = "bta12_hapguess_switch.out",
map_file = "map.inp",
chr.name = 12,
allele_coding = "map")
```
### Example 4: reading vcf files {#LoadDataEx4}
The function `data2haplohh()` checks automatically whether the specified haplotype input file is in *variant call format (vcf)*. If this is the case, the parameters `map_file` and `allele_coding` are ignored. By default, the function tries to polarize (see section \@ref(polarizing)) the alleles of each marker using the ancestral allele, expected to be given by key `AA` of the `INFO` field.
Ancestral alleles are sometimes marked by upper case as "high confident" and by lower case as "low confident". The default setting `capitalize_AA = TRUE` lifts this distinction before polarization.
If the `AA` key is absent, the option `polarize_vcf` should be set to `FALSE` and the allele coding of the *vcf* file is directly used as internal coding.
If there is data for more than one chromosome in the file, the chromosome of interest has to be specified by `chr.name`. Since always the whole file is read in, it is advisable to split large data sets into chromosome-specific files.
In order to process *vcf* files, the package [vcfR](https://cran.r-project.org/package=vcfR) or the package
[data.table](https://cran.r-project.org/package=data.table) (which in turn needs [R.utils](https://cran.r-project.org/package=R.utils) to read compressed files) have to be installed. The parameter `vcf_reader` has to be set to either `"vcfR"` or `"data.table"`.
In the file `bta12_cgu.vcf.gz` the ancestral allele was set as reference and hence no further polarizing is necessary.
```{r vcf_example, eval = requireNamespace("data.table", quietly = TRUE) & requireNamespace("R.utils", quietly = TRUE)}
hh <- data2haplohh(hap_file = "bta12_cgu.vcf.gz",
polarize_vcf = FALSE,
vcf_reader = "data.table")
```
### Example 5: reading ms output {#LoadDataEx5}
The function `data2haplohh()` automatically checks whether the haplotype file is in the output format of the simulation program *ms* [@Hudson2002]. If this is the case, the parameters `map_file` and `allele_coding` are ignored. If the file contains several 'runs' (as referred to by the parameter `nrep` of *ms*), it is necessary to specify the number of the run in option `chr.name`. Note that always the whole file is read, so that it might be advisable to spread large simulations over separate files.
One argument of the `data2haplohh` function is specifically dedicated to *ms* output, although it works with other formats as well: *ms* gives chromosomal positions as fractions of the interval [0,1] and in order to obtain more realistic values, these positions can be multiplied by a factor, set by `position_scaling_factor`.
Note that *ms* output can contain multiple markers with the same (rounded) position, which *rehh* does not accept. In this case the numerical precision for chromosomal positions in the *ms* output should be increased (option `-p` of *ms*, option `-oformat` of *msms*).
Setting `remove_multiple_markers` to `TRUE` entails that from consecutive markers with the same position only the first one is retained and a warning containing the number of removed markers is printed. Note that this effectively transforms the "infinite sites model" used for simulations by *ms* into a "finite sites model".
In order to read the *ms* format, the package [gap](https://cran.r-project.org/package=gap) has to be installed.
```{r ms_example, eval = requireNamespace("gap", quietly = TRUE)}
hh <- data2haplohh(hap_file = "ms.out",
chr.name = 2,
position_scaling_factor = 1000)
```
## Subset data
Sometimes it might be of interest to restrict the computation of statistics to subsets of haplotypes or markers. This can be done with help of the function `subset()` and arguments `select.hap` and `select.mrk`. After subsetting, the same filters are applied as during data input from files in order to ensure that no marker has exclusively missing values and, optionally, to exclude markers which are monomorphic within the subset by setting `min_maf` to zero.
For example, to restrict the output of *ms* (see section \@ref(LoadDataEx5)) to the first five haplotypes and filter out monomorphic sites, we apply
```{r}
hh_subset = subset(hh, select.hap = 1:5, min_maf = 0)
```
while the following command omits the first marker:
```{r}
hh_subset = subset(hh, select.mrk = -1)
```
# Computing *EHH*, *EHHS* and their "integrals" *iHH* and *iES*
## Definition and computation
### The (allele-specific) *Extended Haplotype Homozygosity (EHH)* {#EHH}
For an allele $a$ of a focal marker $s$, sometimes referred to as a *core* allele, the *Extended Haplotype Homozygosity (EHH)* is defined as the probability that two randomly chosen chromosomes, carrying the core allele, are homozygous over a given surrounding chromosomal region [@Sabeti2002]. It is estimated from a sample by calculating the homozygosity of the chromosomal chunk between the focal marker and another marker $t$ by the formula
\begin{equation}
\mathrm{EHH}_{s,t}^a=\frac{1}{n_{a}(n_a-1)}\sum\limits_{k=1}^{K^a_{s,t}}n_k(n_k-1)
(\#eq:ehh)
\end{equation}
where $n_a$ represents the number of chromosomes carrying the core allele $a$, $K^a_{s,t}$ represents the number of **different** shared haplotypes and $n_k$ refers to the number of chromosomes pertaining to the $k$-th such shared haplotype. If there is no missing data, it holds that $n_a=\sum\limits_{k=1}^{K^a_{s,t}}n_k$.
In the case of unphased chromosomes from diploid individuals (see section \@ref(phasing)) the extended haplotype homozygosity can be calculated as follows [@Tang2007]:
we consider only chromosomes from individuals that are homozygous for the allele $a$ at the focal marker $s$ and estimate *EHH* at some marker $t$ by the fraction of individuals that are (still) homozygous over the entire chromosomal stretch between $s$ and $t$. Let $I^a_{s,t}$ denote the number of individuals that are homozygous from marker $s$ til marker $t$.
\begin{equation}
\mathrm{EHH}_{s,t}^a=\frac{I_{s,t}^a}{I_{s,s}^a}\;.
(\#eq:ehh2)
\end{equation}
*EHH* is usually computed only for a region it surpasses a given threshold (e.g., $EHH > 0.05$).
### The integrated *EHH* (*iHH*) {#iHH}
By definition, *EHH* starts at 1 and decays to 0 with increasing distance of *t* from the focal marker *s*. For a given core allele, the integrated *EHH* (*iHH*) is defined as the area under the *EHH* curve which, in turn, is defined by the *EHH* values and associated chromosomal positions [@Voight2006]. The integral is computed with a simple numerical method, called the *trapezoidal rule*.
Note that, technical details aside, the *iHH* value is nothing else than the average length of shared haplotypes.
### The (site-specific) Extended Haplotype Homozygosity (*EHHS*) {#EHHS}
An extended haplotype homozygosity can be defined as well without regard to core alleles. In this case,
the quantity is aimed to reflect the probability that any two randomly chosen chromosomes from a population are homozygous over a given surrounding chromosomal region of a focal marker. In contrast to the allele-specific *EHH* defined in the previous section, the chromosomes are not required to carry a specific allele at the focal marker. We adopt the naming by [@Tang2007] as *site--specific* EHH, abbreviated by *EHHS*. Note, however that this quantity is sometimes referred to as *EHH*, too, and there is no agreed notation in the literature.
*EHHS* was used in genome scans in two versions: un-normalized by [@Sabeti2007] and normalized by [@Tang2007].
In line with [@Sabeti2007] we define
\begin{equation}
(\#eq:ehhssab)
\mathrm{EHHS}_{s,t}=\frac{1}{n_s(n_s-1)}\sum\limits_{k=1}^{K_{s,t}}n_k(n_k-1)
\end{equation}
where we re-use notation from above and let $n_s$ refer to the number of chromosomes at marker $s$. If there are no missing values at that marker, this is simply the number of chromosomes in the sample.
[@Tang2007] proposed an apparently different estimator for the *normalized EHHS*, which we abbreviate to *nEHHS*, namely
\begin{equation}
(\#eq:ehhstang)
\mathrm{nEHHS}_{s,t}=\frac{1-h_{s,t}}{1-h_s}
\end{equation}
where:
1. $h_s=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,s}}n_k^2 \right )\right)$ is an estimator of the focal marker heterozygosity. In the bi-allelic case we can write this as $h_s=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(n_{a1}^2+n_{a2}^2\right) \right)$ with $a1$ and $a2$ referring to the numbers of the two alleles at marker $s$ ($n_{a1}+n_{a2}=n_s$).
2. $h_{s,t}=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,t}}n_k^2 \right )\right)$ is an estimator of haplotype heterozygosity in the chromosomal region extending from marker $s$ to marker $t$.
However both definitions are in fact equivalent, because it holds
$\mathrm{EHHS}_{s,t}=1-h_{s,t}$ and hence
\begin{equation*}
\mathrm{nEHHS}_{s,t}=\frac{\mathrm{EHHS}_{s,t}}{\mathrm{EHHS}_{s,s}}\;.
\end{equation*}
Thus $\mathrm{nEHHS}_{s,t}$ is just normalized in order to yield 1 at the focal marker $s$. Note that the normalization factor depends on the frequency of the alleles at the focal marker and consequently is not necessarily the same for different focal markers.
Furthermore, we note that *EHHS* and *EHH* are related by
\begin{equation*}
\mathrm{EHHS}_{s,t}=\frac{n_{a1}(n_{a1}-1)}{n_s(n_s-1)}\mathrm{EHH}_{s,t}^{a1}+\frac{n_{a2}(n_{a2}-1)}{n_s(n_s-1)}\mathrm{EHH}^{a2}_{s,t}\;,
\end{equation*}
where for the sake of simplicity we assume that the focal marker has only two alleles $a1$ and $a2$. *EHHS* might hence be viewed as a linear combination of the *EHH*'s of the focal alleles, weighted by roughly the square of the focal allele frequencies.
In the case of unphased chromosomes from diploid individuals (see section \@ref(phasing)) *EHHS* can be calculated like *EHH* in Equation \@ref(eq:ehh2), just without reference to core alleles:
\begin{equation}
\mathrm{EHHS}_{s,t}=\frac{I_{s,t}}{I_{s,s}}\;.
(\#eq:ehhs2)
\end{equation}
Note that defined this way, *EHHS* is always 1 at the focal marker. Hence there is no distinction between $\mathrm{EHHS}$ and $\mathrm{nEHHS}$.
Again, unphased *EHHS* can be related to unphased *EHH* by
\begin{equation}
\mathrm{EHHS}_{s,t}=\frac{I_{s,t}}{I_{s,s}}=\frac{I_{s,t}^{a1}+I_{s,t}^{a2}}{I_{s,s}^{a1}+I_{s,s}^{a2}}=\frac{I_{s,s}^{a1}}{I_{s,s}^{a1}+I_{s,s}^{a2}}\mathrm{EHH}_{s,t}^{a1}+\frac{I_{s,s}^{a2}}{I_{s,s}^{a1}+I_{s,s}^{a2}}\mathrm{EHH}_{s,t}^{a2}\;
\end{equation}
where for the sake of simplicity we assumed a bi-allelic focal marker with alleles $a1$ and $a2$.
As with *EHH*, the *EHHS* is usually computed only for the region where its value surpasses a given threshold (e.g., $EHHS > 0.05$).
### The integrated *EHHS* (*iES*) {#iES}
Like *EHH*, *EHHS* has its maximum at the focal marker and decays to 0 with increasing distance from the focal marker. For a given focal marker, analogously to *iHH*, *iES* is defined as the integrated *EHHS* [@Tang2007]. Depending on whether *EHHS* or *nEHHS* is integrated, we yield *iES* and *inES* respectively. As with *iHH*, the numerical integration uses the *trapezoidal rule*.
Note that, technical details aside, the *iES* and *inES* values represent the average length of shared haplotypes. The length of shared haplotypes with different core alleles yields zero and these are included in the former but not the latter.
## The function `calc_ehh()` {#calcehh}
The function `calc_ehh()` computes *EHH* for all alleles of a focal marker $s$ relative to markers $t$ upstream and downstream. For each allele the corresponding integral *iHH* of the *EHH* curve is calculated as well.
Three options can be specified to constrain the computation of *EHH*:
- `limehh` sets a threshold below which further calculation of *EHH* is stopped. Its default value is 0.05. Note that lowering this cut-off might actually decrease the power to detect selective events since under neutrality a tiny fraction of sequences has very long shared haplotypes which, if not capped, confound the signal of selection [@Klassmann2020].
- `limhaplo` defines the smallest acceptable number of evaluated chromosomes and has a default (and mimimum) value of 2. This parameter might be increased if missing values are suspected to be non-randomly distributed leading to a biased drop-out of evaluated chromosomes.
- `limhomohaplo` sets a minimum number of homozygous chromosomes below which calculation of *EHH* is stopped (or not even started). Its default (and minimum) value is 2. This number should be increased to 4 for small samples of unphased haplotypes in order to limit the influence of a single shared haplotype (see section \@ref(phasing).)
Several parameters influence the *IHH* values (the integral over *EHH*):
- by default, when the calculation of *EHH* reaches the border of the chromosome, i.e. the first or last marker, or encounters a large gap (see below), the integral *iHH* is discarded in order to avoid possible boundary effects. In order to retain the integrals set `discard_integration_at_border` to `FALSE`.
- to account for gaps between consecutive markers, two parameters can be set (see also section \@ref(gaps))
- gaps greater than `scalegap` are re-scaled (capped) to that size.
- integration is stopped/discarded if a gap greater than `maxgap` is encountered.
- integration is performed by default over the area between the graph defined by the *EHH* values and the horizontal line y = `limehh`. If numerical agreement with the program *Hapbin* is wanted, this area should be extended to the x-axis by setting `lower_y_bound` to zero.
- by default, the *EHH* curve is defined by linearly interpolating *EHH* values between consecutive markers, yielding a continuous curve. However, in particular for full re-sequencing data, it is more accurate to let this function decrease step-wise at each marker by setting `interpolate` to `FALSE` (although the difference is likely to be minor).
The option `polarized`, `TRUE` by default, in this function merely affects the order and labeling of alleles.
The parameter `phased` can be toggled to `FALSE` in order to calculate *EHH* by the formula for unphased data (see section \@ref(EHH) and \@ref(phasing)).
The option `include_zero_values` concerns only the output. If `FALSE` (by default), the output contains only *EHH* values for markers up to the specified cut-offs. If the parameter is toggled to `TRUE`, then values for all markers are returned, the remaining ones being zero.
Details are available in the R documentation by using the command:
```{r eval=FALSE}
?calc_ehh
```
In the following example, *EHH* is computed around the SNP `F1205400`. This SNP is associated with a strong signal of selection (using both *iHS* and *Rsb* statistics) and is located closely (<5kb) to a strong candidate gene involved in horn development [@Gautier2011].
Note that the R object `haplohh_cgu_bta12` was generated using the `data2haplohh()` function with the example input files (see section \@ref(LoadDataEx1)).
```{r}
#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHH statistics for the focal SNP with name "F1205400"
#which displays a strong signal of selection
res <- calc_ehh(haplohh_cgu_bta12,
mrk = "F1205400",
include_nhaplo = TRUE)
```
The output contained in `res` is a list with four elements:
1. `mrk.name`: the name/identifier of the focal marker.
2. `freq`: a vector containing the frequencies of all alleles of the focal marker.
3. `ehh`: a data frame containing the *EHH* values for each allele of the focal marker. Setting the parameter `include_nhap` to `TRUE` adds columns which show how many chromosomes contributed at each position to the calculation of *EHH*.
4. `ihh`: a vector with *iHH* for each allele, i.e. the integrals over the *EHH* curves defined by the *EHH* values.
```{r}
res$mrk.name
```
```{r}
res$freq
```
```{r}
res$ehh
```
```{r}
res$ihh
```
The *EHH* values can be plotted by setting the output of `calc_ehh()` into the function `plot()` to yield Figure \@ref(fig:ehh):
```{r, ehh, fig.align = 'center', fig.cap = "Graphical output of the plot.ehh() function", fig.pos = '!h', fig.lp = 'fig:'}
plot(res)
```
For comparison, we might ignore that haplotypes have been phased and set the option `phased` to `FALSE`. Figure \@ref(fig:ehh2) shows that for this particular marker the differences are rather small:
```{r ehh2, fig.align = 'center', fig.cap = "Graphical output of the plot.ehh() function", fig.pos = '!h', fig.lp = 'fig:'}
plot(calc_ehh(haplohh_cgu_bta12,
mrk = "F1205400",
phased = FALSE),
xlim = c(2.55E7, 3.05E7))
```
## The function `calc_ehhs()`
The `calc_ehhs()` function computes *EHHS* and normalized *EHHS* around the focal marker $s$ relative to another marker $t$. This function also computes the corresponding integrals *iES* and *inES* respectively. The options are identical to those of the function `calc_ehh` (see previous section), except that `polarized` is absent, because variant ancestry status does not figure in the formulas. Details are available by the command:
```{r, eval=FALSE}
?calc_ehhs
```
In the following example, the *EHHS* statistics are computed around the SNP `F1205400`.
```{r}
data(haplohh_cgu_bta12)
res <- calc_ehhs(haplohh_cgu_bta12,
mrk = "F1205400",
include_nhaplo = TRUE)
```
The output is similar to that of `calc_ehh()`, except that there are no alleles to be distinguished, but instead the wether *EHHS* is normalized or not. A list with four elements is obtained:
1. `mrk.name`: the name/identifier of the focal marker.
2. `ehhs`: a data frame with *EHHS* and *nEHHS* values along the chromosome around the focal marker. Optionally, the column `NHAPLO` can be included to show how many chromosomes were evaluated at each marker.
3. `IES`: *iES*, i.e. the integral over the EHHS curve.
4. `INES`: *inES*, i.e. the integral over the nEHHS curve.
```{r}
res$mrk.name
```
```{r}
res$ehhs
```
```{r}
res$IES
```
```{r}
res$INES
```
Figure \@ref(fig:ehhs) can be obtained by setting the output of `calc_ehhs` into the function `plot.ehhs()`:
```{r ehhs, fig.align = 'center', fig.cap = 'Graphical output of the plot.ehhs() function', fig.pos = '!h', fig.lp = 'fig:'}
plot(res)
```
By default, the un-normalized *EHHS* values are plotted. This can be toggled by setting `nehhs = TRUE`. As can be seen in Figure \@ref(fig:nehhs), the values are scaled to yield 1 at the focal marker.
```{r nehhs, fig.align = 'center', fig.cap = 'Graphical output of the plot.ehhs() function', fig.pos = '!h', fig.lp = 'fig:'}
plot(res, nehhs = TRUE)
```
The parameter `phased` of the function `calc_ehhs()` can be toggled to `FALSE` yielding Figure \@ref(fig:ehhs2). As described above, for unphased data the *EHHS* values are always normalized. For this particular marker we get a similar picture as for phased *nEHHS*.
```{r ehhs2, fig.align = 'center', fig.cap = 'Graphical output of the plot.ehhs() function', fig.pos = '!h', fig.lp = 'fig:'}
plot(calc_ehhs(haplohh_cgu_bta12,
mrk = "F1205400",
phased = FALSE))
```
## The function `scan_hh()` {#scanhh}
While the functions `calc_ehh()` and `calc_ehhs()` return the integrated statistics *iHH* and *iES* for a particular marker, the function `scan_hh()` calculates these values for all markers in the `haplohh` object. However, in contrast to `calc_ehh()` which computes an *iHH* value for each allele of a possibly multi-allelic focal marker, `scan_hh()` calculates only two *iHH* values, namely (by default) for the ancestral allele and the derived allele with highest frequency. In case of unpolarized markers (see section \@ref(polarizing)), the option `polarized` should be set to `FALSE`, causing the function to use major and minor (most frequent and second most frequent) alleles instead.
The remainder of options are essentially the same as for the functions `calc_ehh()` and `calc_ehhs()` (see section \@ref(EHH)). Details are given in the documentation:
```{r eval=FALSE}
?scan_hh
```
As an example, in order to scan the `haplohh_cgu_bta12` object (containing data on 1424 SNPs for 280 haplotypes), one may use the following command:
```{r}
data(haplohh_cgu_bta12)
scan <- scan_hh(haplohh_cgu_bta12)
```
The resulting object `scan` is a data frame with `r nmrk(haplohh_cgu_bta12)` rows (the number of markers declared in the `haplohh` object) and 10 columns:
1. chromosome name
2. position in the chromosome
3. sample frequency of the ancestral / major allele
4. sample frequency of the second-most frequent remaining allele
5. number of evaluated haplotypes at the focal marker for the ancestral / major allele
6. number of evaluated haplotypes at the focal marker for the second-most frequent remaining allele
7. *iHH* of the ancestral / major allele
8. *iHH* of the second-most frequent remaining allele
9. *iES*
10. *inES*
Note that while for phased data the number of evaluated haplotypes of a core allele corresponds to its frequency in the sample, in case of unphased data the evaluation is restricted to haplotypes of homozygous individuals.
The following command prints a chunk of the scan containing the marker `F1205400`:
```{r}
scan[453:459,]
```
Note that `scan_hh()` is more efficient than `calc_ehh()` and `calc_ehhs()` applied consecutively for each marker as can be seen by running the two code snippets below:
```{r}
# perform scan using scan_hh
system.time(scan <- scan_hh(haplohh_cgu_bta12))
```
```{r slow_scan}
# perform scan applying calc_ehh and calc_ehhs to each marker
slow_scan_hh <- function(haplohh) {
# create empty vectors of size nmrk
IHH_A <- IHH_D <- IES <- INES <- vector("numeric", nmrk(haplohh))
# invoke calc_ehh and calc_ehhs for each marker
for (i in 1:nmrk(haplohh)) {
res <- calc_ehh(haplohh, mrk = i)
IHH_A[i] <- res$ihh["IHH_A"]
IHH_D[i] <- res$ihh["IHH_D"]
res <- calc_ehhs(haplohh, mrk = i)
IES[i] <- res$IES
INES[i] <- res$INES
}
# create data frame (the return value of this function)
data.frame(IHH_A = IHH_A,
IHH_D = IHH_D,
IES = IES,
INES = INES)
}
system.time(slow_scan <- slow_scan_hh(haplohh_cgu_bta12))
```
Comparing columns shows that the computed values are identical:
```{r scancomp}
identical(slow_scan[, "IHH_A"], scan[, "IHH_A"])
identical(slow_scan[, "IHH_D"], scan[, "IHH_D"])
identical(slow_scan[, "IES"], scan[, "IES"])
identical(slow_scan[, "INES"], scan[, "INES"])
```
# Computing *iHS*, *Rsb* and *XP-EHH*
## The *iHS* within-population statistic
### Definition {#ihs}
The abbreviation *iHS* refers to "integrated haplotype homozygosity score". Let *uniHS* represent the un-standardized log-ratio of ancestral *iHH*$^A$ to derived *iHH*$^D$ of a certain focal marker $s$:
$$\mathrm{uniHS}(s)=\ln\left(\frac{\mathrm{iHH}^A(s)}{\mathrm{iHH}^D(s)}\right)$$
Following [@Voight2006] we perform a standardization by setting:
$$\mathrm{iHS}(s)=\frac{\mathrm{uniHS}(s) - \mathrm{mean}(\mathrm{uniHS}|p_s)}{\mathrm{sd}(\mathrm{uniHS}|p_s)}$$
where $\mathrm{mean}(\mathrm{uniHS}|p_s)$ and $\mathrm{sd}(\mathrm{uniHS}|p_s)$ represent the mean and standard deviation of *uniHS* restricted to those markers with a similar derived allele frequency as observed at the focal marker ($p_s$). In practice, the markers are binned so that each bin contains not a single frequency but a small range of frequencies to yield enough markers for reliable estimates of mean and standard deviation.
*iHS* is constructed to have an approximately standard Gaussian distribution and to be comparable across markers regardless of their underlying allele frequencies.
For a practical characterization of "outliers" we stretch statistical notation and associate a p-value to *iHS* [@Gautier2011] by:
$$p_\mathrm{iHS}=-\log_{10}\left(2\Phi\left(-|\mathrm{iHS}|\right)\right)$$
where $\Phi\left(x\right)$ represents the Gaussian cumulative distribution function.
Assuming most of genotyped markers behave neutrally and the distribution of *iHS* values for neutral markers follows indeed a standard Gaussian distribution, we might interpret $p_\text{iHS}$ as a two-sided p-value (on a $-\log_{10}$ scale) of a test on the null hypothesis of no selection.
Alternatively, a one-sided p-value (in a $-\log_{10}$ scale) might be defined [@Gautier2011] as:
\begin{equation*}
p^\text{left}_\text{iHS}=-\log_{10}\left(\Phi\left(\text{iHS}\right)\right)
\end{equation*}
allowing the identification of those sites where the derived allele displays a significantly higher extended haplotype homozygosity than the ancestral allele or
\begin{equation*}
p^\text{right}_\text{iHS}=-\log_{10}\left(1-\Phi\left(\text{iHS}\right)\right)
\end{equation*}
for the opposite case.
Note that this procedure is controversial, because we identify the empirical distribution with the distribution under the null hypothesis of neutrality. This is an approximation at best and only warranted when it can be assumed that there are so few selected sites that their influence on the overall shape of the distribution can be neglected.
In case of unpolarized alleles, the *uniHS* is taken as the ratio of *iHH* from minor to major allele. Since derived allele frequency cannot be accounted for, no binning should be performed. The resulting standardized *iHS* cannot be expected to follow a normal distribution and p-values as defined above are not meaningful.
### The function `ihh2ihs()` {#ihh2ihs}
The `ihh2ihs()` function computes *iHS* using a data frame containing the *iHH* values for ancestral and derived (resp. major and minor) alleles as obtained by the `scan_hh()` function (see section \@ref(scanhh)). The argument `min_maf` allows to exclude focal markers according to their minor allele frequency (by default `min_maf`=0.05). The argument `freqbin` controls the size (or number) of the allele frequency bins used to perform standardization (see section \@ref(ihs)). More precisely, allele frequency bins are built from `min_maf` to 1-`min_maf` in steps of size `freqbin` (by default `freqbin`=0.025). If an integer of 1 or greater is specified, a corresponding number of equally spaced bins is created. If `freqbin` is set to 0, standardization is performed considering each observed frequency as a discrete frequency class, which is useful when there are only a few distinct haplotypes.
For unphased data, *iHH* is calculated using only haplotypes of individuals which are homozygous at the focal marker. This number can be considerably lower than the absolute allele frequency. Hence, in addition to `min_maf`, the option `min_nhaplo` (default `NA`) should be used to reduce statistical noise arising from too few evaluated haplotypes.
Optionally, the allele frequencies of the input data frame can be included into the output by setting `include_freq` to `TRUE`.
A p-value is calculated for standardized *iHS* values. By default, it is two-sided, but a side can be chosen by setting argument `p.side` to `"left"` or `"right"`.
As a typical workflow for performing a whole genome scan one might run `scan_hh()` on haplotype data from each chromosome and concatenate the resulting data frames before standardization. In the following example, we assume that the haplotype files are named as `hap_chr_i.cgu` with $i=1,...,29$ and the marker information file is named `map.inp`. The data frame `wgscan` contains the $iHH^A$ and $iHH^D$ values for the whole genome and serves as input for the `ihh$ihs()` function which calculates *iHS*, i.e. the standardized log ratio of the two $iHH$ values, for each marker.
```{r ihh2ihs1, eval = FALSE}
## demo code - no data files for all chromosomes provided
for(i in 1:29) {
# haplotype file name for each chromosome
hap_file = paste("hap_chr_", i, ".cgu", sep = "")
# create internal representation
hh <- data2haplohh(hap_file = hap_file,
map_file = "map.inp",
chr.name = i,
allele_coding = "map")
# perform scan on a single chromosome (calculate iHH values)
scan <- scan_hh(hh)
# concatenate chromosome-wise data frames to
# a data frame for the whole genome
# (more efficient ways certainly exist...)
if (i == 1) {
wgscan <- scan
} else {
wgscan <- rbind(wgscan, scan)
}
}
# calculate genome-wide iHS values
wgscan.ihs <- ihh2ihs(wgscan)
```
For illustration, the object `wgscan.cgu` contains the calculated and concatenated $iHH$ values of a whole genome scan on population CGU [@Gautier2011]:
```{r ihh2ihs2}
library(rehh.data)
data(wgscan.cgu)
wgscan.ihs.cgu <- ihh2ihs(wgscan.cgu)
```
The resulting object `wgscan.ihs.cgu` is a list with two elements:
1. `ihs`: a data frame with *iHS* and corresponding p-value $p_\mathrm{iHS}$ (p-value in a $-\log_{10}$ scale assuming the *iHS* are normally distributed under the neutral hypothesis). The first rows are displayed below:
```{r}
head(wgscan.ihs.cgu$ihs)
```
2. `frequency.class`: a data frame summarizing the derived allele frequency bins used for standardization and mean and standard deviation of the un-standardized values. The first rows are shown below:
```{r}
head(wgscan.ihs.cgu$frequency.class)
```
## The *Rsb* pairwise population statistic
### Definition {#rsb}
The abbreviation *Rsb* stands for "**r**atio of EHH**S** **b**etween populations".
For a given marker $s$, let $$\mathrm{unRsb}(s)=\ln\left(\frac{\mathrm{inES}_\text{pop1}(s)}{\mathrm{inES}_\text{pop2}(s)}\right)$$ represent the log-ratio of the *inES* values computed in two populations $pop1$ and $pop2$ (see section \@ref(iES)).
The *Rsb* for marker $s$ is defined as the standardized *unRsb* [@Tang2007]:
\begin{equation*}
\mathrm{Rsb(s)}=\frac{\mathrm{unRsb}(s) - \text{median}({\mathrm{unRsb}})}{\mathrm {sd}({\mathrm{unRsb})}}
\end{equation*}
where $\text{median}(\mathrm{unRsb})$ and $\mathrm{sd}(\mathrm{unRsb})$ represent the median and standard deviation of *unRsb* computed over all analyzed markers. Note that for the sake of uniformity, we included the logarithm into the definition of the quantity while in the original article it remained outside.
However, we follow [@Tang2007] in using the median instead of the mean (hence unlike the definitions of *iHS* and *XP-EHH*). The authors argue that this might increase the robustness against different demographic scenarios.
Like *XP-EHH*, but unlike $iHS$, no binning with respect to allele frequencies is performed (see section \@ref(polarizing)).
As *iHS* (see section \@ref(ihs)), *Rsb* is constructed to have an approximately standard Gaussian distribution and may be transformed analogously into a p-value $p_\text{Rsb}$ (in a $-\log_{10}$ scale), either two-sided or one-sided. As for the latter, $p^\text{left}_\text{Rsb}$ identifies strong extended homozygosity in population $pop2$ relative to population $pop1$ and vice versa for $p^\text{right}_\text{Rsb}$.
### The function `ines2rsb()` {#ines2rsb}
The `ines2rsb()` function computes *Rsb* using two data frames containing each the *inES* statistics of one population as obtained by the `scan_hh()` function (see section \@ref(scanhh)).
As with *iHS* the *inES* values have to be calculated chromosome-wise and concatenated afterwords (for each population separately) like in the example of section \@ref(ihh2ihs). The *Rsb* values are computed for all markers present in both data frames. They are identified by chromosome name and position, not their identifiers, thus within each population the chromosomal positions must be unique.
Optionally, the allele frequencies of both populations can be taken over from the input data frames (if present there) to the output data frame by setting `include_freq` to `TRUE`.
For illustration, we take calculated and concatenated $inES$ values from a genome scan [@Gautier2011] provided as example data and compute for each SNP the *Rsb* between the CGU and EUT populations:
```{r ines2rsb2}
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
rsb.cgu_eut <- ines2rsb(scan_pop1 = wgscan.cgu,
scan_pop2 = wgscan.eut,
popname1 = "CGU",
popname2 = "EUT")
```
The resulting object `rsb.cgu_eut` is a data frame which shows for each marker its *Rsb* and associated p-value. The latter are by default two-sided, but a side can be chosen by setting argument `p.side` to `"left"` or `"right"`. The first rows of the resulting data frame are printed below:
```{r}
head(rsb.cgu_eut)
```
## The *XP-EHH* pairwise population statistic
### Definition
The *XP-EHH* (cross-population EHH) statistic [@Sabeti2007] is similar to *Rsb* except that it is based on *iES* instead of *inES* (see section \@ref(iES)). Hence, for or a given marker $s$, let $$\text{unXP-EHH}(s)=\ln\left(\frac{\mathrm{iES}_\text{pop1}(s)}{\mathrm{iES}_\text{pop2}(s)}\right)$$ represent the log-ratio of the *iES* values computed by the function `scan_hh()` in the populations $pop1$ and $pop2$ (see section \@ref(iES)).
The *XP-EHH* for a given focal marker is defined as the standardized *unXP-EHH* [@Sabeti2007]:
\begin{equation*}
\text{XP-EHH}(s)=\frac{\text{unXP-EHH}(s) - \text{mean}(\text{unXP-EHH})}{\text{sd}(\text{unXP-EHH})}
\end{equation*}
where $\text{mean}(\text{unXP-EHH})$ and $\text{sd}(\text{unXP-EHH})$ represent the mean and standard deviation of *unXP-EHH* computed over all analyzed markers.
As with *Rsb*, the information about the ancestral and derived status of alleles at the focal marker does not figure in the formula and no binning is performed.
As with the previous two scores *iHS* (section \@ref(ihs)) and *Rsb* (section \@ref(rsb)), *XP-EHH* is constructed to have an approximately standard Gaussian distribution and may be transformed analogously into a p-value $p_\text{XP-EHH}$ (in a $-\log_{10}$ scale), either two-sided or one-sided. Among the latter, $p^\text{left}_\text{XP-EHH}$ identifies strong extended homozygosity in population $pop2$ relative to population $pop1$ and vice versa for $p^\text{right}_\text{XP-EHH}$.
### The function `ies2xpehh()` {#ies2xpehh}
The `ies2xpehh()` function computes *XP-EHH* using two data frames, one for each population, containing the *iES* statistics as obtained by the `scan_hh()` function (see section \@ref(scanhh)).
As with *iHS* and *Rsb*, the *iES* have to be calculated chromosome-wise and concatenated afterwords (for each population separately) as has been shown in section \@ref(ihh2ihs). The *XP-EHH* values are computed for all markers present in both data frames. They are identified by chromosome name and position, not their identifiers, thus within each population the chromosomal positions must be unique.
Optionally, the allele frequencies in both populations can be taken over from the input data frames (if present there) to the output data frame by setting `include_freq` to `TRUE`.
For illustration, we take calculated and concatenated $iES$ values from a genome scan [@Gautier2011] and compute for each SNP the *XP-EHH* between the CGU and EUT populations:
```{r ies2xpehh2}
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
xpehh.cgu_eut <- ies2xpehh(scan_pop1 = wgscan.cgu,
scan_pop2 = wgscan.eut,
popname1 = "CGU",
popname2 = "EUT")
```
The resulting object `xpehh.cgu_eut` is a data frame containing for each SNP the *XP-EHH* and associated p-value. The latter are by default two-sided, but a side can be chosen by setting argument `p.side` to `"left"` or `"right"`.
The first rows are displayed below:
```{r}
head(xpehh.cgu_eut)
```
## Delineating regions with "outliers": the function `calc_candidate_regions()`
In contrast to neutral evolution, selection tends to produce clusters of markers with outlier values [@Tang2007]. For this reason, typically not (only) the markers with outlier scores are reported, but intervals with a conspicuous number or percentage of such markers. Although often similar ad-hoc procedures can be found in the literature, their respective parameters vary widely. We offer the function `calc_candidate_regions()` which scans through the genome in (possibly overlapping) sliding windows and identifies those as "candidates" which fulfill three conditions:
- a minimum number of markers (`min_n_mrk`)
- a minimum number of extremal markers (`min_n_extr_mrk`)
- a minimum percentage of extremal markers among the markers (`min_perc_extr_mrk`).
Markers with extreme values are those with a score greater than specified by parameter `threshold`. Apart from the values of the above conditions, the average of all marker scores and the average of extremal scores is reported for each candidate window.
By default, neighboring and overlapping candidate windows are joined together and the associated values recalculated. This can be turned off by setting `join_neighbors` to `FALSE`.
If `window_size` is set to 1, only the set of extremal markers is returned.
B default, the score itself is tested and reported. Toggling `pval` to `TRUE`, uses the associated p-value instead.
```{r candidate_regions}
cr.cgu <- calc_candidate_regions(wgscan.ihs.cgu,
threshold = 4,
pval = TRUE,
window_size = 1E6,
overlap = 1E5,
min_n_extr_mrk = 2)
cr.cgu
```
\clearpage
# Visualization of the statistics
## Un-standardized *iHS*: the function `freqbinplot()` {#freqbinplot}
Since mutations and recombinations tend to shorten stretches of homozygosity between chromosomes, an ancestral (older) allele is expected to have lower extended haplotype homozygosity than a derived (younger) allele. Hence the distribution of unstandardized $iHS=\ln(\frac{iHH^A}{iHH^D})$ is biased towards negative values. More importantly, there is a additional dependency on both ancestral and derived allele frequencies (in the bi-allelic case one is determined by the other). This can be seen by setting the output of the function `ihh2ihs` into the function `freqbinplot()` yielding Figure \@ref(fig:freqbin).
```{r freqbin, fig.align='center', fig.lp='fig:', fig.cap='Graphical output of the freqbinplot() function', fig.pos='!h'}
freqbinplot(wgscan.ihs.cgu)
```
Note that this frequency dependence is expected under neutral evolution and not due to selection. In fact, an implicit assumption of the bin-wise standardization is that each bin is dominated by neutral markers. Thus, the bin-wise standardization reduces the dependency of *iHS* on the allele frequency; see section \@ref(polarizing) of this vignette and Figure 4 in [@Voight2006] [In their case the effect is even more pronounced, because they polarized alleles with help of an outgroup *species* while in the case of our data it was done using extant ancestral *populations* as a proxy.]
## *Rsb* vs. *XP-EHH* comparison
Figure \@ref(fig:comp) shows a plot of *Rsb* against *XP-EHH* values across the CGU and EUT populations (see section \@ref(ines2rsb) and \@ref(ies2xpehh) respectively). Marked in red is the marker that was repeatedly used in the examples above. The plot was generated using the following R code:
```{r comp, echo = TRUE, fig.align = 'center', fig.lp = 'fig:', fig.cap = 'Comparison of Rsb and XP-EHH values across the CGU and EUT populations', fig.pos = "!h"}
plot(rsb.cgu_eut[, "RSB_CGU_EUT"],
xpehh.cgu_eut[, "XPEHH_CGU_EUT"],
xlab = "Rsb",
ylab = "XP-EHH",
pch = ".",
xlim = c(-7.5, 7.5),
ylim = c(-7.5, 7.5))
# add red circle for marker with name "F1205400"
points(rsb.cgu_eut["F1205400", "RSB_CGU_EUT"],
xpehh.cgu_eut["F1205400", "XPEHH_CGU_EUT"],
col = "red")
# add dashed diagonal
abline(a = 0, b = 1, lty = 2)
```
## Distribution of standardized values: the function `distribplot()`
The `distribplot()` function allows to visualize the distributions of the standardized scores (either *iHS*, *Rsb* or *XP-EHH*) and compare them to the standard Gaussian distribution. As an example, in Figure \@ref(fig:distribplot) the function was applied on the *iHS* estimates obtained for the CGU population (see section \@ref(ihs)):
```{r distribplot, fig.align = 'center', fig.lp = 'fig:', fig.cap = 'Graphical output of the distribplot() function', fig.pos="!h"}
distribplot(wgscan.ihs.cgu$ihs$IHS, xlab = "iHS")
```
Setting the option `qqplot` to `TRUE` toggles the output to a qqplot, as in Figure \@ref(fig:qqplot).
```{r qqplot, fig.align = 'center', fig.lp='fig:', fig.cap = 'Graphical output of the distribplot() function', fig.pos = "!h"}
distribplot(wgscan.ihs.cgu$ihs$IHS,
xlab = "iHS",
qqplot = TRUE)
```
## Genome wide score plots: the function `manhattanplot()`
The function `manhattanplot()` draws a Manhattan plot of the whole genome scores obtained by the functions `ihh2ihs()`, `ines2rsb()` or `ies2xpehh()`. Figure \@ref(fig:manhattanplot) was drawn using the command:
```{r manhattanplot, fig.align = 'center', fig.width = 7, fig.lp = 'fig:', fig.cap = 'Graphical output of the manhattanplot() function', fig.pos = '!h'}
manhattanplot(wgscan.ihs.cgu,
main = "iHS (CGU cattle breed)")
```
If the option `pval` is switched to `TRUE` the associated p-value instead of the score itself is plotted (Figure \@ref(fig:pval)).
```{r pval, fig.align = 'center', fig.width = 7, fig.lp = 'fig:', fig.cap = "Graphical output of the manhattanplot() function", fig.pos = '!h'}
manhattanplot(wgscan.ihs.cgu,
pval = TRUE,
threshold = 4,
main = "p-value of iHS (CGU cattle breed)")
```
The colors of the chromosomes in Figures \@ref(fig:manhattanplot) and \@ref(fig:pval) are the default colors of R as obtained by the command `palette()`. It is possible to change them by the same command. Note that the colors are associated with the order of chromosomes in the scan and not their order in the plot.
Candidate regions as obtained by the function `calc_candidate_regions()` can be added to the plot as parameter `cr`. Individual markers can be highlighted by setting argument `mrk` to a vector of marker IDs or a data frame with positions (containing columns with name `CHR` and `POSITION`).
By default, chromosomes are separated by an inset of 5,000,000 bases. This value can be increased by the corresponding parameter in order to further reduce overlap between data points of neighboring chromosomes.
In order to reduce the number of plotted data points, the data set can be rasterized in both dimensions by parameter `resolution`. The data points are then rounded to the specified resolution and duplicate points removed.
Furthermore, it is possible to specify a subset or a re-ordering of chromosomes with help of parameter `chr.name` as in Figure \@ref(fig:manhattanplotsub).
```{r manhattanplotsub, fig.align = 'center', fig.width = 7, fig.lp = 'fig:', fig.cap = 'Graphical output of the manhattanplot() function', fig.pos = '!h'}
# re-define colors
palette(c("red", "green"))
manhattanplot(wgscan.ihs.cgu,
pval = TRUE,
threshold = 4,
chr.name = c("1", "4", "5", "12"),
main = "iHS (CGU cattle breed)",
cr = cr.cgu,
mrk = "F1205400",
inset = 1E+7,
resolution = c(200000, 0.05))
# set back to default colors
palette("default")
```
## Genome wide score plots: the function `manhattan()` of package `qqman`
The package [qqman](https://cran.r-project.org/package=qqman) contains a function `manhattan()` which is similar to the function `manhattanplot()` of this package. The input data frame is expected to have a slightly different format, though. Hence, before plotting we need to "translate" our data as in the following example with *iHS* values:
```{r rehh2qqman}
# extract data frame from result list
ihs <- wgscan.ihs.cgu$ihs
# create new data frame
wgscan.cgu.ihs.qqman <- data.frame(
CHR = as.integer(factor(ihs$CHR,
levels = unique(ihs$CHR))),
# chromosomes as integers
BP = ihs$POSITION, # base pairs
P = 10**(-ihs$LOGPVALUE), # transform back to p-values
SNP = row.names(ihs) # SNP names
)
```
We can now use the new data frame as input for the function `manhattan()` to obtain Figure \@ref(fig:qqman):
```{r qqman, eval = requireNamespace("qqman", quietly = TRUE), fig.align = 'center', fig.width = 7, fig.lp = 'fig:', fig.cap = 'Graphical output of the qqman::manhattan() function', fig.pos = '!h', message = FALSE}
library(qqman)
qqman::manhattan(wgscan.cgu.ihs.qqman,
col = c("red","green"),
chrlabs = unique(ihs$CHR),
suggestiveline = 4,
highlight = "F1205400",
annotatePval = 0.0001)
```
<!-- \clearpage -->
# Visualization of the haplotype structure
## The function `plot.haplohh()`
Chunks of the haplotype matrix can be plotted, provided they are not too large, with sequences in rows and markers in columns. The following plot shows a chunk of 200 SNPs in the vicinity of SNP `F1205400` which is identified by the dashed line. The derived alleles are marked in red, the ancestral ones not shown.
```{r plothaplo, results = "hide", fig.align = 'center', fig.width = 7, fig.height = 5, fig.lp = 'fig:', fig.cap = 'Graphical output of the plot.haplohh() function', fig.pos = '!h'}
hh_subset <- subset(haplohh_cgu_bta12, select.mrk = 350:550)
oldpar <- par(mar = c(3, 2, 2, 2) + 0.1)
plot(
hh_subset,
mrk = "F1205400",
group_by_allele = TRUE,
ignore.distance = TRUE,
col = c(NA, "red"),
linecol = c("lightblue", "lightpink"),
mrk.col = "black",
cex = 0.1,
pos.lab.hap = "none",
pos.lab.mrk = "none"
)
par(oldpar)
```
The sequences are ordered with respect to the alleles of SNP `F1205400`: the lower two thirds of sequences contain the derived allele and are more homogenous around the SNP than the sequences carrying the ancestral allele.
\clearpage
## The functions `calc_furcation()` and `plot.furcation()`
Furcation structures portray the decay of haplotype homozygosity around a focal marker [@Sabeti2002]. They represent the complete information contained in the concept of extended shared haplotypes of which *EHH* and *EHHS* can be regarded as summary statistics. For each allele of a focal marker and each side ("left" or "right"), furcations arise when the extension step reaches a marker whose alleles distinguish hitherto identical haplotypes. In case of bi-allelic markers without missing values these can be only bifurcations. The furcation structure for a specific allele and a specific side is hence a tree with its root at the focal marker. A stark contrast between furcation structures of different core alleles should correspond to outlier values of *iHS*.
The function `calc_furcation()` calculates such a furcation structure around a focal marker.
```{r}
data(haplohh_cgu_bta12)
furcation <- calc_furcation(haplohh_cgu_bta12,
mrk = "F1205400")
```
The result is an object of class `furcation`, comprising a set of tree structures with some additional information, see `?"furcation-class"` for technical details.
The function `plot.furcation()` renders graphically the furcation object. Within the plot, the root (focal marker) is identified by a vertical dashed line.
The diagram is bi-directional, portraying decay along both sides of the focal marker. The thickness of the lines corresponds to the number of chromosomes sharing a haplotype.
As an illustration, Figure \@ref(fig:furc) shows the bifurcation diagrams for derived and ancestral allele of the marker `F1205400`.
```{r furc, fig.align = 'center', fig.width = 7, fig.height = 7.5, fig.lp = 'fig:', fig.cap = 'Graphical output of the plot.furcation() function', fig.pos = "!h"}
plot(furcation)
```
It is possible to zoom into furcations using parameter `xlim` and label the leaves using the parameter `hap.names`, as shown in Figure \@ref(fig:furczoom), produced by the following command. Labels in bold indicate that further chromosomes are associated with the corresponding branch.
```{r furczoom, fig.align = 'center', fig.width = 7, fig.height = 7.5, fig.lp = 'fig:', fig.cap = 'Graphical output of the plot.furcation() function', fig.pos = "!h"}
plot(furcation,
xlim = c(2.8E+7, 3E+7),
lwd = 0.05,
hap.names = hap.names(haplohh_cgu_bta12),
cex.lab = 0.3)
```
Individual trees can be exported into Newick format. The function `as.newick` produces a (possibly very long) string in this format. The following command extracts the furcation tree of the ancestral allele on the left side of the marker:
```{r}
newick <- as.newick(furcation,
allele = 0,
side = "left",
hap.names = hap.names(haplohh_cgu_bta12))
```
Such a string can be rendered graphically e.g. by the R package [ape](https://cran.r-project.org/package=ape) yielding Figure \@ref(fig:newick):
```{r newick, eval = requireNamespace("ape", quietly = TRUE), fig.align = 'center', fig.width = 6, fig.height = 6, fig.lp = 'fig:', fig.cap = 'Graphical output of the ape::plot.phylo() function', fig.pos = "!h"}
library(ape)
tree <- ape::read.tree(text = newick)
plot(tree,
cex = 0.5,
direction = "leftwards",
edge.color = "blue",
underscore = TRUE,
no.margin = TRUE)
```
## The functions `calc_haplen()` and `plot.haplen()`
To each haplotype in the sample the length of its longest shared haplotype is assigned, i.e. the range over which it is identical to at least one other haplotype (the latter might be different left and right to the focal marker). It corresponds to the maximal extension of the inner branches to both sides of the focal marker in a furcation diagram. The function `calc_haplen()` calculates this quantity:
```{r}
haplen <- calc_haplen(furcation)
```
The `haplen` object is a list with four elements:
1. `mrk.name`: the name/identifier of the focal marker
2. `position`: the position of the focal marker
3. `xlim`: the position of first and last marker in the data set
4. `haplen`: a data frame with begin and end positions of the extended shared haplotypes.
```{r}
haplen$mrk.name
```
```{r}
haplen$position
```
```{r}
haplen$xlim
```
```{r}
head(haplen$haplen)
```
Haplotype length is visualized in Figure \@ref(fig:haplo) obtained by the command