---
title: 'Targeted Metabolomics Data Analysis: Unlocking Insights with Machine Learning, AI and Statistics'
subtitle: 'Day 3 – Lecture 1'
author: "June 11-14, 2024"
institute: "Barcelona, Spain"
date: ""
output:
xaringan::moon_reader:
css: [default, metropolis, metropolis-fonts, "mycss.css"]
lib_dir: libs
nature:
ratio: '16:9'
highlightStyle: github
highlightLines: true
countIncrementalSlides: true
editor_options:
chunk_output_type: console
---
<style type="text/css">
.remark-slide-content {
font-size: 22px;
padding: 1em 4em 1em 4em;
}
.left-code {
color: #777;
width: 38%;
height: 92%;
float: left;
}
.right-plot {
width: 60%;
float: right;
padding-left: 1%;
}
</style>
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE,
                      fig.dim = c(4.8, 4.5), fig.retina = 2, out.width = "100%")
knitr::knit_hooks$set(mysize = function(before, options, envir) {
if (before)
return(options$size)
})
```
# Outline
.columnwide[
### 1) Introduction
### 2) Data preprocessing
### 3) Exploratory Analysis
### 4) On tests and P-values
### 5) The multiple testing problem
### 6) Summary and Conclusions
### 7) References and Resources
]
---
class: inverse, middle, center
name: Introduction
# Introduction and motivation
---
# Where we come from
```{r out.width="80%", fig.cap='From spectra and images to data tables'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_1.png")
```
---
# Where we are heading
```{r out.width="80%", fig.align='center', fig.cap='Making sense of the data with Statistical and Bioinformatical methods and tools.'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_2.png")
```
---
# Learning objectives
- Familiarise with the *omics data analysis process*
- Refresh the statistical background.
- Emphasize relevant aspects of tests (p-values, multiple tests)
- Learn about distinct approaches
- Description/Modelling,
- Univariate/Multivariate
- Statistics/Machine Learning
- Exploratory Unsupervised Statistical Methods
- Supervised Statistical and ML methods
---
# The Omics Data Analysis Process
```{r out.width="90%", fig.align='center', fig.cap='Omics technologies may differ in their data generation processes as well in the questions they aim at answering. The general steps, however, show common traits among technologies.'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_3.png")
```
---
# The Statistical Analysis Process
```{r out.width="85%", fig.align='center', fig.cap='Superimposed to the Omics Process there are a series of exploratory and modeling steps that, altogether, form the Statistical Analysis Process.'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_4.png")
```
---
# The Data for the Analysis
- At the end of the data generation process we end up with similar types of data, such as *peak intensities* or *concentrations*, usually organized in some type of rectangular *features* $\times$ *samples* table.
- These are linked to complementary information (sample groups, metabolite names, ...), generally known as the study *metadata*.
```{r out.width="60%", fig.align='center', fig.cap='A possible organization of data and metadata using an Excel spreadsheet'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_6.png")
```
---
# Metadata organization
- Typically, *metadata* is intended to provide information on:
- Sample characteristics or experimental groups
- Variable names and other characteristics (e.g. whether a metabolite is *endogenous* or *exogenous*)
- Study characteristics such as the platform, technology or other data generation details.
```{r out.width="80%", fig.align='center', fig.cap='Metadata can be provided as separate pieces of information.'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_7.png")
```
---
# Metadata management
- Some software libraries like [Bioconductor](http://bioconductor.org) provide specific structures to allow the integrated management of data and metadata.
- This is a powerful approach, but requires some level of expertise and is far from being universally adopted.
```{r out.width="45%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_9.png")
```
---
# Metadata integrated with data
- Metadata can be combined with feature values in the same file.
- This simplifies data management, but may result in poorer metadata descriptions.
```{r out.width="95%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_8.png")
```
---
# Omics data structure
- As with many other OMICS, Metabol*omics* data are high throughput.
- This, in practice, means often having more variables than samples.
- It also imposes some restrictions to the analysis methods that can be used on the data.
```{r out.width="70%", fig.align='center', fig.cap='Traditional datasets (left) have more samples (K) than variables (N), while omics datasets (right) may have more features than samples (N >> K)'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_5.png")
```
---
class: inverse, middle, center
name: DataPreprocessing
# Data pre-processing
---
# Data may not be that good
- Omics data are high throughput which, in practice, means there is a huge number of values to deal with.
- These values may be affected by the process that has generated them, which may have experienced unexpected issues.
- Even if nothing went wrong there may be **noise**, some values may be **missing**, others may be unexpectedly big or small (**outliers**).
---
# Quality control and preprocessing
.pull-left[
- The data usually undergo a series of iterative steps where they are
- checked,
- eventually adjusted to correct some detected problem, and
- re-checked to find out if the problem has been corrected.
- Here we don't go into pre-processing details, because they are discussed elsewhere.
- We only consider them insofar as they may affect statistical analysis.
]
.pull-right[
```{r out.width="100%", fig.align='center', fig.cap=''}
# knitr::include_graphics("images/1-StatisticsBackground_insertimage_11.png")
knitr::include_graphics("images/preprocessing_scheme.png")
```
]
---
# Missing values (MV)
.pull-left[
- Due to either signal truncation, failures in peak detection, or truly missing values.
- Can lead to biased results, reduced statistical power, and invalid conclusions.
- MV are usually imputed using mean/mode imputation, k-nearest neighbours, or multiple imputation techniques.
- Failure to address them can distort parameter estimates and increase type I/II errors.
]
.pull-right[
```{r out.width="100%", fig.align='center', fig.cap='Source: https://doi.org/10.1007/s11306-011-0366-4'}
# knitr::include_graphics("images/1-StatisticsBackground_insertimage_13.png")
knitr::include_graphics("images/1-StatisticsBackground_insertimage_13.png")
```
]
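---
# Missing values: a minimal sketch
As a minimal illustration (not necessarily the imputation method used elsewhere in these materials), per-feature mean imputation can be sketched in base R; `X` is a hypothetical features $\times$ samples matrix and `impute_mean` is our own name, not a package function:
```{r echo=TRUE}
# Hypothetical features x samples matrix with some planted missing values
set.seed(1)
X <- matrix(rnorm(20, mean = 10), nrow = 4)
X[sample(length(X), 3)] <- NA

# Per-feature (row-wise) mean imputation -- a deliberately simple sketch
impute_mean <- function(mat) {
  t(apply(mat, 1, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  }))
}
X_imp <- impute_mean(X)
anyNA(X_imp)   # FALSE: no missing values remain
```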
---
# Outlier Detection (OD)
.pull-left[
- Extreme values that do not seem to fit the rest of the data.
- Can skew results, affect mean/variance, and lead to misleading inferences.
- Detection through variability measures in the original or reduced dimensions.
- Treatment: remove, adjust, or analyze separately.
- Ignoring outliers can result in inflated errors and biased parameter estimates.
]
.pull-right[
```{r out.width="100%", fig.align='center', fig.cap='Univariate vs bivariate outlier detection'}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_14.png")
```
]
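---
# Outlier detection: a quick sketch
As a quick sketch, Tukey's boxplot rule flags univariate outliers in base R; the data below are simulated for illustration only:
```{r echo=TRUE}
set.seed(2)
x <- c(rnorm(50), 8)        # 50 well-behaved values plus one planted extreme
boxplot.stats(x)$out        # values beyond 1.5 x IQR from the quartiles
which(abs(scale(x)) > 3)    # a simple |z| > 3 rule, for comparison
```
Neither rule is a substitute for inspecting the flagged values before deciding how to treat them.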
---
# Sample Normalization (SN)
.pull-left[
- Variation in data acquisition can introduce systematic biases, affecting comparability across samples.
- Normalization aims at making samples comparable while preserving the ability to detect genuine differences.
- Many methods exist, from simple median normalization to more elaborate approaches.
]
.pull-right[
<br>
```{r out.width="100%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_16.png")
```
Source: [Data normalization strategies in metabolomics](https://doi.org/10.1177/1469066720918446)
]
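---
# Sample normalization: a sketch
A sketch of one of the simplest options, median normalization, in base R (the function name `normalize_median` is ours, not from any package):
```{r echo=TRUE}
# Divide each sample (column) by its median, rescaled so that all
# samples end up sharing a common median -- illustrative only
normalize_median <- function(mat) {
  med <- apply(mat, 2, median, na.rm = TRUE)
  sweep(mat, 2, med / mean(med), FUN = "/")
}

M <- matrix(rexp(30, rate = 1/10), nrow = 6)   # toy intensity matrix
apply(normalize_median(M), 2, median)          # equal medians after normalization
```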
---
# Evaluate normalization effects
.pull-left[
- How can we know that normalization has produced the expected effect?
- Distinct criteria may lead to distinct choices, but
- It is always a good idea to evaluate how well any preprocessing method performs.
]
.pull-right[
```{r out.width="100%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_23.png")
```
Source: [NOREVA: normalization and evaluation of MS-based metabolomics data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5570188/)
]
---
# Data Transformations
.pull-left[
- Data may be skewed and/or heteroscedastic, violating certain test assumptions.
- Normalizing transformations (log, sqrt, etc.) may stabilize variance and approach normality.
- Proper transformation ensures that statistical assumptions are met, enhancing the validity of inferential statistics.
]
.pull-right[
<br>
```{r out.width="100%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_17.png")
```
Source: [Adaptive Box–Cox Transformation ... for Better Statistical Analysis](https://pubs.acs.org/doi/pdf/10.1021/acs.analchem.2c00503)
]
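---
# Transformations: effect on skewness
The effect is easy to see on simulated right-skewed data; the `skew` helper below is a hand-rolled moment estimator, written out only to keep the sketch dependency-free:
```{r echo=TRUE}
# Log-transforming right-skewed (e.g. log-normal) data reduces skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
set.seed(3)
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)
c(raw = skew(x), logged = skew(log(x)))   # raw clearly positive, logged near 0
```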
---
# Scaling
.pull-left[
- Data with different ranges and units cause high-variance metabolites to dominate multivariate analyses.
- Solved by scaling the data using standardization (z-score) or unit variance scaling.
- Without scaling, statistical methods like PCA or clustering may be biased towards high-variance features, misrepresenting the true data structure.
]
.pull-right[
```{r out.width="90%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_19.png")
```
Source: [ Centering, scaling, and transformations: improving the biological information content of metabolomics data](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142/tables/1)
]
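---
# Scaling: a sketch in base R
Autoscaling (unit variance) and Pareto scaling can be sketched for a features $\times$ samples matrix as follows; the function names are ours, chosen for illustration:
```{r echo=TRUE}
# Row-wise (per-feature) centring and scaling; features are rows here
autoscale <- function(mat) t(scale(t(mat)))              # divide by sd
pareto    <- function(mat) {
  t(scale(t(mat), scale = sqrt(apply(mat, 1, sd))))      # divide by sqrt(sd)
}

set.seed(8)
M <- matrix(rnorm(30, sd = rep(c(1, 10, 100), each = 10)),
            nrow = 3, byrow = TRUE)        # three features, very unequal variance
apply(autoscale(M), 1, sd)                 # every feature now has unit variance
```
Pareto scaling is a common compromise in metabolomics: it reduces, without completely removing, the dominance of high-variance features.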
---
# Centering, Scaling, Transformations
```{r out.width="100%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_20.png")
```
Source: [ Centering, scaling, and transformations: improving the biological information content of metabolomics data](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142/tables/1)
---
# In practice ... (MetaboAnalyst)
```{r out.width="60%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_21.png")
```
---
# Preprocessing in summary
.pull-left[
- The output of preprocessing steps is a cleaner, more homogeneous dataset.
- This does not necessarily mean it is a better dataset!
- Steps should not be taken blindly, but carefully
- Their need and their effect must be assessed.
- When in doubt, leave things unchanged.
- Or ask your favourite bioinformatician.
]
.pull-right[
```{r out.width="80%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_22.png")
```
Source: [Metabolomics Data Normalization with EigenMS](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0116221)
]
---
class: inverse, middle, center
name: ExploratoryAnalysis
# Exploratory Analysis
---
# Exploratory Data Analysis (EDA)
.pull-left[
- It refers to any calculation or figure that provides information about a dataset.
- The first thing (sometimes the only thing) to do in a data analysis.
- It pervades every step in the omics data analysis process
- Quality checks
- Data exploration
- Statistical tests (check assumptions, visualize results)
- Statistical modeling (check assumptions, visualize model fit)
]
.pull-right[
```{r out.width="80%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_24.png")
```
]
---
# Univariate vs Multivariate EDA
.pull-left[
- Omics data are high dimensional (multivariate): one dimension per feature.
- Features are related to each other.
- This leads to considering multivariate statistics as the natural way to try to understand data structure
- The relation between features.
- The similarity between individuals.
- This may be complex and not necessarily informative.
- It is usually complemented by uni- and bivariate exploration
]
.pull-right[
```{r out.width="30%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/eda_graphs.png")
```
]
---
# Initial Data Examination
.pull-left[
- Start any statistical analysis by looking at the data
- [Which/How many/What type of] variables,
- [Which/How many] samples
- Are there some/many missing values?
- Obtain simple summary statistics and plots
- Visualize the data, try to get a grasp of variables and individuals
]
.pull-right[
```{r out.width="60%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_25.png")
```
]
---
# Numerical summaries
- If the number of features is not huge, they provide useful information
- Location estimates: mean, median, mode, quantiles
- Dispersion estimates: variance, standard deviation, interquartile range
- For the relation between variables we can estimate the covariance between two variables, which measures their degree of linear association.
- Covariance depends on the units of the variables.
- Use the correlation coefficient to obtain a unitless scale.
- Always consider both the Pearson and the Spearman (rank) coefficients.
- Other measures are also useful, but less used
- E.g. skewness is a measure of the asymmetry of a variable. Values outside [-1, +1] suggest skewed distributions.
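---
# Numerical summaries in base R
In base R most of these summaries are one-liners; the toy vectors below are simulated for illustration:
```{r echo=TRUE}
set.seed(4)
x <- rnorm(30); y <- 2 * x + rnorm(30)
summary(x)                                  # location estimates and quartiles
c(var = var(x), sd = sd(x), IQR = IQR(x))   # dispersion estimates
cov(x, y)                                   # unit-dependent association
cor(x, y, method = "pearson")               # unitless, linear association
cor(x, y, method = "spearman")              # rank-based, robust to outliers
```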
---
# A toy dataset
```{r out.width="90%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_28.png")
```
---
# Numerical summaries
```{r out.width="80%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_32.png")
```
---
# Don't confuse features with observations!
- In classical statistics we always think of summarizing or plotting variables.
- In omics data it is common to summarize/plot samples
- It can be done, it may be useful
- But it does not mean the same.
- The rows below show the mean and SD of each sample
- The columns on the right show the mean and SD of each variable.
- They are different. They don't mean the same!
```{r out.width="60%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_29.png")
```
---
# The *king* of graphical summaries
- Boxplots provide a condensed representation of a distribution, based on quartiles and outliers.
- Being a flat representation, it facilitates comparisons.
```{r out.width="100%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_31.png")
```
---
# Again, they are not the same!
```{r out.width="90%", fig.align='center', fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground_insertimage_30.png")
```
---
# From uni/bivariate to multivariate EDA
- Omics data are high dimensional
- Because of this, it makes full sense to try to look *at all variables at once*
- Although possible in moderate settings, it is generally very difficult to extract useful information.
- An alternative is to rely on dimension reduction methods, that retain some/most information and project it to lower dimensional spaces where this can be visualized.
- For this, we will go through a series of Multivariate Exploratory (Unsupervised) techniques.
![illustrative example of dimension reduction.](images/1-StatisticsBackground_insertimage_33.png)
---
class: inverse, middle, center
name: StatisticalTests
# On tests and p-values
---
# The *Class comparison* problem
- Main goal: Identifying significantly different features
- Identify features whose values are (significantly)
associated with different conditions
- Treatment, cell type,... (qualitative covariates)
- Dose, time, ... (quantitative covariate)
- Survival, infection time, ...
- Estimate effects/differences between groups,
- either directly: $D = Y - X$ or
- in a log scale (using ratios): $log(X)-log(Y) = log(X/Y)$.
---
# What is a “significant change”?
.pull-left[
- Depends on the variability within groups.
- Variability, of course, may differ from feature to feature.
- To assess the statistical significance of observed differences, a statistical test is usually conducted *for each feature*.
- There also exist multivariate tests to make all comparisons at once, but the nature of the data usually makes them unfeasible.
]
.pull-right[
![Plot title. ](images/1-StatisticsBackground-2_insertimage_1.png)
]
---
# Different settings for statistical tests (1)
- **Indirect comparisons**: 2 groups, 2 samples, unpaired
- E.g. **10** individuals: 5 suffer diabetes, 5 healthy
- One sample for each individual
- Typically: Two sample t-test or similar
```{r out.width="100%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_3.png")
```
---
# Different settings for statistical tests (2)
- **Direct comparisons:** Two groups, two samples, **paired**
- E.g. 6 individuals with brain stroke.
- Two samples from each: one from healthy (region 1) and one
from affected (region 2). That is a total of 2*6 = 12 samples
- Typically: One sample t-test (also called paired t-test) or
similar, based on the individual differences between
conditions.
```{r out.width="80%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_4.png")
```
---
# Some issues in feature selection
- Each technology’s data may have peculiarities that
have to be dealt with.
- Some related to small sample sizes
- Variance instability
- Non-normality of the data
- Others related to the large number of variables
- Multiple testing
---
# Variance instability
- Can we trust average effect sizes (average
difference of means) alone?
- Can we trust the t-statistic alone?
- Here is evidence that the answer is no.
```{r out.width="80%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_5.png")
```
---
# Variance instability (1): outliers
- Can we trust average effect sizes (average
difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.
```{r out.width="80%", fig.cap='Averages can be driven by outliers'}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_6.png")
```
---
# Variance instability (2): tiny variances
- Can we trust average effect sizes (average
difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.
```{r out.width="80%", fig.cap='t-values can be driven by tiny variances'}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_7.png")
```
---
# Solutions: Adapt t-tests
.pull-left[
- A standard solution: combine
- Local estimates of variability, $SE_g$ (based on individual features)
- With global estimates, $SE$ (based on all features together)
- This results in *moderated estimators* that account simultaneously for
- The variability of individual features
- And that of all features together.
]
.pull-right[
```{r out.width="100%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_8.png")
```
]
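---
# Moderation: an oversimplified sketch
A deliberately oversimplified sketch of the idea in base R: each feature's standard error is shrunk towards the across-feature median (real implementations, such as limma's `eBayes`, derive the amount of shrinkage via empirical Bayes rather than fixing it):
```{r echo=TRUE}
# Shrink per-feature standard errors towards a global (median) estimate;
# the 50/50 weight is arbitrary, chosen only for illustration
moderate_se <- function(se, w = 0.5) {
  w * se + (1 - w) * median(se)
}

se <- c(0.01, 0.2, 0.25, 0.3, 5)   # one tiny and one huge variance estimate
moderate_se(se)                    # extremes are pulled towards the middle
```
Note how the tiny SE, the one that would produce a spuriously large t-value, is the one most affected.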
---
# Up to here…:
- Can we generate a list of candidate features?
- With the tools we have, the reasonable steps to generate a list of candidate features may be:
```{r out.width="80%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_9.png")
```
- We need to be able to figure out how significant these values are.
- The traditional, somewhat controversial, approach:
- Assign them p-values
- Use these to select the features to be retained (*but see later*)
---
# Nominal p-values
- After a test statistic is computed, it is convenient to convert it to a p-value:
- It is defined as *the probability that the test statistic, say $S(X)$, takes values equal to or greater than the one observed, $S(X_0)$, under the assumption that the null hypothesis is true*:
$$
p = P\{S(X) \geq S(X_0) \,|\, H_0 \mbox{ true}\}
$$
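---
# Nominal p-values: a quick check
For a two-sample t statistic, the definition translates directly into a tail probability of the reference distribution; a quick check in base R on simulated data:
```{r echo=TRUE}
set.seed(5)
x <- rnorm(6); y <- rnorm(6, mean = 1)
tt <- t.test(x, y, var.equal = TRUE)

# Two-sided p-value: probability of a |t| at least as large as the observed one
p_manual <- 2 * pt(-abs(tt$statistic), df = tt$parameter)
all.equal(unname(p_manual), tt$p.value)   # TRUE
```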
---
# Significance testing
- Test of significance at the $\alpha$ level:
- Reject the null hypothesis if your p-value is smaller than the significance level
- It has advantages but it is not free from criticism
- Features with p-values falling below a prescribed level may be regarded as significant
- As we know, depending on what the truth is, this can lead to
- Two types of correct decisions
- Two possible types of errors
---
# Hypothesis testing overview
```{r out.width="100%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_10.png")
```
---
# Calculation of p-values (1)
- The standard method for calculating p-values is to use tabulated values *for the distribution that the test statistic is assumed to follow*.
- This assumption, however, may be harder to check than one would expect.
- In the toy dataset, where each variable has only 6 observations, the normality assumption is practically impossible to check!
- If the sample size is bigger it may be possible to run some goodness-of-fit test, but it should be done carefully and for all features in the dataset.
- These kinds of checks are usually omitted and the *validity conditions are assumed to be true*.
- For some statistics, which are robust to departures from the assumptions, this may work in a wide range of conditions.
- Even so, it may be a good idea to look for alternatives.
---
# Calculation of p-values (2) Permutations tests
- Permutation tests are a good alternative to parametric, or even non-parametric, tests.
- Based on data shuffling. No distributional assumptions (only *exchangeability* is required).
- Relatively simple to understand and implement. They are based on
- Random interchange of labels between samples
- Estimation of p-values based on the approximate permutation distribution of the test statistic.
---
# Permutation tests algorithm
- Repeat, for each feature $\mathbf{x}_i,\, i=1,\dots,N$:
- For every permutation $b = 1,\dots,B$ of its observations
- Permute the $n$ data points for that feature.
- Designate the first $n_1$ as "treatments", the remaining $n_2$ as "controls"
- Calculate the corresponding two-sample test statistic, $t_b$
- After all $B$ permutations are done, approximate the $p$-value by:
$$p = \frac{\#\{b :\, |t_b| \geq |t_{obs}|\}}{B}$$
- Notice that **all these steps have to be performed for all features**,
- that is, permutation tests are *computationally intensive*!
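---
# Permutation tests: a one-feature sketch
The algorithm above can be sketched for a single feature in base R; for $N$ features it would be wrapped in a loop or an `apply` call (which is exactly why these tests are computationally intensive). `perm_test` is our own illustrative function:
```{r echo=TRUE}
# Permutation p-value for one feature, using |difference of group means|
perm_test <- function(x, group, B = 1000) {
  t_obs <- abs(diff(tapply(x, group, mean)))
  t_b <- replicate(B, {
    abs(diff(tapply(x, sample(group), mean)))   # shuffle the labels
  })
  mean(t_b >= t_obs)                            # proportion as extreme as observed
}

set.seed(6)
x <- c(rnorm(6), rnorm(6, mean = 2))            # a true group difference
group <- rep(c("control", "treated"), each = 6)
perm_test(x, group)                             # small p-value expected
```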
---
# Permutation tests (2)
```{r out.width="100%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_11.png")
```
---
# The volcano plot: fold change vs -log(pvalue)
```{r out.width="100%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_12.png")
```
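A volcano plot is straightforward with base graphics; the simulated values and the thresholds below are illustrative only:
```{r echo=TRUE, eval=FALSE}
set.seed(7)
lfc  <- rnorm(200)                     # simulated log2 fold changes
pval <- runif(200)^2                   # simulated p-values, skewed towards 0
plot(lfc, -log10(pval), pch = 19, cex = 0.6,
     xlab = "log2 fold change", ylab = "-log10(p-value)")
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)   # usual cut-offs
```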
---
class: inverse, middle, center
name: MultipleTesting
# Multiple testing
---
# The Multiple Testing problem
- Whatever approach we use to detect significant differences in features, there is a common characteristic:
- *Every test is applied to every feature in a long collection of features*
- This leads to a *multiple testing problem*:
- As the number of tests increases
- The probability of observing at least one false positive also increases
- In order to avoid an artificial inflation of *false positive discoveries*, some adjustments (also called "corrections") are recommended.
---
# Why multiple testing matters in omics
- The probability of observing a false positive when testing once is:
- P(making a type I error) = $\alpha$
- P(not making a type I error) = $1-\alpha$
- Now imagine we perform $m$ tests independently
- P(not making a type I error in $m$ tests) = $(1-\alpha)^m$
- P(making at least one type I error in $m$ tests) = $1-(1-\alpha)^m$
- As $m$ increases, the probability of making at least one type I error also increases
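---
# How fast does it grow?
The numbers grow quickly; evaluating the formula above in R:
```{r echo=TRUE}
alpha <- 0.05
m <- c(1, 10, 100, 1000)
round(1 - (1 - alpha)^m, 3)   # 0.050 0.401 0.994 1.000
```
With only 100 independent tests at $\alpha = 0.05$, at least one false positive is almost guaranteed.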
---
# Type I error is not useful in omics
![](images/1-StatisticsBackground-2_insertimage_13.png)
---
# How can we deal with this issue?
- Controlling the type I error per test is not feasible when many tests are performed.
- There are distinct strategies to deal with it:
1. *Extend the idea of type I error*: FWER and FDR are two extensions that modify the error rate with the aim of providing a "global" control of the error probability.
2. *Look for procedures that control these extended error types*: mainly, this means adjusting raw p-values.
- AN ANALOGY: Indiana Jones' bridge
- *Would you cross a bridge once if the probability that it breaks down is 0.001?*
- *Would you cross it 10,000 times?*
- *What would you do if you decided not to cross that bridge?*
---
# Error rate extensions and p-value adjustments
- Family Wise Error Rate (FWER)
- FWER is the probability of observing at least one false positive
- False Discovery Rate (FDR)
- False Discovery Rate is the *expected value of proportion of false positives* among rejected null hypotheses.
- Each type of error rate can be associated with distinct types of p-value adjustments
- Bonferroni method is used to provide control of FWER
- Benjamini-Hochberg (q-value) is used to provide control of FDR.
---
# Difference between FWER and FDR
- FWER controls the probability of any (even a single) false positive
- Controlling FWER yields fewer selected features (and fewer false positives),
- but you are likely to miss many true positives.
- FWER is adequate if the goal is to identify the few features that differ between two groups.
- FDR controls the expected proportion of false positives
- If you can tolerate more false positives
- you will get many fewer false negatives
- Adequate if the goal is to pursue the study further, e.g. to determine functional relationships among features.
---
# Steps to generate a list of candidate features (2)
```{r out.width="90%", fig.cap=''}
knitr::include_graphics("images/1-StatisticsBackground-2_insertimage_14.png")
```
---
# An example
- A list of 63 potentially significant p-values has been adjusted using Bonferroni and BH
- Bonferroni is clearly more restrictive than BH, which, in turn, is more restrictive than the raw p-values.
```{r echo=FALSE}
cachexia.t_test <- read.csv("datasets/cachexia-t_test_all.csv", row.names=1)
rawPs <- cachexia.t_test$p.value
names(rawPs) <- row.names(cachexia.t_test)
bonfP <- p.adjust(rawPs, method= c("bonferroni"))
bhP <- p.adjust(rawPs, method= c("BH"))
pVals <-data.frame(raw = rawPs, Bonferroni=bonfP, FDR = bhP)
Ordered <-round(pVals[order(pVals$raw),] ,6)
```
<small>
.pull-left[
```{r echo=FALSE}
kableExtra::kable(Ordered[1:8,])
```
P-values at the top of the table
]
.pull-right[
```{r echo=FALSE}
kableExtra::kable(Ordered[41:48,])
```
P-values at the bottom of the table
]