# (APPENDIX) Appendix {-}
# Math and linear algebra
## Norms, distances, trace, inequalities
Take a look at [this](https://www.math.usm.edu/lambers/mat610/sum10/lecture2.pdf), and [this](http://www-math.mit.edu/~djk/calculus_beginners/chapter03/section03.html).
Just remember that for a complex number $z = a+ib$, $z\overline{z} = a^2 + b^2= |z|^2$, that $\frac{1}{z} = \frac{\overline{z}}{\overline{z}z}$, and that conjugation is the mapping $z = (a+ib) \mapsto \overline{z} = (a-ib)$.
```{definition, name="Inner product"}
A function $f: V \times V \mapsto \mathbb{C}$, $(a,b) \mapsto z$, is called an inner product if:
- $(a,b+c) = (a,b)+(a,c)$ (and respectively for $(a+c,b)$ )
- $(a,\alpha b) = \alpha (a,b)$
- $(a,b) = \overline{(b,a)}$ (conjugate symmetry)
- $(a,a) \geq 0$, with $(a,a)=0$ iff $a=0$ (positive definiteness)
```
```{definition, name="Norm"}
A function $f : \mathbb{R}^d \mapsto \mathbb{R}$ is called a norm if:
- $\|x\| \geq 0 \; \forall x \in \mathbb{R}^{d}$, and $\|x\|=0$ iff $x=0$ (positive definiteness)
- $\|\alpha x\| = |\alpha| \|x\| \; \forall x \in \mathbb{R}^{d}$ and $\forall \alpha \in \mathbb{R}$ (absolute homogeneity)
- $\|x\|-\|y\| \leq \|x + y\| \leq \|x\| + \|y\|$ (triangle inequality).
```
Note that along with the triangle inequality you might also need to know the [reverse triangle inequality](https://www.quora.com/What-is-an-intuitive-explanation-of-the-reverse-triangle-inequality).
The triangle inequality is basically the [subadditivity](https://en.wikipedia.org/wiki/Subadditivity) property of the norm. It is simple to see that norms are **not** linear operators.
```{theorem, name="Cauchy(-Bunyakovsky)-Schwarz"}
$$|(x,y)| \leq \|x\| \|y\|$$
```
```{proof}
Note that by taking the square on both sides we get: $(x,y)^2 \leq (x,x)(y,y)$.
Substituting $(x,y)=\|x\| \|y\| \cos(\theta)$, we get:
$$|\|x\|^2\|y\|^2 \cos^2(\theta) | \leq (x,x)(y,y)$$
The inequality follows from noting that $|\cos(\theta)|$ is always $\leq 1$.
```
```{remark}
It is simple to see - using Cauchy-Schwarz - that for a vector $x$ we have that:
$$\|x\|_1 \leq \sqrt{n} \|x\|_2 $$
```
<!-- \paragraph{Matrix norms} -->
<!-- \begin{itemize} -->
<!-- \item All properties of vector norms .. -->
<!-- \item Submultiplicativity -->
<!-- \item -->
<!-- \end{itemize} -->
We will use the following matrix norms:
- $\|A\|_0$ is the number of non-zero elements of the matrix $A$,
- $\|A\|_1 = \max\limits_{1 \leq j \leq n} \sum_{i=1}^m |a_{ij}|$ is the maximum among the sums of the absolute values along the columns of the matrix,
- $\|A\|_2 = \|A\| = \sigma_1$ is the biggest singular value of the matrix,
- $\|A\|_\infty = \max\limits_{1 \leq i \leq m} \sum_{j=1}^n |a_{ij}|$ is the maximum among the sums of the absolute values along the rows of the matrix,
- $\|A\|_{\max}$ is the maximal element of the matrix in absolute value,
- $\|A\|_F$ is the Frobenius norm of the matrix, defined as $\sqrt{\sum_{ij}a_{ij}^2}$.
Note that for symmetric matrices, $\|A\|_\infty = \|A\|_1$.
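These norms are easy to sanity-check numerically. The minimal sketch below, on an arbitrary random matrix, computes each of them with base R's `norm()` or directly from the definition.
```{r}
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)   # an arbitrary 4x3 test matrix

sum(A != 0)            # "zero norm": number of non-zero entries
norm(A, type = "O")    # l1 norm: maximum column sum of absolute values
norm(A, type = "2")    # spectral norm: largest singular value
max(svd(A)$d)          # the same value, computed explicitly from the SVD
norm(A, type = "I")    # l-infinity norm: maximum row sum of absolute values
norm(A, type = "M")    # max norm: largest entry in absolute value
norm(A, type = "F")    # Frobenius norm
sqrt(sum(A^2))         # the same value, from the definition
```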
<!-- from dequantization of qsfa -->
```{exercise, name="bound error on product of matrices"}
Suppose that $\|A - \overline{A}\|_F \leq \epsilon \|A\|_F$.
Bound $\|A^TA - \overline{A}^T\overline{A}\|_F$
```
<!-- ```{proof} -->
<!-- $$A^TA + \overline{A}^TA - \overline{A}^TA - \overline{A}^T\overline{A}=$$ -->
<!-- $$( A^TA + \overline{A}^TA - \overline{A}^TA - \overline{A}^T\overline{A}=$$ -->
<!-- ``` -->
```{definition, distance, name="Distance"}
A function $f : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is called a distance if:
- $d(x,y) \geq 0$
- $d(x,y) = 0$ iff $x=y$
- $d(x,y)=d(y,x)$
- $d(x,z) \leq d(x,y) + d(y,z)$
```
```{definition, convexity-concavity, name="Convex and concave function"}
A function $f$ defined on a convex subset $D$ of a vector space is said to be *concave* if, $\forall \alpha \in [0,1]$, and $\forall x,y \in D$:
$$f\left( (1-\alpha)x + \alpha y \right) \geq (1-\alpha) f(x) + \alpha f(y)$$
Conversely, a function $f$ defined on a convex subset $D$ of a vector space is said to be *convex* if, $\forall \alpha \in [0,1]$, and $\forall x,y \in D$:
$$f\left( (1-\alpha)x + \alpha y \right) \leq (1-\alpha) f(x) + \alpha f(y)$$
```
#### Properties of the trace operator
- $Tr[A+B] = Tr[A] + Tr[B]$
- $Tr[A\otimes B] = Tr[A]Tr[B]$
- $Tr_1[A\otimes B] = Tr[A]B$
- $Tr[\ket{a}\bra{b}] = \braket{b|a}$
- $Tr[A^TB] = \langle A, B \rangle$
where the inner product between matrices is defined entry-wise as $\langle A, B \rangle = \sum_{ij}a_{ij}b_{ij}$
```{exercise}
Can you show that the last identity is true?
```
#### Properties of tensor product
Given two linear maps $A_1 : V_1 \mapsto W_1$ and $A_2 : V_2 \mapsto W_2$ we define their tensor product as the linear map:
$$A_1 \otimes A_2 : V_1 \otimes V_2 \mapsto W_1 \otimes W_2 $$
- $\alpha v \otimes w = v \otimes \alpha w = \alpha(v \otimes w)$
- $( v_1 + v_2) \otimes w = (v_1 \otimes w) + (v_2 \otimes w)$ (and symmetrically in the second argument)
- $\ket{\psi_1}\bra{\phi_1} \otimes \ket{\psi_2}\bra{\phi_2} = (\ket{\psi_1}\otimes\ket{\psi_2})(\bra{\phi_1}\otimes\bra{\phi_2})$
When a basis is chosen for representing linear maps between vector spaces, the tensor product becomes the Kronecker product.
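In R the Kronecker product is available as `kronecker()` (or the `%x%` operator), so the tensor-product and trace identities above can be checked on small random matrices; the following is a minimal sketch.
```{r}
set.seed(2)
A <- matrix(rnorm(9), 3, 3)
B <- matrix(rnorm(9), 3, 3)
tr <- function(M) sum(diag(M))   # trace of a square matrix

tr(A %x% B)                      # Tr[A tensor B] ...
tr(A) * tr(B)                    # ... equals Tr[A] Tr[B]

tr(t(A) %*% B)                   # Tr[A^T B] ...
sum(A * B)                       # ... equals the entry-wise inner product
```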
#### Useful inequalities
```{r, amgm-images, fig.cap="AMGM inequality from wikipedia and twitter", echo=FALSE}
knitr::include_graphics("images/amgm1.png")
knitr::include_graphics("images/amgm2.png")
```
```{theorem, binomial-theorem, name="Binomial theorem"}
$$(a+b)^n = \sum_{k=0}^n {n \choose k} a^kb^{n-k}$$
```
## Linear algebra
### Eigenvalues, eigenvectors and eigendecomposition of a matrix
Real matrices are important tools in Machine Learning, as they allow us to comfortably represent data and to describe the operations to perform during an algorithm. Eigenvectors and eigenvalues are fundamental linear algebra concepts that provide important information about a matrix.
(ref:linearalgebragilbert) [@linearalgebragilbert]
```{definition, name="Eigenvalues and Eigenvectors (Section 6.1 page 289 from (ref:linearalgebragilbert) )"}
Let $A$ be a $\mathbb{R}^{n\times n}$ square matrix, $q \in \mathbb{R}^{n}$ a non-zero vector and $\lambda$ a scalar. If the following equation is satisfied
$$Aq = \lambda q,$$
then $q$ is said to be an eigenvector of matrix $A$ and $\lambda$ is its associated eigenvalue.
```
To have a geometric insight into what eigenvectors and eigenvalues are, we can think of any matrix as a linear transformation in the $\mathbb{R}^n$ space. Under this light, we can say that the eigenvectors of a matrix are those vectors of the space that, after the transformation, lie on their original direction and only get their magnitude scaled by a certain factor: the eigenvalue.
The eigenvalues reveal interesting properties of a matrix. For example, the trace of a matrix (i.e. the sum of the elements along the main diagonal of a square matrix) is the sum of its eigenvalues
$$Tr[A] = \sum_i^{n}\lambda_i,$$
and its determinant is equal to the product of the eigenvalues (Section 6.1 page 294 from [@linearalgebragilbert])
$$det(A) = \prod_i^{n}\lambda_i.$$
Moreover, a matrix $A$ with eigenvalues $\{\lambda_1, ..., \lambda_k\}$ has an inverse if and only if all the eigenvalues are non-zero. The inverse has eigenvalues $\{\frac{1}{\lambda_1}, ..., \frac{1}{\lambda_k}\}$.
Generally, one eigenvalue can be associated with multiple eigenvectors. There might be a set of vectors $E(\lambda) \subseteq \mathbb{R}^n$ such that for all those vectors $q \in E(\lambda): Aq = \lambda q$. That is why for each eigenvalue we talk about an eigenspace.
(ref:scapellato) [@scapellato]
```{definition, eigenspace, name="Eigenspace (Definition 7.1.5 page 108 (ref:scapellato) )"}
Let $A$ be a $\mathbb{R}^{n\times n}$ square matrix and $\lambda$ be an eigenvalue of $A$. The eigenspace of $A$ related to $\lambda$ is the space defined over the set of vectors $E(\lambda) = \{ x: Ax = \lambda x\}$.
```
For each eigenspace, through the Gram-Schmidt procedure, starting from linearly independent vectors it is possible to identify a set of orthogonal eigenvectors that constitute a basis for the space. The basis that spans the space where all the eigenvectors of a matrix lie is called eigenbasis.
```{definition, name="Eigenbasis"}
A basis for the space where all the eigenvectors of a matrix lie is called eigenbasis.
```
An important result is that vectors in different eigenspaces are linearly independent.
```{lemma, name="Linear independence of eigenvectors (Lemma 7.2.3 page 112 from (ref:scapellato) )"}
The set of vectors obtained by the union of the bases of the eigenspaces of a matrix is linearly independent.
```
This means that if the sum of the dimensions of the eigenspaces $\sum_i dim(E(\lambda_i))$ equals $n$, it is possible to find $n$ eigenvectors of $A$ that form a basis for the $\mathbb{R}^n$ space. If that is the case, each vector that lies in $\mathbb{R}^n$ can be written as a linear combination of the eigenvectors of $A$.
Interestingly, matrices that have $n$ linearly independent eigenvectors can be decomposed in terms of their eigenvalues and eigenvectors.
```{theorem, name="Eigendecomposition or Diagonalization"}
\cite[Section 6.2 page 304]{linearalgebragilbert}
Let $A \in \mathbb{R}^{n \times n}$ be a square matrix with $n$ linearly independent eigenvectors. Then, it is possible to decompose the matrix as
$$A = Q\Lambda Q^{-1}.$$
where $Q \in \mathbb{R}^{n\times n}$ is an invertible square matrix and $\Lambda \in \mathbb{R}^{n\times n}$ is a diagonal matrix. In particular, the $i^{th}$ column of $Q$ is an eigenvector of $A$ and the $i^{th}$ entry of $\Lambda$ is its associated eigenvalue.
```
The matrices that can be eigendecomposed are also said to be diagonalizable, as in practice the theorem above states that such matrices are similar to diagonal matrices. Unfortunately, not all square matrices have enough independent eigenvectors to be diagonalized. The Spectral Theorem provides us with a class of matrices that can always be eigendecomposed.
```{theorem, spectral-theorem, name="Spectral theorem"}
\cite[Spectral Theorem page 339]{linearalgebragilbert}
Every symmetric matrix is diagonalizable $A = Q\Lambda Q^{-1}$. Furthermore, its eigenvalues are real and it is possible to choose the columns of $Q$ so that it is an orthogonal matrix.
```
Recall that a matrix $Q$ is said to be orthogonal if $QQ^T=Q^TQ = \mathbb{I}$, therefore $Q^{-1} = Q^T$.
The Spectral theorem, together with the fact that matrices like $A^TA$ and $AA^T$ are symmetric, will come in handy in later discussions.
Being able to eigendecompose a matrix allows performing some computations faster than otherwise. Some examples of operations that gain speed-ups from the eigendecomposed representation are matrix inversion and matrix exponentiation. Indeed, if we have a matrix $A=Q\Lambda Q^{-1}$ its inverse can be computed as $A^{-1}=Q\Lambda^{-1}Q^{-1}$
where $\Lambda^{-1} = diag([\frac{1}{\lambda_1}, ... \frac{1}{\lambda_n}])$. It is easy to check that this matrix is the inverse of $A$:
$$AA^{-1} = (Q\Lambda Q^{-1})(Q\Lambda^{-1}Q^{-1}) = Q\Lambda\Lambda^{-1}Q^{-1} = QQ^{-1} = \mathbb{I}$$
$$A^{-1}A = (Q\Lambda^{-1}Q^{-1})(Q\Lambda Q^{-1}) = Q\Lambda^{-1}\Lambda Q^{-1} = QQ^{-1} = \mathbb{I}.$$
At the same time, the eigendecomposition of a matrix allows performing matrix exponentiation much faster than through the usual matrix multiplication. In fact, it is true that $A^p = Q\Lambda^pQ^{-1}$.
For instance,
$$A^3 = (Q\Lambda Q^{-1})(Q\Lambda Q^{-1})(Q\Lambda Q^{-1}) = Q\Lambda(Q^{-1}Q)\Lambda(Q^{-1}Q)\Lambda Q^{-1} = Q\Lambda\Lambda\Lambda Q^{-1} = Q\Lambda^3Q^{-1}.$$
Computing big matrix powers such as $A^{100}$, with its eigendecomposed representation, only takes two matrix multiplications instead of a hundred.
Traditionally, the computational effort of performing the eigendecomposition of a $\mathbb{R}^{n \times n}$ matrix is in the order of $O(n^3)$ and may become prohibitive for large matrices [@partridge1997fast].
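As an illustration, the sketch below builds an arbitrary symmetric matrix (so that the Spectral theorem applies), computes its eigendecomposition with base R's `eigen()`, reconstructs it, and compares a matrix power obtained through the decomposition with naive repeated multiplication.
```{r}
set.seed(3)
M <- matrix(rnorm(16), 4, 4)
A <- M + t(M)                        # symmetric, hence diagonalizable with orthogonal Q

e <- eigen(A)                        # A = Q Lambda Q^{-1}
Q <- e$vectors
Lambda <- diag(e$values)

max(abs(A - Q %*% Lambda %*% t(Q)))  # reconstruction error (Q orthogonal, so Q^{-1} = Q^T)

A5_eig   <- Q %*% diag(e$values^5) %*% t(Q)   # A^5 via the eigendecomposition
A5_naive <- A %*% A %*% A %*% A %*% A         # A^5 via repeated multiplication
max(abs(A5_eig - A5_naive))
```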
### Singular value decomposition
<!--
# TODO Adjust citations and labels
-->
Eigenvalues and eigenvectors can be computed only for square matrices. Moreover, not all matrices can be eigendecomposed. For this reason, we introduce the concepts of singular values and singular vectors, which are closely related to eigenvalues and eigenvectors, and offer a decomposition for any kind of matrix.
```{theorem, svd, name="Singular Value Decomposition (Sections 7.1, 7.2 from [@linearalgebragilbert])"}
\cite[Sections 7.1, 7.2]{linearalgebragilbert}
Any matrix $A \in \mathbb{R}^{n \times m}$ can be decomposed as
$$A = U\Sigma V^T$$
where $U \in \mathbb{R}^{n\times r}$ and $V \in \mathbb{R}^{m\times r}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix. In particular, the $i^{th}$ columns of $U$ and $V$ are respectively called the left and right singular vectors of $A$, and the $i^{th}$ entry of $\Sigma$ is their associated singular value. Furthermore, $r$ is a natural number no larger than $\min(n, m)$.
```
Another (equivalent) definition of the SVD is the following:
$$A=(U, U_0)\begin{pmatrix}
\Sigma & 0 \\
0 & 0
\end{pmatrix} (V, V_0)^T.$$
The matrix $\Sigma$ is a diagonal matrix with $\Sigma_{ii}=\sigma_i$ being the singular values (which we assume to be sorted $\sigma_1 \geq \dots \geq \sigma_n$).
The matrices $(U, U_0)$ and $(V, V_0)$ are unitary matrices, which contain a basis for the column and the row space (respectively $U$ and $V$) and the left null-space and right null-space (respectively $U_0$ and $V_0$). Oftentimes, it is simpler to define the SVD of a matrix by simply discarding the left and right null spaces, as $A=U\Sigma V^T$, where $U,V$ are orthogonal matrices and
$\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix with real elements, as we did in Theorem \@ref(thm:svd).
Similarly to how eigenvalues and eigenvectors have been defined previously, for each pair of left and right singular vectors and the associated singular value, the following equation holds:
$$Av = \sigma u.$$
If we consider the Singular Value Decomposition (SVD) under a geometric perspective, we can see any linear transformation as the result of a rotation, a scaling, and another rotation. Indeed, if we compute the product between a matrix $A \in \mathbb{R}^{n\times m}$ and a vector $x \in \mathbb{R}^m$
$$Ax = U\Sigma V^Tx = (U(\Sigma (V^Tx))).$$
$U$ and $V^T$, being orthogonal matrices, only rotate the vector without changing its magnitude, while $\Sigma$, being a diagonal matrix, alters its length.
It is interesting to note that the singular values of $A$ - denoted as $\{\sigma_1,..., \sigma_r\}$ - are the square roots $\{\sqrt{\lambda_1},..., \sqrt{\lambda_r}\}$ of the eigenvalues of $AA^T$ (or $A^TA$) and that the left and right singular vectors of $A$ - denoted as $\{u_1, ..., u_r\}$ and $\{v_1, ..., v_r\}$ - are respectively the eigenvectors of $AA^T$ and $A^TA$.
The fact that each matrix can be decomposed in terms of its singular vectors and singular values, as in the theorem above, makes the relationship between singular values - singular vectors of a matrix and eigenvalues - eigenvectors of its products with the transpose clearer:
$$AA^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma V^TV\Sigma U^T = U\Sigma ^2U^T;$$
$$A^TA =
(U\Sigma V^T)^T(U\Sigma V^T) =
V\Sigma U^TU\Sigma V^T = V\Sigma ^2V^T.$$
Note that the matrices $AA^T$ and $A^TA$ are symmetric matrices and so, by the Spectral theorem, we can always find an eigendecomposition. Moreover, note that they have non-negative eigenvalues: being the square roots of real non-negative eigenvalues, the singular values of a real matrix are always real non-negative numbers.
As the left and right singular vectors are eigenvectors of symmetric matrices, they can be chosen to be orthogonal as well. In particular, the left singular vectors of a matrix span its column space, and the right singular vectors span its row space.
```{definition, column_row_space, name="Column (row) Space (Definition 8.1 page 192 [@schlesinger])"}
\cite[Definition 8.1 page 192]{schlesinger} Let $A$ be a $\mathbb{R}^{n\times m}$ matrix. The column (row) space of $A$ is the space spanned by the column (row) vectors of $A$. Its dimension is equal to the number of linearly independent columns (rows) of $A$.
```
The number $r$ of singular values and singular vectors of a matrix is its rank.
```{definition, rank, name="Rank of a matrix (Definition 8.3, Proposition 8.4 page 193-194 from [@schlesinger]) "}
The rank of a matrix is the number of linearly independent rows/columns of the matrix. If the matrix belongs to the $\mathbb{R}^{n\times m}$ space, the rank is less than or equal to $min(n,m)$. A matrix is said to be full rank if its rank is equal to $min(n,m)$.
```
The dimension of the null-space is the number of linearly-dependent columns. For a rank $k$ matrix, the Moore-Penrose pseudo-inverse is defined as $\sum_{i}^k \frac{1}{\sigma_i}u_i v_i^T$. Another relevant property of SVD is that the nonzero singular values and the corresponding singular vectors are the nonzero eigenvalues and eigenvectors of the matrix
$\begin{pmatrix}
0 & A \\
A^T & 0
\end{pmatrix}$:
$$ \begin{pmatrix}
0 & A \\
A^T & 0
\end{pmatrix} \begin{pmatrix}
u_i \\
v_i
\end{pmatrix}
= \sigma_i \begin{pmatrix}
u_i \\
v_i
\end{pmatrix}
$$
With $s(A)$, or simply $s$, we denote the sparsity of $A$, that is, the maximum number of non-zero elements in a row.
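The sketch below, on an arbitrary rectangular matrix, uses base R's `svd()` and checks the relation between the singular values of $A$ and the eigenvalues of $A^TA$ discussed above.
```{r}
set.seed(4)
A <- matrix(rnorm(20), nrow = 5, ncol = 4)

s <- svd(A)                                    # A = U diag(d) V^T
max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))     # reconstruction error

s$d^2                                          # squared singular values ...
eigen(t(A) %*% A)$values                       # ... match the eigenvalues of A^T A
```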
### Singular vectors for data representation
<!--
# TODO Adjust citations and labels
-->
Singular values and singular vectors provide important information about matrices and allow us to speed up certain kinds of calculations. Many data analysis algorithms, such as Principal Component Analysis, Correspondence Analysis, and Latent Semantic Analysis, which will be further investigated in the following sections, are based on the singular value decomposition of a matrix.
To begin with, the SVD representation of a matrix allows us to better understand some matrix norms, like the spectral norm and the Frobenius norm [@linearalgebragilbert].
::: {.definition #spectral_norm name="l2 (or Spectral) norm"}
Let $A \in \mathbb{R}^{n\times m}$ be a matrix. The $l_2$ norm of $A$ is defined as $\|A\|_2 = \max_{x \neq 0}\frac{\|A x \|}{\|x\|}$.
:::
It is pretty easy to see that $\|A\|_2 = \sigma_{max}$, where $\sigma_{max}$ is the greatest singular value of $A$. In particular, if we consider again the matrix $A =U \Sigma V ^T$ as a linear transformation, we see that $U$ and $V ^T$ only rotate vectors $||U x ||=||x ||$, $||V x ||=||x ||$ while $\Sigma$ changes their magnitude $||\Sigma x || \leq \sigma_{max}||x ||$. For this reason, the $l_2$ Norm of a matrix is also referred to as the Spectral Norm. During the rest of the work we will also use the notation $||A ||$ to refer to the Spectral Norm.
Another important matrix norm that benefits from SVD is the Frobenius norm, defined in the following way.
::: {.definition name="Frobenius norm"}
Let $A \in \mathbb{R}^{n\times m}$ be a matrix. The Frobenius norm of $A$ is defined as $\|A\|_F = \sqrt{\sum_i^n\sum_j^m a_{ij}^2}$.
:::
It can be shown that also this norm is related to the singular values.
::: {.proposition}
The Frobenius norm of a matrix $A \in \mathbb{R}^{n\times m}$ is equal to the square root of the sum of squares of its singular values.
$$\|A\|_F = \sqrt{\sum_i^r \sigma_{i}^2}$$
:::
```{proof}
$$||A ||_F = \sqrt{\sum_i^n\sum_j^m a_{ij}^2} = \sqrt{Tr[A A ^T]} = \sqrt{Tr[(U \Sigma V ^T)(U \Sigma V ^T)^T]} =$$ $$\sqrt{Tr[U \Sigma V ^TV \Sigma U ^T]} = \sqrt{Tr[U \Sigma \Sigma U ^T]} = \sqrt{Tr[U \Sigma ^2U ^T]} = \sqrt{\sum_{i=1}^{r}\sigma_i^2}$$
From the cyclic property of the trace $Tr[A B ] = Tr[B A ]$ it follows that $Tr[U \Sigma ^2U ^T] = Tr[U ^TU \Sigma ^2] = Tr[\Sigma ^2]$, which is the sum of the squares of the singular values $\sum_{i=1}^{r}\sigma_i^2$.
```
Another interesting result about the SVD of a matrix is known as the Eckart–Young–Mirsky theorem.
```{theorem, eckart-young-mirsky, name="Best F-Norm Low Rank Approximation"}
\cite{eckart1936approximation}\cite{mirsky1960symmetric}
Let $A \in \mathbb{R}^{n \times m}$ be a matrix of rank $r$ and singular value decomposition $A = U \Sigma V ^T$. The matrix $A ^{(k)} = U ^{(k)}\Sigma ^{(k)}V ^{(k)T}$ of rank $k \leq r$, obtained by zeroing the smallest $r-k$ singular values of $A$, is the best rank-k approximation of $A$.
Equivalently, $A ^{(k)} = \operatorname{argmin}_{B :rank(B )=k}(\|A - B \|_F)$.
Furthermore, $\min_{B :rank(B )=k}(\|A - B \|_F) = \sqrt{\sum_{i=k+1}^r{\sigma_i^2}}$.
```
To get a clearer understanding of this result, we could rewrite $A = U \Sigma V ^T = \sum_i^r\sigma_iu _iv _i^T$ and notice that matrix $A$ is the sum of $r$ matrices $u _iv _i^T$ each scaled by a scalar $\sigma_i$.
In practice, SVD decomposes matrix $A$ in matrices of rank one, ordered by importance according to the magnitude of the singular values: the smaller the $\sigma_i$, the smaller is the contribution that the rank-1 matrix gives to the reconstruction of $A$.
When the smallest singular values are set to 0, we still reconstruct a big part of the original matrix, and in practical cases, we will see that matrices can be approximated with a relatively small number of singular values.
Unfortunately though, calculating the singular vectors and singular values of a matrix is a computationally intensive task. Indeed, for a matrix $A \in \mathbb{R}^{n\times m}$ the cost of the exact SVD is $O\left( min(n^2m, nm^2) \right)$. Recently, approximate methods have been developed that compute the Eckart-Young-Mirsky approximation of a matrix in
time $O(knm)$, where $k$ is the rank of the output matrix [@partridge1997fast], or in times that scale super-linearly in the desired rank and one dimension of the input matrix [@bhojanapalli2014tighter].
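To make the Eckart-Young-Mirsky statement concrete, the sketch below (on an arbitrary random matrix and an arbitrary target rank) builds the rank-$k$ truncation from the SVD and checks that its Frobenius error equals $\sqrt{\sum_{i>k}\sigma_i^2}$.
```{r}
set.seed(5)
A <- matrix(rnorm(50 * 30), 50, 30)
s <- svd(A)

k  <- 5                                                  # target rank, chosen arbitrarily
Ak <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])    # best rank-k approximation

norm(A - Ak, type = "F")        # Frobenius error of the truncation ...
sqrt(sum(s$d[-(1:k)]^2))        # ... equals sqrt of the sum of the discarded sigma_i^2
```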
## Useful theorems around linear algebra
- [Gershgorin circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem)
- [Perron-Frobenius theorem](https://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem)
- [Sherman-Morrison formula](https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula)
## Inequalities
[From here](https://en.wikipedia.org/wiki/Bernoulli%27s_inequality#Related_inequalities).
```{theorem, bernoulli, name="Bernoulli inequalities"}
- Bernoulli inequality: $\forall n \in \mathbb{N}$ and $x\geq -1$, $$(1+x)^n \geq 1+nx$$ (Reader, check what happens if $n$ is even or odd!)
- Generalized Bernoulli inequality: for every real $r \geq 1$ or $r \leq 0$, and $x\geq -1$, $$(1+x)^r \geq 1+rx.$$
For $0 \leq r\leq 1$, $$(1+x)^r \leq 1+rx$$
- Related inequality: for any $x \in \mathbb{R}$ and $r>0$: $$(1+x)^r \leq e^{rx}$$
```
```{theorem, jensen, name="Jensen inequality"}
Let $X$ be a random variable with $\mathbb{E}|X| < \infty$ and $g : \mathbb{R}\to \mathbb{R}$ a real continuous convex function. Then
$$g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].$$
```
A [mnemonic trick](https://math.stackexchange.com/questions/2364116/how-to-remember-which-function-is-concave-and-which-one-is-convex) to remember the difference between convex and concave. Conv*EX* ends with *EX*, as the word *EX*ponential, [which is convex](https://math.stackexchange.com/questions/702241/how-to-prove-that-ex-is-convex).
```{theorem, ljapunov, name="Ljapunov's inequality"}
For any real $0 \leq p \leq q$ and any real random variable, $\|X\|_p \leq \|X\|_q$.
```
```{theorem, holder, name="Hölder's inequality"}
Let $p,q>1$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$. If $\|X\|_p < \infty$ and $\|Y\|_q < \infty$, then
$$\mathbb{E}[|XY|] \leq \|X\|_p \cdot \|Y\|_q.$$
```
```{theorem, minkowski, name="Minkowski's inequality"}
Let $p>1$, $\|X\|_p < \infty$ and $\|Y\|_p < \infty$. Then:
$$\|X+Y\|_p \leq \|X\|_p + \|Y\|_p.$$
```
<!-- add cite to bristol's lecture notes on martingales -->
## Trigonometry
Always have in mind the following Taylor expansion:
```{theorem, taylor-exponential, name="Taylor expansion of exponential function"}
$$e^{x} = \sum_{k=0}^\infty \frac{x^k}{k!}$$
Note that this series is convergent for all $x$
```
From that, it is easy to derive that:
$$e^{\pm ix}= \cos (x) \pm i \sin(x) $$
::: {.theorem #euler name="Nunzio's version of Euler's identity"}
For $\tau = 2\pi$,
$$e^{i\tau} = 1 $$
:::
Because of this, we can rewrite $\sin$ and $\cos$ as:
- $\cos(x) = \frac{e^{ix} + e^{-ix}}{2}$
- $\sin(x)= \frac{e^{ix} - e^{-ix}}{2i}$
Note that we can do something similar to \@ref(thm:taylor-exponential) for matrices. In this case, we *define* the exponential of a matrix via its Taylor expansion:
$$e^A = \sum_{k=0}^\infty \frac{A^k}{k!}$$
The matrix exponential has the following nice properties [@symmetryquantum]:
- $(e^A)^\dagger = e^{A^\dagger}$
- $e^{A\otimes I} = e^A \otimes I$
- if $[A,B] = 0$, then $e^Ae^B = e^{A+B}$.
- $Ue^AU^\dagger = e^{UAU^\dagger}$ for unitary $U$
- $det(e^A)= e^{Tr[A]}$
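As a minimal sketch of the definition above, the function below simply truncates the Taylor series of $e^A$ (adequate for small, well-scaled matrices; a dedicated package such as `expm` should be preferred in practice) and checks the identity $\det(e^A) = e^{Tr[A]}$ numerically.
```{r}
# naive matrix exponential via a truncated Taylor series
mat_exp <- function(A, terms = 30) {
  out  <- diag(nrow(A))          # k = 0 term: the identity
  term <- diag(nrow(A))
  for (k in 1:terms) {
    term <- term %*% A / k       # A^k / k!, built incrementally
    out  <- out + term
  }
  out
}

set.seed(6)
A <- matrix(rnorm(9, sd = 0.3), 3, 3)

det(mat_exp(A))         # det(e^A) ...
exp(sum(diag(A)))       # ... equals e^{Tr[A]}
```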
#### Useful equalities in trigonometry {#subsec:trigonometry}
- $\sin(a)\cos(b) = \frac{1}{2}(\sin(a+b)+\sin(a-b)) = 1/2(\sin(a+b)-\sin(b-a) )$
- $2 \cos x \cos y = \cos (x+y) + \cos (x-y)$
```{exercise}
Derive an expression for $\cos(a+b)$.
```
```{proof}
Recall that $e^{ix} = \cos(x)+i\sin(x)$. Then,
$$e^{i(a+b)} = \cos(a+b)+i\sin(a+b) = e^{ia}e^{ib} = (\cos(a)+i\sin(a))(\cos(b)+i\sin(b))$$
$$= \cos(a)\cos(b) - \sin(a)\sin(b) + i\left(\sin(a)\cos(b) + \cos(a)\sin(b)\right).$$
Equating the real parts gives $\cos(a+b) = \cos(a)\cos(b) - \sin(a)\sin(b)$ (and equating the imaginary parts gives the analogous formula for $\sin(a+b)$).
```
# Series
<!--
# TODO Finish section on series (geometric, telescopic, and so on.. )
# This should be a straight-to-the point section on series, with specific focus on CS.
# These series pops up very often in the analysis of algorithms so having them here is really helpful.
# Ask scinawa the notes he already have on this.
# labels: good first issue, help wanted
-->
<!-- The geometric series is a series with a constant ratio between successive terms. The term $r$ is called the *common ratio* and the term $a$ is the first element of the series. -->
<!-- The sum of the geometric series is a power series for the function $f(x)=\frac{1}{1-x}$. In sum notation we can write $\frac{1}{1-x}=\sum_{n=0}^\infty x^n$. -->
<!-- Note that this equal symbol is valid only under certain conditions, i.e. that $x$ is within the convergence ratio of the series expansion. -->
<!-- ```{theorem} -->
<!-- \textcolor{red}{when $|x|\leq 1$?}. Let $|x|\leq 1$. Then, -->
<!-- $$s= \sum_{k=0}^{n-1} ax^k = a\frac{1-x^{n}}{1-x} $$ -->
<!-- ``` -->
<!-- ```{proof} -->
<!-- \begin{itemize} -->
<!-- \item Start: -->
<!-- $$s = a+ax+ax^2\dots ax^{n} $$ -->
<!-- \item Multiply by $x$ on both sides: -->
<!-- $$xs = ax+ax^2 \dots ax^{n+1} $$ -->
<!-- \item Subtract $xs$ from $s$: -->
<!-- $$s - xs = a- ax^n $$ -->
<!-- $$ s(1-x) = (a-ax^n) $$ -->
<!-- \item Conclude that: -->
<!-- $$ s = \frac{(a-ax^n)}{(1-x)} = a(\frac{1-x^n}{1-x})$$ -->
<!-- \end{itemize} -->
<!-- ``` -->
<!-- \textbf{Remark} -->
<!-- If $n \to \infty$ than $x^n = 0$ and ... we have the most common version of -->
<!-- $$\frac{1}{1-x} $$ -->
<!-- Also, TOADD: $\frac{1}{x-a} = ?$ -->
<!-- <!-- Perhaps include the beautiful image of wikipedia of the geometric series! -->
<!-- So, if we wanted to find the Taylor series of $\frac{1}{x}$ we would only need to find some way of representing the new function via the old one. This can be done by changing $x$ to $(1-x)$ in the sum. So our new series is $\frac{1}{1-(1-x)}=\frac{1}{x}=\sum_{n=0}^\infty (1-x)^n$. -->
<!-- - $\frac{1}{x} = \sum_n^\infty (-1)^n (-1 + x)^n$ -->
<!-- - $\frac{1}{x-1} = -1 -x -x^2 -x^3 = \sum_n^\infty -x^n$ -->
<!-- - $\frac{1}{x+1} = 1 -x + x^2 - x^3 + x^4 =$ -->
<!-- - $\frac{1}{1-x} = \sum_n x^n$ -->
<!-- - $\frac{1}{1+x} = 1 - x + x^2 \dots = t\sum_n (-1)^n x^n$ -->
<!-- #### Telescopic series -->
# Probability
## Measure theory
```{definition, sigma-algebra, name="Sigma algebra"}
Let $\Omega$ be a set, and $\Sigma$ be a subset of the power set of $\Omega$ (or equivalently, a collection of subsets of $\Omega$). Then $\Sigma$ is a $\sigma$-algebra if:
- $\emptyset\in \Sigma$,
- $\Sigma$ is closed under countable union,
- $\forall S \in \Sigma, \overline{S} \in \Sigma$.
```
Observe that, thanks to de Morgan's laws, we can equivalently define the $\sigma$-algebra to be closed under countable intersection. It is also common to conflate $\Sigma$ and $(\Omega, \Sigma)$, and call both a $\sigma$-algebra.
```{definition, measurable-space, name="Measurable space"}
Let $\Omega$ be a set, and $\Sigma$ a $\sigma$-algebra. The tuple $(\Omega, \Sigma)$ is a measurable space (or Borel space).
```
```{definition, measurable-function, name="Measurable function"}
Let $(\Omega, \Sigma)$ and $(Y, T)$ be two measurable spaces. A function $f : \Omega \mapsto Y$ is said to be measurable if for every $E \in T$:
$$f^{-1}(E):=\{x \in \Omega | f(x) \in E \} \in \Sigma$$
```
A measurable function is a function between the underlying sets of two measurable spaces that preserves the structure of the spaces: the preimage of any measurable set is measurable. This is in [direct analogy](https://en.wikipedia.org/wiki/Measurable_function) to the definition that a continuous function between topological spaces preserves the topological structure: the preimage of any open set is open.
<!-- https://ece.iisc.ac.in/~parimal/2015/proofs/lecture-17.pdf -->
```{definition, continious-function, name="Continuous function"}
Let $(X, \mathbb{X})$ and $(Y, \mathbb{Y})$ be two topological spaces. A function $f : X \mapsto Y$ between the two topological spaces is said to be continuous if the inverse image of every open subset of $Y$ is open in $X$. In other words, if $V \in \mathbb{Y}$, then its inverse image $f^{-1}(V) \in \mathbb{X}$.
```
```{definition, measure-space, name="Measure space"}
The tuple $(\Omega, \Sigma, \mu)$ is a measure space if:
- $(\Omega, \Sigma)$ is a measurable space.
- $\mu$ is a measure on $(\Omega, \Sigma)$, that is:
- $\mu : \Sigma \mapsto \mathbb{R}\cup\{+\infty\}$
- non-negativity: $\mu(E) \geq 0 \; \forall E \in \Sigma$
- null empty set: $\mu(\emptyset )= 0$
- countable additivity (or $\sigma$-additivity): for all countable collections $\{E_k \}_{k=1}^\infty$ of pairwise disjoint sets in $\Sigma$,
$$\mu \left(\cup_{k=1}^\infty E_k\right) = \sum_{k=1}^\infty \mu(E_k)$$
```
```{definition, probability-space, name="Probability space"}
The tuple $(\Omega, \Sigma, \mathbb{P})$ is a probability space if:
- $(\Omega, \Sigma)$ is a measurable space ($\Omega$ is the set of *outcomes* of the experiment, and $\Sigma$ is the set of *events*)
- $\mathbb{P}$ is a measure on $(\Omega, \Sigma)$ such that:
- $\mathbb{P} : \Sigma \mapsto [0,1]$
- $\mathbb{P}(\emptyset) = 0$ (null empty set)
- $\mathbb{P}$ is countably additive
- $\mathbb{P}(\Omega)=1$
```
I.e., a probability space is a measure space where the measure of the whole set $\Omega$ is $1$.
```{definition, name="Complete probability space"}
A probability space $(\Omega, \Sigma, \mathbb{P})$ is complete if, for every $B \in \Sigma$ such that $\mathbb{P}(B)=0$ and every $A \subset B$, we have $A \in \Sigma$.
```
```{definition, equivalence-prob-measure, name="Equivalence between probability measures"}
Let $(\Omega, \Sigma, \mathbb{P})$ and $(\Omega, \Sigma, \mathbb{Q})$ be two probability spaces with the same $\Omega$ and $\Sigma$. We say that $\mathbb{P}$ and $\mathbb{Q}$ are equivalent iff for every $A \in \Sigma$, $\mathbb{P}(A)=0 \Leftrightarrow \mathbb{Q}(A)=0$.
```
It basically means that the two measures agree on the possible and impossible events (even if it is pretty strange to call them equivalent).
```{definition, random-variable, name="Random variable"}
A (real-valued) random variable on a probability space $(\Omega, \Sigma, \mathbb{P})$ is a measurable function $X: \Omega \mapsto \mathbb{R}$.
```
<!-- - Joint probability of two events $P(a \cap b) = P(a,b) = P(a|b|)P(a)$ -->
<!-- - Marginal probability $p(a) = \sum_b p(a,b) = \sum_b p(a|b)p(b)$ -->
<!-- - Union of two events $p(a \cup b)$ -->
<!-- - Sum rule -->
<!-- - Rule of total probability -->
<!-- - Conditional probability -->
<!-- - Bayes' Theorem: $$ p(A=a | B=b) = \frac{p(A=a|B=b)}{p(B=b)} = \frac{p(A=a)p(B=b|A=a)}{\sum_{a'} p(A=a')p(B=b|A=a')}$$ -->
Remember [that](https://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/), for a list of numbers $x_1, \dots, x_n$,
- The mode can be defined as $\arg\min_x \sum_i |x_i - x|^0$
- The median can be defined as $\arg\min_x \sum_i |x_i - x|^1$
- The mean can be defined as $\arg\min_x \sum_i |x_i - x|^2$.
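A quick numerical check of the median ($p=1$) and mean ($p=2$) characterizations with base R's `optimize()`, on an arbitrary skewed sample (the mode case needs the $0^0=0$ convention and is omitted); the optimizer returns the minimizers up to its tolerance.
```{r}
set.seed(7)
x <- rexp(101)    # an arbitrary, skewed sample of odd size

optimize(function(m) sum(abs(x - m)), range(x))$minimum   # argmin of sum |x_i - m| ...
median(x)                                                 # ... is (approximately) the median

optimize(function(m) sum((x - m)^2), range(x))$minimum    # argmin of sum (x_i - m)^2 ...
mean(x)                                                   # ... is (approximately) the mean
```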
#### Union bound
The union bound states that the probability of the union of a finite or countable set of events (i.e. the probability that at least one of them happens) is less than or equal to the sum of the probabilities of the events.
<span id="unionbound"> </span>
```{theorem, unionbound, name="Union bound"}
$\forall$ events $A_1 \dots A_n \in \Sigma$:
$$P(\cup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$$
```
<!-- ```{proof} -->
<!-- It would be nice to have a proof by induction here, which is the standard proof for showing the union bound, based on the axioms of probability theory. -->
<!-- ``` -->
```{exercise}
Consider Erdős–Rényi graphs $G(n,p)$ (that is, graphs with $n$ nodes where each pair of nodes is connected by an edge with probability $p$, independently). We define the event $B_n$ as the event that a graph $G(n,p)$ has at least one isolated node. Show that $P(B_n) \leq n(1-p)^{n-1}$.
```
```{proof}
Let $A_i$, $i\in[n]$, be the event that node $i$ is isolated. Its probability, from the definition of $G(n,p)$, is $(1-p)^{n-1}$, because node $i$ could be connected, each time with probability $p$, to each of the other $n-1$ nodes. From this, applying the union bound directly, we obtain an upper bound on the probability that there is at least one isolated node in the graph:
$$P(B_n) = P(\cup_{i=1}^n A_i) \leq \sum_i P(A_i) = n(1-p)^{n-1}$$
```
```{exercise, example-algorithm-unionbound}
Suppose we run a randomized algorithm with success probability $1-\delta$ four times. Can you bound the probability that we never fail using the union bound?
```
```{proof}
Let $f_i$ be the event that we fail when running our algorithm for the $i$-th time.
We know that the failure probability $P(f_i)$ is $\delta$ for all $i \in [4]$. Thanks to the union bound we can bound the probability that we fail at least once:
$P(\cup_{i=1}^4 f_i ) \leq \sum_{i=1}^4 \delta = 4\delta$.
It follows that the probability of having 4 successes in a row is *lower* bounded by $1-4\delta$.
Note that we could have also bypassed the union bound and computed this quantity analytically, as the probability of getting 4 successes in a row is $(1-\delta)^4$, which we can expand with the binomial theorem \@ref(thm:binomial-theorem).
```
<!-- Maybe we can be more precise in this exercise, by saying the interpretation of the values of difference between the analytic formula and the bound. Also we should say that the union bound might give "probabilities" bigger than one (i.e. if $\delta < 1/4$ in this case). Overall, I'm not satisfied with the level of clarity of this example, but i think it's nice to have. -->
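A two-line numerical comparison of the union-bound estimate with the exact failure probability, for an illustrative value of $\delta$:
```{r}
delta <- 0.01
4 * delta            # union bound on the probability of at least one failure in 4 runs
1 - (1 - delta)^4    # exact probability of at least one failure
```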
<span id="variance"> </span>
::: {.definition #variance name="Variance"}
\begin{align}
\operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\[4pt]
&= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\[4pt]
&= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\[4pt]
&= \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2
\end{align}
:::
```{exercise}
How can we express the variance as expectation of quantum states? What quantum algorithm might we run to estimate the variance of a random variable $M$?
$$\braket{\psi|M^2|\psi} - (\braket{\psi|M|\psi})^2 $$
Discuss.
```
```{definition, expofamily, name="Exponential Family [@murphy2012machine]"}
A probability density function or probability mass function $p(v|\nu)$ for $v = (v_1, \cdots, v_m) \in \mathcal{V}^m$, where $\mathcal{V} \subseteq \mathbb{R}$, parameterized by $\nu \in \mathbb{R}^p$, is said to be in the exponential family if it can be written as:
$$p(v|\nu) := h(v)\exp \{ o(\nu)^TT(v) - A(\nu) \}$$
where:
- $\nu \in \mathbb{R}^p$ is called the \emph{canonical or natural} parameter of the family,
- $o(\nu)$ is a function of $\nu$ (which often is just the identity function),
- $T(v)$ is the vector of sufficient statistics: a function that holds all the information the data $v$ holds with respect to the unknown parameters,
- $A(\nu)$ is the cumulant generating function, or log-partition function, which acts as a normalization factor,
- $h(v) > 0$ is the \emph{base measure}, which is a non-informative prior and is de facto a scaling constant.
```
#### Bias-variance tradeoff
[Here](http://fourier.eng.hmc.edu/e176/lectures/probability/node9.html) is a nice reference to understand the bias-variance tradeoff
### Boosting probabilities with the "median lemma" (or powering lemma)
In this section we discuss the following result, which is widely known in computer science. It is used not only in writing algorithms, but also in complexity theory.
```{lemma, powering-lemma, name="Powering lemma [@jerrum1986random]"}
Let $\mathcal{A}$ be a quantum or classical algorithm which aims to estimate some quantity $\mu$, and whose output $\widetilde{\mu}$ satisfies $|\mu-\widetilde{\mu} |\leq \epsilon$ except with probability $\gamma$, for some fixed $\gamma < 1/2$. Then, for any $\delta > 0$ it suffices to repeat $\mathcal{A}$ $O(\log 1/\delta)$ times and take the median to obtain an estimate which is accurate to within $\epsilon$ with probability at least $1-\delta$.
```
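The simulation below is a minimal sketch of the powering lemma with made-up numbers: a toy estimator that is $\epsilon$-accurate only with probability $0.75$ is repeated, and the median of the repetitions is $\epsilon$-accurate with a much smaller failure probability than a single run.
```{r}
set.seed(9)
mu  <- 1
eps <- 0.1
# toy estimator: within eps of mu with probability 0.75, far off otherwise
estimate_once <- function() {
  if (runif(1) < 0.75) mu + runif(1, -eps, eps) else mu + sample(c(-1, 1), 1)
}

single       <- replicate(10000, estimate_once())
median_of_15 <- replicate(10000, median(replicate(15, estimate_once())))

mean(abs(single - mu) > eps)         # failure probability of a single run (about 0.25)
mean(abs(median_of_15 - mu) > eps)   # failure probability of the median of 15 runs (much smaller)
```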
<!-- Suppose the following Lemma: -->
<!-- ```{theorem} -->
<!-- There exist an algorithm $A$ output $\overline{\mu}$ that estimates $\mu$ with probability $1/16$ with error $\epsilon m$. That is: -->
<!-- $$P[ |\mu - \overline{\mu} | \leq m \epsilon ] \geq \frac{1}{16} $$ -->
<!-- ``` -->
<!-- Using the median trick, we can boost the probablity of success like this: -->
<!-- ```{theorem, name="[@betterupperboundSDP"} -->
<!-- There exist an algorithm $A'$ that estimates $\mu$ with probability $1-\delta$ with error $\epsilon m$. It is obtained by repeating $O(\log(1/\delta)$ the algorithm $A$. -->
<!-- ``` -->
<!-- ``` -->
<!-- Let's pick the median of $K=O(\log(1/\delta)$ repetitions. Let's $F_i$ for $i \in [K]$ the output of previous algorithm. Let $z_K$ denote the median. -->
<!-- $$Pr(|z_k - \mu| \geq \epsilon m) = Pr(z_k - \geq \epsilon m + \mu ) + Pr(z_k \leq \mu - \epsilon m )$$ -->
<!-- We can upper bound the first term as: \textcolor{red}{detail better why first passage} -->
<!-- \begin{align*} -->
<!-- (*) &\leq \sum_{I \subseteq [K]: |I| \geq K/2} \prod_{i \in I} \mathrm{Pr}\big( (F)_i \geq \epsilon m + \mu \big) \\ -->
<!-- &\leq (|\{I \subseteq [K]: |I| \geq K/2 \}| ) \left(\frac{1}{16}\right)^{K/2} \\ -->
<!-- &= 2^{K-1}\left(\frac{1}{4}\right)^K \\ -->
<!-- &\leq \frac{1}{2}\textcolor{red}{\left(\frac{1}{2}\right)^{\log_2(1/\delta)}} = \frac{1}{2} \delta. -->
<!-- \end{align*} -->
<!-- Analogously, one can show that $Pr\big( z_K \leq - \epsilon m + \mu \big) \leq \frac{1}{2} \delta$. Hence -->
<!-- \[\mathrm{Pr}\big(|z_K - \mu| \geq \epsilon m\big) \leq \delta. \] -->
<!-- ``` -->
## Markov chains
Useful resources: [here](https://www.probabilitycourse.com/chapter11/11_2_4_classification_of_states.php#:~:text=A%20Markov%20chain%20is%20said%20to%20be%20irreducible%20if%20all,always%20stay%20in%20that%20class.), [here](http://www.columbia.edu/~ww2040/4701Sum07/4701-06-Notes-MCII.pdf).
(ref:serfozo2009basics) [@serfozo2009basics]
```{definition, markov-chain, name="Markov chain (ref:serfozo2009basics)"}
Let $(X_t)_{t \in I}$ be a stochastic process defined over a probability space $(\Omega, \Sigma, \mathbb{P})$, for a countable set $I$, where $X_t$ are random variables on a set $\mathcal{S}$ (called state space). Then $(X_t)_{t \in I}$ is a Markov chain if, for any $j \in \mathcal{S}$ and $t \geq 0$, it holds that
$$\mathbb{P}[X_{t+1} = j | X_0, X_1, \dots X_t] = \mathbb{P}[X_{t+1} =j | X_t]$$
and for all $j,i \in \mathcal{S}$, it holds that
$$\mathbb{P}[X_{t+1} = j | X_t = i] = p_{ij}$$,
where $p_{ij}$ is the transition probability for the Markov chain to go from state $i$ to state $j$.
```
Less formally, a Markov chain is a stochastic process with the Markov property, for which we can just use a matrix $P$ to identify its transition probability. Most of the time, we will discretize the state space $\mathcal{S}$, so we can label elements of $\mathcal{S}$ with integers $i \in [|\mathcal{S}|]$. This fact will allow us to conflate the (push-forward) measure $\mathcal{P}$ on $\mathcal{S}$ and the matrix $P$.
A state $j$ is said to be *accessible* from $i$ (written as $i \mapsto j$) if $P_{ij}^t > 0$ for some $t$, where $P^t$ is the $t$-th power of the transition matrix $P$. Two states $i,j$ are said to communicate if they are mutually accessible; communication is an equivalence relation between states (relatively simple to prove), and its equivalence classes are called communication classes.
```{definition, irreducible-markov-chain, name="Irreducible Markov chain"}
A Markov chain $(X_t)_{t \in I}$ is irreducible if and only if
- for every pair of states $i,j \in \mathcal{S}$ there exists some integer $t \in I$ such that $p^t_{ij} > 0$, i.e. $P[X_t =j| X_0 = i] > 0$, or equivalently
- there is only one communication class.
The previous conditions are equivalent.
```
In terms of random walks, irreducibility means that: if the graph is undirected, the graph has only one connected component (i.e. it is connected), and if the graph is directed, the graph is strongly connected.
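A small illustration with a made-up three-state transition matrix: the sketch checks a sufficient condition for irreducibility (some power of $P$ has all entries strictly positive) and simulates a short trajectory of the chain.
```{r}
set.seed(10)
P <- matrix(c(0.0, 1.0, 0.0,     # a 3-state transition matrix; rows sum to 1
              0.5, 0.0, 0.5,
              0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)

P3 <- P %*% P %*% P
all(P3 > 0)                      # TRUE: every state can reach every other state

state <- 1                       # simulate 20 steps of the chain starting from state 1
trajectory <- integer(20)
for (t in 1:20) {
  state <- sample(1:3, size = 1, prob = P[state, ])
  trajectory[t] <- state
}
trajectory
```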
## Distributions
[This](http://web.mit.edu/urban_or_book/www/book/chapter7/7.1.3.html) is a beautiful guide that shows you how to draw samples from a probability distribution.
- [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)
## Concentration inequalities
Take a look at [this](https://www.youtube.com/watch?v=Rd8LQbXhWvM) and [this](http://www.stat.rice.edu/~jrojo/PASI/lectures/TyronCMarticle.pdf). Also recall that the Union bound was presented in the section dedicated for probability theory, i.e. Theorem \@ref(thm:unionbound).
### Markov inequality
The Markov inequality is an *upper bound on the probability that a
non-negative function of a random variable is greater than or
equal to a positive constant*. Especially in analysis, people refer to it
as Chebyshev's inequality (sometimes calling it the *first* Chebyshev
inequality, while referring to the "usual" [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality) as the second Chebyshev inequality, or the Bienaymé–Chebyshev inequality).
```{theorem, markov, name="Markov inequality"}
For any non-negative random variable $X$ and $a > 0$, we have that:
- $Pr(X \geq a) \leq \frac{E[X]}{a}$
- $Pr(X \geq aE[X]) \leq \frac{1}{a}$
```
```{proof}
Observe that:
$$E[X] = P(X < a) \cdot E[X|X<a] + P(X \geq a) \cdot E[X|X\geq a]$$
As both of these conditional expectations are non-negative (using the non-negativity hypothesis), we have that
$$E[X] \geq P(X \geq a) \cdot E[X|X\geq a] $$
Now it is easy to observe that $E[X|X\geq a]$ is at least $a$, and by rearranging we obtain:
$$ \frac{E[X]}{a} \geq P(X \geq a) $$
The second statement of the theorem follows by substitution, i.e. setting $b=aE[X]$ and using the previous statement on $Pr(X \geq b)$.
```
A very useful corollary of Markov inequality is the following.
```{theorem, coro-markov, name="Corollary of Markov inequality"}
Let $f$ be a non-negative, monotone non-decreasing function on a space $I$, and let $Y$ be a random variable taking values in $I$. Then, for every $b$ with $f(b)>0$:
$$ P(Y \geq b) \leq \frac{E[f(Y)]}{f(b)} $$
```
### Chebyshev inequality
This inequality tells us that the probability of finding the random
variable $X$ far away from the mean $\mathbb{E}[X]$ is bounded by the variance of
$X$.
```{theorem, chebyshev, name="Chebyshev inequality"}
Let $X$ be a random variable with finite mean $\mu$ and finite variance $\sigma^2$. For $\epsilon > 0$:
$$Pr[|X - \mathbb{E}[X]| \geq \epsilon]\leq \frac{\sigma^2}{\epsilon^2}$$
Moreover, if $k = \epsilon /\sigma$ we can replace $\epsilon$ with $k\sigma$ and obtain:
$$Pr[|X - \mathbb{E}[X]| \geq k\sigma] \leq \frac{1}{k^2}$$
```
```{proof}
Observe that $(X-\mu)^2$ is a non-negative random variable, and that
$$P(|X-\mu| \geq \epsilon) = P((X-\mu)^2 \geq \epsilon^2).$$
Since $(X-\mu)^2$ is non-negative, we can apply the Markov inequality to get:
$$P(|X-\mu| \geq \epsilon) = P((X-\mu)^2 \geq \epsilon^2) \leq \frac{E[(X-\mu)^2]}{\epsilon^2} = \frac{Var(X)}{\epsilon^2}$$
```
<!-- # check if linearly smaller is correct/not confusing here-->
It is very useful to see what happens when we define a new random
variable $Y$ as the sample mean of $n$ independent and identically distributed
(i.i.d.) random variables $X_1 \dots X_n$:
$Y= \frac{1}{n}\sum_i^n X_i$. The expected value of $Y$ is the same as
the expected value of each $X_i$, but the variance is now linearly smaller in
the number of samples:
$$E[Y] = \frac{1}{n} \sum_i^n E[X_i] = \mathbb{E}[X_i] \text{ for any } i$$
$$Var[Y] = \frac{1}{n^2} \sum_i^n \text{Var}[X_i] = \frac{Var[X_i]}{n} \text{ for any } i$$
This allows us to obtain the following bound:
:::{.theorem #chebyshev-mean name="Chebyshev inequality for sample mean"}
Let $Y= \frac{1}{n}\sum_i^n X_i$. Then,
$$Pr[|Y - E[Y]| \geq \epsilon]\leq \frac{\sigma^2}{n\epsilon^2}$$
:::
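A quick simulation, with uniform random variables as an arbitrary example, showing that the variance of the sample mean shrinks as $1/n$, which is what makes the bound of Theorem \@ref(thm:chebyshev-mean) useful:
```{r}
set.seed(11)
single   <- runif(10000)                            # one X_i per experiment
averaged <- replicate(10000, mean(runif(100)))      # Y = sample mean of n = 100 copies

var(single)       # variance of a single uniform variable (about 1/12)
var(averaged)     # variance of the sample mean (about 1/12 divided by 100)
```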
### Weak Law of large numbers
```{theorem, wlln, name="Weak Law of large numbers"}
Let $X_1, X_2, \dots, X_n$ be i.i.d random variables with a finite expected value $\mathbb{E}[X_i]=\mu < \infty$, and let $\overline{X}$ be the average $\frac{1}{n}\sum_i^n X_i$. Then, for any $\epsilon > 0$, we have that:
$$\lim_{n\to +\infty} P\left( |\overline{X} - \mu | \geq \epsilon \right) = 0$$
```
```{proof}
We know that $E[\overline{X}] = \mu$ and, assuming the $X_i$ also have finite variance $\sigma^2$, that $Var(\overline{X}) = \frac{\sigma^2}{n}$. By the Chebyshev inequality for the sample mean (Theorem \@ref(thm:chebyshev-mean)):
$$P(|\overline{X}-\mu| > \epsilon) \leq \frac{Var(\overline{X})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$
Trivially, $\lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0$, concluding the proof.
```
### Strong Law of Large Numbers
```{theorem, slln, name="(Strong) Law of large numbers"}
Let $X_1,X_2,X_3,\dots,X_n$ be i.i.d random variables with mean $\mu$. Let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then, $\overline{X}_n$ converges almost surely to $\mu$:
$$P(\lim_{n \to \infty}\overline{X}_n = \mu) = 1$$
```
:::{.remark}
The SLLN implies WLLN but not vice-versa.
:::
### Chernoff bound
From [here](https://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf) and [here](http://www.stat.cmu.edu/~arinaldo/Teaching/36709/S19/Scribed_Lectures/Jan29_Tudor.pdf), [here](https://polynomiallybounded.wordpress.com/2017/05/23/how-i-remember-the-chernoff-bound/), and [here](https://www.probabilitycourse.com/chapter6/6_2_3_chernoff_bounds.php)
We focus on a restricted class of random variables, i.e. the case when our random variable is obtained as the sum of *independent* random variables. The central limit theorem says that, as $n \to \infty$, the distribution of $\frac{X-\mu}{\sigma}$ approaches the standard normal distribution $N(0,1)$. However, it does not give any information on the rate of convergence.
:::{.theorem #chernoff-bound name="Chernoff bound"}
Let $X=\sum_i^n X_i$ where $X_i =1$ with probability $p_i$ and $X_i=0$ with probability $(1-p_i)$, and all $X_i$ are independent. Let $\mu=E[X] = \sum_i^n p_i$. Then:
- *Upper tail*: $P(X \geq (1+\delta)\mu) \leq e^{-\frac{\delta^2}{2+\delta}\mu}$ for all $\delta > 0$
- *Lower tail*: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for all $0 \leq \delta \leq 1$
:::
You can find a nice proof [here](https://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf).
<!-- ```{proof} -->
<!-- If $X = \sum_i^n X_i$ where $X_1,X_2, \dots, X_n$ are i.i.d variables, then since the MGF (Moment Generating Function) of the (independent) sum equals the product of the MGFs. Taking our general result from above and using this fact, we get: -->
<!-- \begin{equation} -->
<!-- P(X \geq k)\leq\min_{t>0}\frac{M_x(t)}{e^{tk}}=\min_{t>0}\frac{\prod_{i=1}^n M_{x_i}(t)}{e^{tk}} -->
<!-- \end{equation} -->
<!-- Let’s derive a Chernoff bound for $X∼Bin(n, p)$, which has the form $P(X \geq (1+\delta)\mu)$ for $\delta > 0$. For example with $\delta = 4$ you may want to bound $P(X \geq 5*E[X])$. -->
<!-- Recall $X = \sum_i^n X_i$ where $X_i ~ Ber(p)$ are i.i.d, with $\mu = E[X] = np$. -->
<!-- \begin{align} -->
<!-- M_{X_i(t)} =& E[e^tX_i] \\ -->
<!-- =& e^{t \dot 1}px_i(1)+e^{t \dot 0}px_i(0) \\ -->
<!-- =& pe^t + 1(1-p) \\ -->
<!-- =& 1 + p(e^t - 1) \\ -->
<!-- \leq& e^{p(e^t - 1)} -->
<!-- \end{align} -->
<!-- Now using the result from earlier and plugging in the MGF for the $Ber(p)$ distribution, we get: -->
<!-- \begin{align} -->
<!-- P(X \geq k) \leq& \min_{t>0}\frac{\prod_i^n M_{X_i(t)}}{e^{tk}} \\ -->
<!-- =& \min_{t>0}\frac{(e^{p(e^t-1)})^n}{e^{tk}}\\ -->
<!-- =& \min_{t>0}\frac{e^{np(e^t-1)}}{e^{tk}} \\ -->
<!-- =& \min_{t>0}\frac{e^{\mu(e^t-1)}}{e^{tk}} -->
<!-- \end{align} -->
<!-- For our bound, we want something like $P(X \geq (1+\delta)\mu)$, so our $k = (1+\delta)\mu$. To minimize the RHS and get the tightest bound, the best bound we get is by choosing $t=ln(1+\delta)$ after some terrible algebra (take the derivative and set to 0). We simply plug in k and our optimal value of t to the above equation: -->
<!-- \begin{align} -->
<!-- P(X \geq (1+\delta)\mu)\leq &\frac{e^(\mu(e^ln(1+\delta)-1))}{e^(1+\delta)\mu \ln(1+\delta)} = \frac{e^\mu((1+\delta)-1)}{(e^(ln(1+\delta))^(1+\delta)\mu} \\ -->
<!-- = &\frac{e^(\delta\mu)}{(1+\delta)^(1+\delta)\mu} = (\frac{e^\delta}{(1+\delta)^(1+\delta)})^\mu -->
<!-- \end{align} -->
<!-- Again, we wanted to choose t that minimizes our upper bound for the tail probability. Taking the derivative with respect to t tells us we should plug in $t=ln(1+\delta)$ to minimize that quantity. This would actually be pretty annoying to plug into a calculator. We actually can show that the final $RHS$ is $\leq exp(\frac{-\delta^2\mu}{2+\delta})$ with some more messy algebra. Additionally, if we restrict $0 < \delta < 1$, we can simplify this even more to the bound provided earlier: -->
<!-- $$P(X \geq (1+\delta)\mu) \leq exp(\frac{-\delta^2\mu}{3})$$ -->
<!-- The proof of the lower tail is entirely analogous, except optimizing over $t < 0$ when the inequality flips. It proceeds by taking $t=ln(1-\delta).$ -->
<!-- We also get a lower tail bound: -->
<!-- $$P(X \leq (1-\delta)\mu)\leq \left(\frac{e^-\delta}{(1-\delta)^(1-\delta)} \right)\leq \left(\frac{e^-\delta}{e^-(\delta+\frac{\delta^2}{2})}\right)= \exp(\frac{-\delta^2\mu}{2})$$ -->
<!-- ``` -->
```{theorem, chernoff-bound2, name="Chernoff bound"}
Suppose $X_1, \dots, X_t$ are independent random variables taking values in $\{0,1\}$. Let $M_t= (X_1 + \dots + X_t)/t$ denote their average value and let $\mu = \mathbb{E}[M_t]$. Then for any $0 < \epsilon < 1$,
- (Multiplicative) $Pr[M_t - \mu \leq -\epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{2}}$ and $Pr[M_t - \mu \geq \epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{3}}$
- (Additive) $Pr[M_t - \mu \leq -\epsilon ] \leq e^{-2t\epsilon^2}$ and $Pr[M_t - \mu \geq \epsilon ] \leq e^{-2t\epsilon^2}$
```
:::{.remark}
*Trick:* if our random variables are not between $0$ and $1$, we can define $Y_i=X_i/max(X_i)$
:::
### Hoeffding inequality
```{lemma, Hoeffding, name="Hoeffding inequality"}
Let $X_1,\ldots,X_k$ be independent random variables bounded by the interval $[a, b]$. Define the empirical mean of these variables by $\overline{X} = \frac{1}{k} (X_1+\cdots+X_k)$, then
$$Pr(|\overline{X} - \mathbb{E}[X]|\leq \epsilon) \geq 1-2
\exp\left(-\frac{2k\epsilon^2}{(b-a)^2} \right).$$
Consequently, if $k\geq \frac{(b-a)^2}{2\epsilon^{2}}\log(2/\eta)$, then
$\overline{X}$ provides an $\epsilon$-approximation of $\mathbb{E}[X]$ with probability at least $1-\eta$.
```
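To see the inequality at work, the short simulation below (with arbitrary parameters) compares the empirical probability that the sample mean of $k$ Bernoulli variables deviates from its expectation by more than $\epsilon$ with the Hoeffding bound $2\exp(-2k\epsilon^2)$ for $a=0$, $b=1$.
```{r}
set.seed(12)
k <- 200; p <- 0.3; eps <- 0.05

deviations <- replicate(20000, abs(mean(rbinom(k, 1, p)) - p) > eps)
mean(deviations)            # empirical probability of an eps-deviation
2 * exp(-2 * k * eps^2)     # Hoeffding bound: larger, as it should be
```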