# (APPENDIX) Appendix {-}
# Math and linear algebra
## Norms, distances, trace, inequalities
Take a look at [this](https://www.math.usm.edu/lambers/mat610/sum10/lecture2.pdf), and [this](http://www-math.mit.edu/~djk/calculus_beginners/chapter03/section03.html).
Just remember that for a complex number $z = a+ib$, $z\overline{z} = a^2 + b^2= |z|^2$, that $\frac{1}{z} = \frac{\overline{z}}{\overline{z}z}$, and that conjugation is the mapping $z = (a+ib) \mapsto \overline{z} = (a-ib)$.
```{definition, name="Inner product"}
A function $f: V \times V \mapsto \mathbb{C}$, $(a,b) \mapsto z$, is called an inner product if:
- $(a,b+c) = (a,b)+(a,c)$ (and respectively for $(a+c,b)$ )
- $(a,\alpha b) = \alpha (a,b)$
- $(a,b) = \overline{(b,a)}$ (conjugate symmetry)
- $(a,a) \geq 0$, with $(a,a)=0$ iff $a=0$ (positive definiteness)
```
```{definition, name="Norm"}
A function $f : \mathbb{R}^d \mapsto \mathbb{R}$ is called a norm if:
- $\|x\| \geq 0 \; \forall x \in \mathbb{R}^{d}$, and $\|x\|=0$ iff $x=0$ (positive definiteness)
- $\|\alpha x\| = |\alpha| \|x\| \; \forall x \in \mathbb{R}^{d}$ and $\forall \alpha \in \mathbb{R}$ (absolute homogeneity)
- $\|x\|-\|y\| \leq \|x + y\| \leq \|x\| + \|y\|$ (triangle inequality).
```
Note that along with the triangle inequality you might also need to know the [reverse triangle inequality](https://www.quora.com/What-is-an-intuitive-explanation-of-the-reverse-triangle-inequality).
The triangle inequality is basically the [subadditivity](https://en.wikipedia.org/wiki/Subadditivity) property of the norm. It is simple to see that norms are **not** linear operators.
```{theorem, name="Cauchy(-Bunyakovsky)-Schwarz"}
$$|(x,y)| \leq \|x\| \|y\|$$
```
```{proof}
Note that by taking the square on both sides we get: $(x,y)^2 \leq (x,x)(y,y)$.
Substituting $(x,y)=\|x\| \|y\| \cos(\theta)$, we get:
$$|\|x\|^2\|y\|^2 \cos^2(\theta) | \leq (x,x)(y,y)$$
The inequality follows from noting that $|\cos(\theta)|$ is always $\leq 1$.
```
```{remark}
It is simple to see - using Cauchy-Schwarz - that for a vector $x$ we have that:
$$\|x\|_1 \leq \sqrt{n} \|x\|_2 $$
```
<!-- \paragraph{Matrix norms} -->
<!-- \begin{itemize} -->
<!-- \item All properties of vector norms .. -->
<!-- \item Submultiplicativity -->
<!-- \item -->
<!-- \end{itemize} -->
We will use the following matrix norms:
- $\|A\|_0$ is the number of non-zero elements of the matrix $A$,
- $\|A\|_1 = \max\limits_{1 \leq j \leq n} \sum_{i=1}^m |a_{ij}|$ is the maximum among the sums of the absolute values along the columns of the matrix,
- $\|A\|_2 = \|A\| = \sigma_1$ is the biggest singular value of the matrix,
- $\|A\|_\infty = \max\limits_{1 \leq i \leq m} \sum_{j=1}^n |a_{ij}|$ is the maximum among the sums of the absolute values along the rows of the matrix,
- $\|A\|_{\max}$ is the maximal element of the matrix in absolute value,
- $\|A\|_F$ is the Frobenius norm of the matrix, defined as $\sqrt{\sum_{ij}a_{ij}^2}$.
Note that for symmetric matrices, $\|A\|_\infty = \|A\|_1$.
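These norms are easy to sanity-check numerically. The minimal sketch below, on an arbitrary random matrix, computes each of them with base R's `norm()` or directly from the definition.
```{r}
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)   # an arbitrary 4x3 test matrix

sum(A != 0)            # "zero norm": number of non-zero entries
norm(A, type = "O")    # l1 norm: maximum column sum of absolute values
norm(A, type = "2")    # spectral norm: largest singular value
max(svd(A)$d)          # the same value, computed explicitly from the SVD
norm(A, type = "I")    # l-infinity norm: maximum row sum of absolute values
norm(A, type = "M")    # max norm: largest entry in absolute value
norm(A, type = "F")    # Frobenius norm
sqrt(sum(A^2))         # the same value, from the definition
```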
<!-- from dequantization of qsfa -->
```{exercise, name="bound error on product of matrices"}
Suppose that $\|A - \overline{A}\|_F \leq \epsilon \|A\|_F$.
Bound $\|A^TA - \overline{A}^T\overline{A}\|_F$
```
<!-- ```{proof} -->
<!-- $$A^TA + \overline{A}^TA - \overline{A}^TA - \overline{A}^T\overline{A}=$$ -->
<!-- $$( A^TA + \overline{A}^TA - \overline{A}^TA - \overline{A}^T\overline{A}=$$ -->
<!-- ``` -->
```{definition, distance, name="Distance"}
A function $f : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is called a distance if:
- $d(x,y) \geq 0$
- $d(x,y) = 0$ iff $x=y$
- $d(x,y)=d(y,x)$
- $d(x,z) \leq d(x,y) + d(y,z)$
```
```{definition, convexity-concavity, name="Convex and concave function"}
A function $f$ defined on a convex subset $D$ of a vector space is said to be *concave* if, $\forall \alpha \in [0,1]$, and $\forall x,y \in D$:
$$f\left( (1-\alpha)x + \alpha y \right) \geq (1-\alpha) f(x) + \alpha f(y)$$
Conversely, a function $f$ defined on a convex subset $D$ of a vector space is said to be *convex* if, $\forall \alpha \in [0,1]$, and $\forall x,y \in D$:
$$f\left( (1-\alpha)x + \alpha y \right) \leq (1-\alpha) f(x) + \alpha f(y)$$
```
#### Properties of the trace operator
- $Tr[A+B] = Tr[A] + Tr[B]$
- $Tr[A\otimes B] = Tr[A]Tr[B]$
- $Tr_1[A\otimes B] = Tr[A]B$
- $Tr[\ket{a}\bra{b}] = \braket{b|a}$
- $Tr[A^TB] = \langle A, B \rangle$
where the inner product between matrices is defined entry-wise as $\langle A, B \rangle = \sum_{ij}a_{ij}b_{ij}$
```{exercise}
Can you show that the last identity is true?
```
#### Properties of tensor product
Given two linear maps $A_1 : V_1 \mapsto W_1$ and $A_2 : V_2 \mapsto W_2$ we define their tensor product as the linear map:
$$A_1 \otimes A_2 : V_1 \otimes V_2 \mapsto W_1 \otimes W_2 $$
- $\alpha v \otimes w = v \otimes \alpha w = \alpha(v \otimes w)$
- $( v_1 + v_2) \otimes w = (v_1 \otimes w) + (v_2 \otimes w)$ (and symmetrically in the second argument)
- $\ket{\psi_1}\bra{\phi_1} \otimes \ket{\psi_2}\bra{\phi_2} = (\ket{\psi_1}\otimes\ket{\psi_2})(\bra{\phi_1}\otimes\bra{\phi_2})$
When a basis is chosen for representing linear maps between vector spaces, the tensor product becomes the Kronecker product.
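In R the Kronecker product is available as `kronecker()` (or the `%x%` operator), so the tensor-product and trace identities above can be checked on small random matrices; the following is a minimal sketch.
```{r}
set.seed(2)
A <- matrix(rnorm(9), 3, 3)
B <- matrix(rnorm(9), 3, 3)
tr <- function(M) sum(diag(M))   # trace of a square matrix

tr(A %x% B)                      # Tr[A tensor B] ...
tr(A) * tr(B)                    # ... equals Tr[A] Tr[B]

tr(t(A) %*% B)                   # Tr[A^T B] ...
sum(A * B)                       # ... equals the entry-wise inner product
```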
#### Useful inequalities
```{r, amgm-images, fig.cap="AMGM inequality from wikipedia and twitter", echo=FALSE}
knitr::include_graphics("images/amgm1.png")
knitr::include_graphics("images/amgm2.png")
```
```{theorem, binomial-theorem, name="Binomial theorem"}
$$(a+b)^n = \sum_{k=0}^n {n \choose k} a^kb^{n-k}$$
```
## Linear algebra
### Eigenvalues, eigenvectors and eigendecomposition of a matrix
Real matrices are important tools in Machine Learning, as they allow us to comfortably represent data and to describe the operations to perform during an algorithm. Eigenvectors and eigenvalues are fundamental linear algebra concepts that provide important information about a matrix.
(ref:linearalgebragilbert) [@linearalgebragilbert]
```{definition, name="Eigenvalues and Eigenvectors (Section 6.1 page 289 from (ref:linearalgebragilbert) )"}
Let $A$ be a $\mathbb{R}^{n\times n}$ square matrix, $q \in \mathbb{R}^{n}$ a non-zero vector and $\lambda$ a scalar. If the following equation is satisfied
$$Aq = \lambda q,$$
then $q$ is said to be an eigenvector of matrix $A$ and $\lambda$ is its associated eigenvalue.
```
To have a geometric insight into what eigenvectors and eigenvalues are, we can think of any matrix as a linear transformation in the $\mathbb{R}^n$ space. Under this light, we can say that the eigenvectors of a matrix are those vectors of the space that, after the transformation, lie on their original direction and only get their magnitude scaled by a certain factor: the eigenvalue.
The eigenvalues reveal interesting properties of a matrix. For example, the trace of a matrix (i.e. the sum of the elements along the main diagonal of a square matrix) is the sum of its eigenvalues
$$Tr[A] = \sum_i^{n}\lambda_i,$$
and its determinant is equal to the product of the eigenvalues (Section 6.1 page 294 from [@linearalgebragilbert])
$$det(A) = \prod_i^{n}\lambda_i.$$
Moreover, a matrix $A$ with eigenvalues $\{\lambda_1, ..., \lambda_k\}$ has an inverse if and only if all the eigenvalues are non-zero. The inverse has eigenvalues $\{\frac{1}{\lambda_1}, ..., \frac{1}{\lambda_k}\}$.
Generally, one eigenvalue can be associated with multiple eigenvectors. There might be a set of vectors $E(\lambda) \subseteq \mathbb{R}^n$ such that for all those vectors $q \in E(\lambda): Aq = \lambda q$. That is why for each eigenvalue we talk about an eigenspace.
(ref:scapellato) [@scapellato]
```{definition, eigenspace, name="Eigenspace (Definition 7.1.5 page 108 (ref:scapellato) )"}
Let $A$ be a $\mathbb{R}^{n\times n}$ square matrix and $\lambda$ be an eigenvalue of $A$. The eigenspace of $A$ related to $\lambda$ is the space defined over the set of vectors $E(\lambda) = \{ x: Ax = \lambda x\}$.
```
For each eigenspace, through the Gram-Schmidt procedure, starting from linearly independent vectors it is possible to identify a set of orthogonal eigenvectors that constitute a basis for the space. The basis that spans the space where all the eigenvectors of a matrix lie is called eigenbasis.
```{definition, name="Eigenbasis"}
A basis for the space where all the eigenvectors of a matrix lie is called eigenbasis.
```
An important result is that vectors in different eigenspaces are linearly independent.
```{lemma, name="Linear independence of eigenvectors (Lemma 7.2.3 page 112 from (ref:scapellato) )"}
The set of vectors obtained by the union of the bases of the eigenspaces of a matrix is linearly independent.
```
This means that if the sum of the dimensions of the eigenspaces $\sum_i dim(E(\lambda_i))$ equals $n$, it is possible to find $n$ eigenvectors of $A$ that form a basis for the $\mathbb{R}^n$ space. If that is the case, each vector that lies in $\mathbb{R}^n$ can be written as a linear combination of the eigenvectors of $A$.
Interestingly, matrices that have $n$ linearly independent eigenvectors can be decomposed in terms of their eigenvalues and eigenvectors.
```{theorem, name="Eigendecomposition or Diagonalization"}
\cite[Section 6.2 page 304]{linearalgebragilbert}
Let $A \in \mathbb{R}^{n \times n}$ be a square matrix with $n$ linearly independent eigenvectors. Then, it is possible to decompose the matrix as
$$A = Q\Lambda Q^{-1}.$$
where $Q \in \mathbb{R}^{n\times n}$ is an invertible square matrix and $\Lambda \in \mathbb{R}^{n\times n}$ is a diagonal matrix. In particular, the $i^{th}$ column of $Q$ is an eigenvector of $A$ and the $i^{th}$ entry of $\Lambda$ is its associated eigenvalue.
```
The matrices that can be eigendecomposed are also said to be diagonalizable, as in practice the theorem above states that such matrices are similar to diagonal matrices. Unfortunately, not all square matrices have enough independent eigenvectors to be diagonalized. The Spectral Theorem provides us with a class of matrices that can always be eigendecomposed.
```{theorem, spectral-theorem, name="Spectral theorem"}
\cite[Spectral Theorem page 339]{linearalgebragilbert}
Every symmetric matrix is diagonalizable $A = Q\Lambda Q^{-1}$. Furthermore, its eigenvalues are real and it is possible to choose the columns of $Q$ so that it is an orthogonal matrix.
```
Recall that a matrix $Q$ is said to be orthogonal if $QQ^T=Q^TQ = \mathbb{I}$, therefore $Q^{-1} = Q^T$.
The Spectral theorem, together with the fact that matrices like $A^TA$ and $AA^T$ are symmetric, will come in handy in later discussions.
Being able to eigendecompose a matrix allows performing some computations faster than otherwise. Some examples of operations that gain speed-ups from the eigendecomposed representation are matrix inversion and matrix exponentiation. Indeed, if we have a matrix $A=Q\Lambda Q^{-1}$ its inverse can be computed as $A^{-1}=Q\Lambda^{-1}Q^{-1}$
where $\Lambda^{-1} = diag([\frac{1}{\lambda_1}, ... \frac{1}{\lambda_n}])$. It is easy to check that this matrix is the inverse of $A$:
$$AA^{-1} = (Q\Lambda Q^{-1})(Q\Lambda^{-1}Q^{-1}) = Q\Lambda\Lambda^{-1}Q^{-1} = QQ^{-1} = \mathbb{I}$$
$$A^{-1}A = (Q\Lambda^{-1}Q^{-1})(Q\Lambda Q^{-1}) = Q\Lambda^{-1}\Lambda Q^{-1} = QQ^{-1} = \mathbb{I}.$$
At the same time, the eigendecomposition of a matrix allows performing matrix exponentiation much faster than through the usual matrix multiplication. In fact, it is true that $A^p = Q\Lambda^pQ^{-1}$.
For instance,
$$A^3 = (Q\Lambda Q^{-1})(Q\Lambda Q^{-1})(Q\Lambda Q^{-1}) = Q\Lambda(Q^{-1}Q)\Lambda(Q^{-1}Q)\Lambda Q^{-1} = Q\Lambda\Lambda\Lambda Q^{-1} = Q\Lambda^3Q^{-1}.$$
Computing big matrix powers such as $A^{100}$, with its eigendecomposed representation, only takes two matrix multiplications instead of a hundred.
Traditionally, the computational effort of performing the eigendecomposition of a $\mathbb{R}^{n \times n}$ matrix is in the order of $O(n^3)$ and may become prohibitive for large matrices [@partridge1997fast].
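As an illustration, the sketch below builds an arbitrary symmetric matrix (so that the Spectral theorem applies), computes its eigendecomposition with base R's `eigen()`, reconstructs it, and compares a matrix power obtained through the decomposition with naive repeated multiplication.
```{r}
set.seed(3)
M <- matrix(rnorm(16), 4, 4)
A <- M + t(M)                        # symmetric, hence diagonalizable with orthogonal Q

e <- eigen(A)                        # A = Q Lambda Q^{-1}
Q <- e$vectors
Lambda <- diag(e$values)

max(abs(A - Q %*% Lambda %*% t(Q)))  # reconstruction error (Q orthogonal, so Q^{-1} = Q^T)

A5_eig   <- Q %*% diag(e$values^5) %*% t(Q)   # A^5 via the eigendecomposition
A5_naive <- A %*% A %*% A %*% A %*% A         # A^5 via repeated multiplication
max(abs(A5_eig - A5_naive))
```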
### Singular value decomposition
<!--
# TODO Adjust citations and labels
-->
Eigenvalues and eigenvectors can be computed only for square matrices. Moreover, not all matrices can be eigendecomposed. For this reason, we introduce the concepts of singular values and singular vectors, which are closely related to eigenvalues and eigenvectors, and offer a decomposition for any kind of matrix.
```{theorem, svd, name="Singular Value Decomposition (Sections 7.1, 7.2 from [@linearalgebragilbert])"}
\cite[Sections 7.1, 7.2]{linearalgebragilbert}
Any matrix $A \in \mathbb{R}^{n \times m}$ can be decomposed as
$$A = U\Sigma V^T$$
where $U \in \mathbb{R}^{n\times r}$ and $V \in \mathbb{R}^{m\times r}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix. In particular, the $i^{th}$ columns of $U$ and $V$ are respectively called the left and right singular vectors of $A$, and the $i^{th}$ entry of $\Sigma$ is their associated singular value. Furthermore, $r$ is a natural number no larger than $\min(n, m)$.
```
Another (equivalent) definition of the SVD is the following:
$$A=(U, U_0)\begin{pmatrix}
\Sigma & 0 \\
0 & 0
\end{pmatrix} (V, V_0)^T.$$
The matrix $\Sigma$ is a diagonal matrix with $\Sigma_{ii}=\sigma_i$ being the singular values (which we assume to be sorted $\sigma_1 \geq \dots \geq \sigma_n$).
The matrices $(U, U_0)$ and $(V, V_0)$ are unitary matrices, which contain a basis for the column and the row space (respectively $U$ and $V$) and the left null-space and right null-space (respectively $U_0$ and $V_0$). Oftentimes, it is simpler to define the SVD of a matrix by simply discarding the left and right null spaces, as $A=U\Sigma V^T$, where $U,V$ are orthogonal matrices and
$\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix with real elements, as we did in Theorem \@ref(thm:svd).
Similarly to how eigenvalues and eigenvectors have been defined previously, for each pair of left and right singular vectors and the associated singular value, the following equation holds:
$$Av = \sigma u.$$
If we consider the Singular Value Decomposition (SVD) under a geometric perspective, we can see any linear transformation as the result of a rotation, a scaling, and another rotation. Indeed, if we compute the product between a matrix $A \in \mathbb{R}^{n\times m}$ and a vector $x \in \mathbb{R}^m$
$$Ax = U\Sigma V^Tx = (U(\Sigma (V^Tx))).$$
$U$ and $V^T$, being orthogonal matrices, only rotate the vector without changing its magnitude, while $\Sigma$, being a diagonal matrix, alters its length.
It is interesting to note that the singular values of $A$ - denoted as $\{\sigma_1,..., \sigma_r\}$ - are the square roots $\{\sqrt{\lambda_1},..., \sqrt{\lambda_r}\}$ of the eigenvalues of $AA^T$ (or $A^TA$) and that the left and right singular vectors of $A$ - denoted as $\{u_1, ..., u_r\}$ and $\{v_1, ..., v_r\}$ - are respectively the eigenvectors of $AA^T$ and $A^TA$.
The fact that each matrix can be decomposed in terms of its singular vectors and singular values, as in the theorem above, makes the relationship between singular values - singular vectors of a matrix and eigenvalues - eigenvectors of its products with the transpose clearer:
$$AA^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma V^TV\Sigma U^T = U\Sigma ^2U^T;$$
$$A^TA =
(U\Sigma V^T)^T(U\Sigma V^T) =
V\Sigma U^TU\Sigma V^T = V\Sigma ^2V^T.$$
Note that the matrices $AA^T$ and $A^TA$ are symmetric matrices and so, by the Spectral theorem, we can always find an eigendecomposition. Moreover, note that they have non-negative eigenvalues: being the square roots of real non-negative eigenvalues, the singular values of a real matrix are always real non-negative numbers.
As the left and right singular vectors are eigenvectors of symmetric matrices, they can be chosen to be orthogonal as well. In particular, the left singular vectors of a matrix span its column space, and the right singular vectors span its row space.
```{definition, column_row_space, name="Column (row) Space (Definition 8.1 page 192 [@schlesinger])"}
\cite[Definition 8.1 page 192]{schlesinger} Let $A$ be a $\mathbb{R}^{n\times m}$ matrix. The column (row) space of $A$ is the space spanned by the column (row) vectors of $A$. Its dimension is equal to the number of linearly independent columns (rows) of $A$.
```
The number $r$ of singular values and singular vectors of a matrix is its rank.
```{definition, rank, name="Rank of a matrix (Definition 8.3, Proposition 8.4 page 193-194 from [@schlesinger]) "}
The rank of a matrix is the number of linearly independent rows/columns of the matrix. If the matrix belongs to the $\mathbb{R}^{n\times m}$ space, the rank is less than or equal to $min(n,m)$. A matrix is said to be full rank if its rank is equal to $min(n,m)$.
```
The dimension of the null-space is the number of linearly-dependent columns. For a rank $k$ matrix, the Moore-Penrose pseudo-inverse is defined as $\sum_{i}^k \frac{1}{\sigma_i}u_i v_i^T$. Another relevant property of SVD is that the nonzero singular values and the corresponding singular vectors are the nonzero eigenvalues and eigenvectors of the matrix
$\begin{pmatrix}
0 & A \\
A^T & 0
\end{pmatrix}$:
$$ \begin{pmatrix}
0 & A \\
A^T & 0
\end{pmatrix} \begin{pmatrix}
u_i \\
v_i
\end{pmatrix}
= \sigma_i \begin{pmatrix}
u_i \\
v_i
\end{pmatrix}
$$
With $s(A)$, or simply $s$, we denote the sparsity of $A$, that is, the maximum number of non-zero elements in a row.
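The sketch below, on an arbitrary rectangular matrix, uses base R's `svd()` and checks the relation between the singular values of $A$ and the eigenvalues of $A^TA$ discussed above.
```{r}
set.seed(4)
A <- matrix(rnorm(20), nrow = 5, ncol = 4)

s <- svd(A)                                    # A = U diag(d) V^T
max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))     # reconstruction error

s$d^2                                          # squared singular values ...
eigen(t(A) %*% A)$values                       # ... match the eigenvalues of A^T A
```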
### Singular vectors for data representation
<!--
# TODO Adjust citations and labels
-->
Singular values and singular vectors provide important information about matrices and allow us to speed up certain kinds of calculations. Many data analysis algorithms, such as Principal Component Analysis, Correspondence Analysis, and Latent Semantic Analysis, which will be further investigated in the following sections, are based on the singular value decomposition of a matrix.
To begin with, the SVD representation of a matrix allows us to better understand some matrix norms, like the spectral norm and the Frobenius norm [@linearalgebragilbert].
::: {.definition #spectral_norm name="l2 (or Spectral) norm"}
Let $A \in \mathbb{R}^{n\times m}$ be a matrix. The $l_2$ norm of $A$ is defined as $\|A\|_2 = \max_{x \neq 0}\frac{\|A x \|}{\|x\|}$.
:::
It is pretty easy to see that $\|A\|_2 = \sigma_{max}$, where $\sigma_{max}$ is the greatest singular value of $A$. In particular, if we consider again the matrix $A =U \Sigma V ^T$ as a linear transformation, we see that $U$ and $V ^T$ only rotate vectors $||U x ||=||x ||$, $||V x ||=||x ||$ while $\Sigma$ changes their magnitude $||\Sigma x || \leq \sigma_{max}||x ||$. For this reason, the $l_2$ Norm of a matrix is also referred to as the Spectral Norm. During the rest of the work we will also use the notation $||A ||$ to refer to the Spectral Norm.
Another important matrix norm that benefits from SVD is the Frobenius norm, defined in the following way.
::: {.definition name="Frobenius norm"}
Let $A \in \mathbb{R}^{n\times m}$ be a matrix. The Frobenius norm of $A$ is defined as $\|A\|_F = \sqrt{\sum_i^n\sum_j^m a_{ij}^2}$.
:::
It can be shown that also this norm is related to the singular values.
::: {.proposition}
The Frobenius norm of a matrix $A \in \mathbb{R}^{n\times m}$ is equal to the square root of the sum of squares of its singular values.
$$\|A\|_F = \sqrt{\sum_i^r \sigma_{i}^2}$$
:::
```{proof}
$$||A ||_F = \sqrt{\sum_i^n\sum_j^m a_{ij}^2} = \sqrt{Tr[A A ^T]} = \sqrt{Tr[(U \Sigma V ^T)(U \Sigma V ^T)^T]} =$$ $$\sqrt{Tr[U \Sigma V ^TV \Sigma U ^T]} = \sqrt{Tr[U \Sigma \Sigma U ^T]} = \sqrt{Tr[U \Sigma ^2U ^T]} = \sqrt{\sum_{i=1}^{r}\sigma_i^2}$$
From the cyclic property of the trace $Tr[A B ] = Tr[B A ]$ it follows that $Tr[U \Sigma ^2U ^T] = Tr[U ^TU \Sigma ^2] = Tr[\Sigma ^2]$, which is the sum of the squares of the singular values $\sum_{i=1}^{r}\sigma_i^2$.
```
Another interesting result about the SVD of a matrix is known as the Eckart–Young–Mirsky theorem.
```{theorem, eckart-young-mirsky, name="Best F-Norm Low Rank Approximation"}
\cite{eckart1936approximation}\cite{mirsky1960symmetric}
Let $A \in \mathbb{R}^{n \times m}$ be a matrix of rank $r$ and singular value decomposition $A = U \Sigma V ^T$. The matrix $A ^{(k)} = U ^{(k)}\Sigma ^{(k)}V ^{(k)T}$ of rank $k \leq r$, obtained by zeroing the smallest $r-k$ singular values of $A$, is the best rank-k approximation of $A$.
Equivalently, $A ^{(k)} = \operatorname{argmin}_{B :rank(B )=k}(\|A - B \|_F)$.
Furthermore, $\min_{B :rank(B )=k}(\|A - B \|_F) = \sqrt{\sum_{i=k+1}^r{\sigma_i^2}}$.
```
To get a clearer understanding of this result, we could rewrite $A = U \Sigma V ^T = \sum_i^r\sigma_iu _iv _i^T$ and notice that matrix $A$ is the sum of $r$ matrices $u _iv _i^T$ each scaled by a scalar $\sigma_i$.
In practice, SVD decomposes matrix $A$ in matrices of rank one, ordered by importance according to the magnitude of the singular values: the smaller the $\sigma_i$, the smaller is the contribution that the rank-1 matrix gives to the reconstruction of $A$.
When the smallest singular values are set to 0, we still reconstruct a big part of the original matrix, and in practical cases, we will see that matrices can be approximated with a relatively small number of singular values.
Unfortunately though, calculating the singular vectors and singular values of a matrix is a computationally intensive task. Indeed, for a matrix $A \in \mathbb{R}^{n\times m}$ the cost of the exact SVD is $O\left( min(n^2m, nm^2) \right)$. Recently, approximate methods have been developed that compute the Eckart-Young-Mirsky approximation of a matrix in
time $O(knm)$, where $k$ is the rank of the output matrix [@partridge1997fast], or in times that scale super-linearly in the desired rank and one dimension of the input matrix [@bhojanapalli2014tighter].
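To make the Eckart-Young-Mirsky statement concrete, the sketch below (on an arbitrary random matrix and an arbitrary target rank) builds the rank-$k$ truncation from the SVD and checks that its Frobenius error equals $\sqrt{\sum_{i>k}\sigma_i^2}$.
```{r}
set.seed(5)
A <- matrix(rnorm(50 * 30), 50, 30)
s <- svd(A)

k  <- 5                                                  # target rank, chosen arbitrarily
Ak <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])    # best rank-k approximation

norm(A - Ak, type = "F")        # Frobenius error of the truncation ...
sqrt(sum(s$d[-(1:k)]^2))        # ... equals sqrt of the sum of the discarded sigma_i^2
```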
## Useful theorems around linear algebra
- [Gershgorin circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem)
- [Perron-Frobenius theorem](https://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem)
- [Sherman-Morrison formula](https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula)
## Inequalities
[From here](https://en.wikipedia.org/wiki/Bernoulli%27s_inequality#Related_inequalities).
```{theorem, bernoulli, name="Bernoulli inequalities"}
- Bernoulli inequality: $\forall n \in \mathbb{N}$ and $x\geq -1$, $$(1+x)^n \geq 1+nx$$ (Reader, check what happens if $n$ is even or odd!)
- Generalized Bernoulli inequality: for every real $r \geq 1$ or $r \leq 0$, and $x\geq -1$, $$(1+x)^r \geq 1+rx.$$
For $0 \leq r\leq 1$, $$(1+x)^r \leq 1+rx$$
- Related inequality: for any $x \in \mathbb{R}$ and $r>0$: $$(1+x)^r \leq e^{rx}$$
```
```{theorem, jensen, name="Jensen inequality"}
Let $X$ be a random variable with $\mathbb{E}|X| < \infty$ and $g : \mathbb{R}\to \mathbb{R}$ a real continuous convex function. Then
$$g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].$$
```
A [mnemonic trick](https://math.stackexchange.com/questions/2364116/how-to-remember-which-function-is-concave-and-which-one-is-convex) to remember the difference between convex and concave. Conv*EX* ends with *EX*, as the word *EX*ponential, [which is convex](https://math.stackexchange.com/questions/702241/how-to-prove-that-ex-is-convex).
```{theorem, ljapunov, name="Ljapunov's inequality"}
For any real $0 \leq p \leq q$ and any real random variable, $\|X\|_p \leq \|X\|_q$.
```
```{theorem, holder, name="Hölder's inequality"}
Let $p,q>1$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$. If $\|X\|_p < \infty$ and $\|Y\|_q < \infty$, then
$$\mathbb{E}[|XY|] \leq \|X\|_p \cdot \|Y\|_q.$$
```
```{theorem, minkowski, name="Minkowski's inequality"}
Let $p>1$, $\|X\|_p < \infty$ and $\|Y\|_p < \infty$. Then:
$$\|X+Y\|_p \leq \|X\|_p + \|Y\|_p.$$
```
<!-- add cite to bristol's lecture notes on martingales -->
## Trigonometry
Always have in mind the following Taylor expansion:
```{theorem, taylor-exponential, name="Taylor expansion of exponential function"}
$$e^{x} = \sum_{k=0}^\infty \frac{x^k}{k!}$$
Note that this series is convergent for all $x$
```
From that, it is easy to derive that:
$$e^{\pm ix}= \cos (x) \pm i \sin(x) $$
::: {.theorem #euler name="Nunzio's version of Euler's identity"}
For $\tau = 2\pi$,
$$e^{i\tau} = 1 $$
:::
Because of this, we can rewrite $\sin$ and $\cos$ as:
- $\cos(x) = \frac{e^{ix} + e^{-ix}}{2}$
- $\sin(x)= \frac{e^{ix} - e^{-ix}}{2i}$
Note that we can do something similar to \@ref(thm:taylor-exponential) for matrices. In this case, we *define* the exponential of a matrix via its Taylor expansion:
$$e^A = \sum_{k=0}^\infty \frac{A^k}{k!}$$
The matrix exponential has the following nice properties [@symmetryquantum]:
- $(e^A)^\dagger = e^{A^\dagger}$
- $e^{A\otimes I} = e^A \otimes I$
- if $[A,B] = 0$, then $e^Ae^B = e^{A+B}$.
- $Ue^AU^\dagger = e^{UAU^\dagger}$ for unitary $U$
- $det(e^A)= e^{Tr[A]}$
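As a minimal sketch of the definition above, the function below simply truncates the Taylor series of $e^A$ (adequate for small, well-scaled matrices; a dedicated package such as `expm` should be preferred in practice) and checks the identity $\det(e^A) = e^{Tr[A]}$ numerically.
```{r}
# naive matrix exponential via a truncated Taylor series
mat_exp <- function(A, terms = 30) {
  out  <- diag(nrow(A))          # k = 0 term: the identity
  term <- diag(nrow(A))
  for (k in 1:terms) {
    term <- term %*% A / k       # A^k / k!, built incrementally
    out  <- out + term
  }
  out
}

set.seed(6)
A <- matrix(rnorm(9, sd = 0.3), 3, 3)

det(mat_exp(A))         # det(e^A) ...
exp(sum(diag(A)))       # ... equals e^{Tr[A]}
```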
#### Useful equalities in trigonometry {#subsec:trigonometry}
- $\sin(a)\cos(b) = \frac{1}{2}(\sin(a+b)+\sin(a-b)) = 1/2(\sin(a+b)-\sin(b-a) )$
- $2 \cos x \cos y = \cos (x+y) + \cos (x-y)$
```{exercise}
Derive an expression for $\cos(a+b)$.
```
```{proof}
Recall that $e^{ix} = \cos(x)+i\sin(x)$. Then,
$$e^{i(a+b)} = \cos(a+b)+i\sin(a+b) = e^{ia}e^{ib} = (\cos(a)+i\sin(a))(\cos(b)+i\sin(b))$$
$$= \cos(a)\cos(b) - \sin(a)\sin(b) + i\left(\sin(a)\cos(b) + \cos(a)\sin(b)\right).$$
Equating the real parts gives $\cos(a+b) = \cos(a)\cos(b) - \sin(a)\sin(b)$ (and equating the imaginary parts gives the analogous formula for $\sin(a+b)$).
```
# Series
<!--
# TODO Finish section on series (geometric, telescopic, and so on.. )
# This should be a straight-to-the point section on series, with specific focus on CS.
# These series pops up very often in the analysis of algorithms so having them here is really helpful.
# Ask scinawa the notes he already have on this.
# labels: good first issue, help wanted
-->
<!-- The geometric series is a series with a constant ratio between successive terms. The term $r$ is called the *common ratio* and the term $a$ is the first element of the series. -->
<!-- The sum of the geometric series is a power series for the function $f(x)=\frac{1}{1-x}$. In sum notation we can write $\frac{1}{1-x}=\sum_{n=0}^\infty x^n$. -->
<!-- Note that this equal symbol is valid only under certain conditions, i.e. that $x$ is within the convergence ratio of the series expansion. -->
<!-- ```{theorem} -->
<!-- \textcolor{red}{when $|x|\leq 1$?}. Let $|x|\leq 1$. Then, -->
<!-- $$s= \sum_{k=0}^{n-1} ax^k = a\frac{1-x^{n}}{1-x} $$ -->
<!-- ``` -->
<!-- ```{proof} -->
<!-- \begin{itemize} -->
<!-- \item Start: -->
<!-- $$s = a+ax+ax^2\dots ax^{n} $$ -->
<!-- \item Multiply by $x$ on both sides: -->
<!-- $$xs = ax+ax^2 \dots ax^{n+1} $$ -->
<!-- \item Subtract $xs$ from $s$: -->
<!-- $$s - xs = a- ax^n $$ -->
<!-- $$ s(1-x) = (a-ax^n) $$ -->
<!-- \item Conclude that: -->
<!-- $$ s = \frac{(a-ax^n)}{(1-x)} = a(\frac{1-x^n}{1-x})$$ -->
<!-- \end{itemize} -->
<!-- ``` -->
<!-- \textbf{Remark} -->
<!-- If $n \to \infty$ than $x^n = 0$ and ... we have the most common version of -->
<!-- $$\frac{1}{1-x} $$ -->
<!-- Also, TOADD: $\frac{1}{x-a} = ?$ -->
<!-- <!-- Perhaps include the beautiful image of wikipedia of the geometric series! -->
<!-- So, if we wanted to find the Taylor series of $\frac{1}{x}$ we would only need to find some way of representing the new function via the old one. This can be done by changing $x$ to $(1-x)$ in the sum. So our new series is $\frac{1}{1-(1-x)}=\frac{1}{x}=\sum_{n=0}^\infty (1-x)^n$. -->
<!-- - $\frac{1}{x} = \sum_n^\infty (-1)^n (-1 + x)^n$ -->
<!-- - $\frac{1}{x-1} = -1 -x -x^2 -x^3 = \sum_n^\infty -x^n$ -->
<!-- - $\frac{1}{x+1} = 1 -x + x^2 - x^3 + x^4 =$ -->
<!-- - $\frac{1}{1-x} = \sum_n x^n$ -->
<!-- - $\frac{1}{1+x} = 1 - x + x^2 \dots = t\sum_n (-1)^n x^n$ -->
<!-- #### Telescopic series -->
# Probability
## Measure theory
```{definition, sigma-algebra, name="Sigma algebra"}
Let $\Omega$ be a set, and $\Sigma$ be a subset of the power set of $\Omega$ (or equivalently, a collection of subsets of $\Omega$). Then $\Sigma$ is a $\sigma$-algebra if:
- $\emptyset\in \Sigma$,
- $\Sigma$ is closed under countable union,
- $\forall S \in \Sigma, \overline{S} \in \Sigma$.
```
Observe that, thanks to de Morgan's laws, we can equivalently define the $\sigma$-algebra to be closed under countable intersection. It is also common to conflate $\Sigma$ and $(\Omega, \Sigma)$, and call both a $\sigma$-algebra.
```{definition, measurable-space, name="Measurable space"}
Let $\Omega$ be a set, and $\Sigma$ a $\sigma$-algebra. The tuple $(\Omega, \Sigma)$ is a measurable space (or Borel space).
```
```{definition, measurable-function, name="Measurable function"}
Let $(\Omega, \Sigma)$ and $(Y, T)$ be two measurable spaces. A function $f : \Omega \mapsto Y$ is said to be measurable if for every $E \in T$:
$$f^{-1}(E):=\{x \in \Omega | f(x) \in E \} \in \Sigma$$
```
A measurable function is a function between the underlying sets of two measurable spaces that preserves the structure of the spaces: the preimage of any measurable set is measurable. This is in [direct analogy](https://en.wikipedia.org/wiki/Measurable_function) to the definition that a continuous function between topological spaces preserves the topological structure: the preimage of any open set is open.
<!-- https://ece.iisc.ac.in/~parimal/2015/proofs/lecture-17.pdf -->
```{definition, continious-function, name="Continuous function"}
Let $(X, \mathbb{X})$ and $(Y, \mathbb{Y})$ be two topological spaces. A function $f : X \mapsto Y$ between the two topological spaces is said to be continuous if the inverse image of every open subset of $Y$ is open in $X$. In other words, if $V \in \mathbb{Y}$, then its inverse image $f^{-1}(V) \in \mathbb{X}$.
```
```{definition, measure-space, name="Measure space"}
The tuple $(\Omega, \Sigma, \mu)$ is a measure space if:
- $(\Omega, \Sigma)$ is a measurable space.
- $\mu$ is a measure on $(\Omega, \Sigma)$, that is:
- $\mu : \Sigma \mapsto \mathbb{R}\cup\{+\infty\}$
- non-negativity: $\mu(E) \geq 0 \; \forall E \in \Sigma$
- null empty set: $\mu(\emptyset )= 0$
- countable additivity (or $\sigma$-additivity): for all countable collections $\{E_k \}_{k=1}^\infty$ of pairwise disjoint sets in $\Sigma$,
$$\mu \left(\cup_{k=1}^\infty E_k\right) = \sum_{k=1}^\infty \mu(E_k)$$
```
```{definition, probability-space, name="Probability space"}
The tuple $(\Omega, \Sigma, \mathbb{P})$ is a probability space if:
- $(\Omega, \Sigma)$ is a measurable space ($\Omega$ is the set of *outcomes* of the experiment, and $\Sigma$ is the set of *events*)
- $\mathbb{P}$ is a measure on $(\Omega, \Sigma)$ such that:
- $\mathbb{P} : \Sigma \mapsto [0,1]$
- $\mathbb{P}(\emptyset) = 0$ (null empty set)
- $\mathbb{P}$ is countably additive
- $\mathbb{P}(\Omega)=1$
```
I.e., a probability space is a measure space where the measure of the whole set $\Omega$ is $1$.
```{definition, name="Complete probability space"}
A probability space $(\Omega, \Sigma, \mathbb{P})$ is complete if, for every $B \in \Sigma$ such that $\mathbb{P}(B)=0$ and every $A \subset B$, we have $A \in \Sigma$.
```
```{definition, equivalence-prob-measure, name="Equivalence between probability measures"}
Let $(\Omega, \Sigma, \mathbb{P})$ and $(\Omega, \Sigma, \mathbb{Q})$ be two probability spaces with the same $\Omega$ and $\Sigma$. We say that $\mathbb{P}$ and $\mathbb{Q}$ are equivalent iff for every $A \in \Sigma$, $\mathbb{P}(A)=0 \Leftrightarrow \mathbb{Q}(A)=0$.
```
It basically means that the two measures agree on the possible and impossible events (even if it is pretty strange to call them equivalent).
```{definition, random-variable, name="Random variable"}
A (real-valued) random variable on a probability space $(\Omega, \Sigma, \mathbb{P})$ is a measurable function $X: \Omega \mapsto \mathbb{R}$.
```
<!-- - Joint probability of two events $P(a \cap b) = P(a,b) = P(a|b|)P(a)$ -->
<!-- - Marginal probability $p(a) = \sum_b p(a,b) = \sum_b p(a|b)p(b)$ -->
<!-- - Union of two events $p(a \cup b)$ -->
<!-- - Sum rule -->
<!-- - Rule of total probability -->
<!-- - Conditional probability -->
<!-- - Bayes' Theorem: $$ p(A=a | B=b) = \frac{p(A=a|B=b)}{p(B=b)} = \frac{p(A=a)p(B=b|A=a)}{\sum_{a'} p(A=a')p(B=b|A=a')}$$ -->
Remember [that](https://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/), for a list of numbers $x_1, \dots, x_n$,
- The mode can be defined as $\arg\min_x \sum_i |x_i - x|^0$
- The median can be defined as $\arg\min_x \sum_i |x_i - x|^1$
- The mean can be defined as $\arg\min_x \sum_i |x_i - x|^2$.
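A quick numerical check of the median ($p=1$) and mean ($p=2$) characterizations with base R's `optimize()`, on an arbitrary skewed sample (the mode case needs the $0^0=0$ convention and is omitted); the optimizer returns the minimizers up to its tolerance.
```{r}
set.seed(7)
x <- rexp(101)    # an arbitrary, skewed sample of odd size

optimize(function(m) sum(abs(x - m)), range(x))$minimum   # argmin of sum |x_i - m| ...
median(x)                                                 # ... is (approximately) the median

optimize(function(m) sum((x - m)^2), range(x))$minimum    # argmin of sum (x_i - m)^2 ...
mean(x)                                                   # ... is (approximately) the mean
```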
#### Union bound
The union bound states that the probability of the union of a finite or countable set of events (i.e. the probability that at least one of them happens) is less than or equal to the sum of the probabilities of the events.
<span id="unionbound"> </span>
```{theorem, unionbound, name="Union bound"}
$\forall$ events $A_1 \dots A_n \in \Sigma$:
$$P(\cup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$$
```
<!-- ```{proof} -->
<!-- It would be nice to have a proof by induction here, which is the standard proof for showing the union bound, based on the axioms of probability theory. -->
<!-- ``` -->
```{exercise}
Consider Erdős–Rényi graphs $G(n,p)$ (that is, graphs with $n$ nodes where each pair of nodes is connected by an edge with probability $p$, independently). We define the event $B_n$ as the event that a graph $G(n,p)$ has at least one isolated node. Show that $P(B_n) \leq n(1-p)^{n-1}$.
```
```{proof}
Let $A_i$, $i\in[n]$, be the event that node $i$ is isolated. Its probability, from the definition of $G(n,p)$, is $(1-p)^{n-1}$, because node $i$ could be connected, each time with probability $p$, to each of the other $n-1$ nodes. From this, applying the union bound directly, we obtain an upper bound on the probability that there is at least one isolated node in the graph:
$$P(B_n) = P(\cup_{i=1}^n A_i) \leq \sum_i P(A_i) = n(1-p)^{n-1}$$
```
```{exercise, example-algorithm-unionbound}
Suppose we run a randomized algorithm with success probability $1-\delta$ four times. Can you bound the probability that we never fail using the union bound?
```
```{proof}
Let $f_i$ be the event that we fail when running our algorithm for the $i$-th time.
We know that the failure probability $P(f_i)$ is $\delta$ for all $i \in [4]$. Thanks to the union bound we can bound the probability that we fail at least once:
$P(\cup_{i=1}^4 f_i ) \leq \sum_{i=1}^4 \delta = 4\delta$.
It follows that the probability of having 4 successes in a row is *lower* bounded by $1-4\delta$.
Note that we could have also bypassed the union bound and computed this quantity analytically, as the probability of getting 4 successes in a row is $(1-\delta)^4$, which we can expand with the binomial theorem \@ref(thm:binomial-theorem).
```
<!-- Maybe we can be more precise in this exercise, by saying the interpretation of the values of difference between the analytic formula and the bound. Also we should say that the union bound might give "probabilities" bigger than one (i.e. if $\delta < 1/4$ in this case). Overall, I'm not satisfied with the level of clarity of this example, but i think it's nice to have. -->
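A two-line numerical comparison of the union-bound estimate with the exact failure probability, for an illustrative value of $\delta$:
```{r}
delta <- 0.01
4 * delta            # union bound on the probability of at least one failure in 4 runs
1 - (1 - delta)^4    # exact probability of at least one failure
```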
<span id="variance"> </span>
::: {.definition #variance name="Variance"}
\begin{align}
\operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\[4pt]
&= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\[4pt]
&= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\[4pt]
&= \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2
\end{align}
:::
```{exercise}
How can we express the variance as expectation of quantum states? What quantum algorithm might we run to estimate the variance of a random variable $M$?
$$\braket{\psi|M^2|\psi} - (\braket{\psi|M|\psi})^2 $$
Discuss.
```
```{definition, expofamily, name="Exponential Family [@murphy2012machine]"}
A probability density function or probability mass function $p(v|\nu)$ for $v = (v_1, \cdots, v_m) \in \mathcal{V}^m$, where $\mathcal{V} \subseteq \mathbb{R}$, parameterized by $\nu \in \mathbb{R}^p$, is said to be in the exponential family if it can be written as:
$$p(v|\nu) := h(v)\exp \{ o(\nu)^TT(v) - A(\nu) \}$$
where:
- $\nu \in \mathbb{R}^p$ is called the \emph{canonical or natural} parameter of the family,
- $o(\nu)$ is a function of $\nu$ (which often is just the identity function),
- $T(v)$ is the vector of sufficient statistics: a function that holds all the information the data $v$ holds with respect to the unknown parameters,
- $A(\nu)$ is the cumulant generating function, or log-partition function, which acts as a normalization factor,
- $h(v) > 0$ is the \emph{base measure}, which is a non-informative prior and is de facto a scaling constant.
```
#### Bias-variance tradeoff
[Here](http://fourier.eng.hmc.edu/e176/lectures/probability/node9.html) is a nice reference to understand the bias-variance tradeoff
### Boosting probabilities with the "median lemma" (or powering lemma)
In this section we discuss the following result, which is widely known in computer science. It is used not only in writing algorithms, but also in complexity theory.
```{lemma, powering-lemma, name="Powering lemma [@jerrum1986random]"}
Let $\mathcal{A}$ be a quantum or classical algorithm which aims to estimate some quantity $\mu$, and whose output $\widetilde{\mu}$ satisfies $|\mu-\widetilde{\mu} |\leq \epsilon$ except with probability $\gamma$, for some fixed $\gamma < 1/2$. Then, for any $\delta > 0$ it suffices to repeat $\mathcal{A}$ $O(\log 1/\delta)$ times and take the median to obtain an estimate which is accurate to within $\epsilon$ with probability at least $1-\delta$.
```
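The simulation below is a minimal sketch of the powering lemma with made-up numbers: a toy estimator that is $\epsilon$-accurate only with probability $0.75$ is repeated, and the median of the repetitions is $\epsilon$-accurate with a much smaller failure probability than a single run.
```{r}
set.seed(9)
mu  <- 1
eps <- 0.1
# toy estimator: within eps of mu with probability 0.75, far off otherwise
estimate_once <- function() {
  if (runif(1) < 0.75) mu + runif(1, -eps, eps) else mu + sample(c(-1, 1), 1)
}

single       <- replicate(10000, estimate_once())
median_of_15 <- replicate(10000, median(replicate(15, estimate_once())))

mean(abs(single - mu) > eps)         # failure probability of a single run (about 0.25)
mean(abs(median_of_15 - mu) > eps)   # failure probability of the median of 15 runs (much smaller)
```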
<!-- Suppose the following Lemma: -->
<!-- ```{theorem} -->
<!-- There exist an algorithm $A$ output $\overline{\mu}$ that estimates $\mu$ with probability $1/16$ with error $\epsilon m$. That is: -->
<!-- $$P[ |\mu - \overline{\mu} | \leq m \epsilon ] \geq \frac{1}{16} $$ -->
<!-- ``` -->
<!-- Using the median trick, we can boost the probablity of success like this: -->
<!-- ```{theorem, name="[@betterupperboundSDP"} -->
<!-- There exist an algorithm $A'$ that estimates $\mu$ with probability $1-\delta$ with error $\epsilon m$. It is obtained by repeating $O(\log(1/\delta)$ the algorithm $A$. -->
<!-- ``` -->
<!-- ``` -->
<!-- Let's pick the median of $K=O(\log(1/\delta)$ repetitions. Let's $F_i$ for $i \in [K]$ the output of previous algorithm. Let $z_K$ denote the median. -->
<!-- $$Pr(|z_k - \mu| \geq \epsilon m) = Pr(z_k - \geq \epsilon m + \mu ) + Pr(z_k \leq \mu - \epsilon m )$$ -->
<!-- We can upper bound the first term as: \textcolor{red}{detail better why first passage} -->
<!-- \begin{align*} -->
<!-- (*) &\leq \sum_{I \subseteq [K]: |I| \geq K/2} \prod_{i \in I} \mathrm{Pr}\big( (F)_i \geq \epsilon m + \mu \big) \\ -->
<!-- &\leq (|\{I \subseteq [K]: |I| \geq K/2 \}| ) \left(\frac{1}{16}\right)^{K/2} \\ -->
<!-- &= 2^{K-1}\left(\frac{1}{4}\right)^K \\ -->
<!-- &\leq \frac{1}{2}\textcolor{red}{\left(\frac{1}{2}\right)^{\log_2(1/\delta)}} = \frac{1}{2} \delta. -->
<!-- \end{align*} -->
<!-- Analogously, one can show that $Pr\big( z_K \leq - \epsilon m + \mu \big) \leq \frac{1}{2} \delta$. Hence -->
<!-- \[\mathrm{Pr}\big(|z_K - \mu| \geq \epsilon m\big) \leq \delta. \] -->
<!-- ``` -->
## Markov chains
Useful resources: [here](https://www.probabilitycourse.com/chapter11/11_2_4_classification_of_states.php#:~:text=A%20Markov%20chain%20is%20said%20to%20be%20irreducible%20if%20all,always%20stay%20in%20that%20class.), [here](http://www.columbia.edu/~ww2040/4701Sum07/4701-06-Notes-MCII.pdf).
(ref:serfozo2009basics) [@serfozo2009basics]
```{definition, markov-chain, name="Markov chain (ref:serfozo2009basics)"}
Let $(X_t)_{t \in I}$ be a stochastic process defined over a probability space $(\Omega, \Sigma, \mathbb{P})$, for a countable set $I$, where $X_t$ are random variables on a set $\mathcal{S}$ (called state space). Then $(X_t)_{t \in I}$ is a Markov chain if, for any $j \in \mathcal{S}$ and $t \geq 0$, it holds that
$$\mathbb{P}[X_{t+1} = j | X_0, X_1, \dots X_t] = \mathbb{P}[X_{t+1} =j | X_t]$$
and for all $j,i \in \mathcal{S}$, it holds that
$$\mathbb{P}[X_{t+1} = j | X_t = i] = p_{ij}$$,
where $p_{ij}$ is the transition probability for the Markov chain to go from state $i$ to state $j$.
```
Less formally, a Markov chain is a stochastic process with the Markov property, for which we can just use a matrix $P$ to identify its transition probability. Most of the time, we will discretize the state space $\mathcal{S}$, so we can label elements of $\mathcal{S}$ with integers $i \in [|\mathcal{S}|]$. This fact will allow us to conflate the (push-forward) measure $\mathcal{P}$ on $\mathcal{S}$ and the matrix $P$.
A state $j$ is said to be *accessible* from $i$ (written as $i \mapsto j$) if $P_{ij}^t > 0$ for some $t$, where $P^t$ is the $t$-th power of the transition matrix $P$. Two states $i,j$ are said to communicate if they are mutually accessible; communication is an equivalence relation between states (relatively simple to prove), and its equivalence classes are called communication classes.
```{definition, irreducible-markov-chain, name="Irreducible Markov chain"}
A Markov chain $(X_t)_{t \in I}$ is irreducible if and only if
- for every pair of states $i,j \in \mathcal{S}$ there exists some integer $t \in I$ such that $p^t_{ij} > 0$, i.e. $P[X_t =j| X_0 = i] > 0$, or equivalently
- there is only one communication class.
The previous conditions are equivalent.
```
In terms of random walks, irreducibility means that: if the graph is undirected, the graph has only one connected component (i.e. it is connected), and if the graph is directed, the graph is strongly connected.
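A small illustration with a made-up three-state transition matrix: the sketch checks a sufficient condition for irreducibility (some power of $P$ has all entries strictly positive) and simulates a short trajectory of the chain.
```{r}
set.seed(10)
P <- matrix(c(0.0, 1.0, 0.0,     # a 3-state transition matrix; rows sum to 1
              0.5, 0.0, 0.5,
              0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)

P3 <- P %*% P %*% P
all(P3 > 0)                      # TRUE: every state can reach every other state

state <- 1                       # simulate 20 steps of the chain starting from state 1
trajectory <- integer(20)
for (t in 1:20) {
  state <- sample(1:3, size = 1, prob = P[state, ])
  trajectory[t] <- state
}
trajectory
```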
## Distributions
[This](http://web.mit.edu/urban_or_book/www/book/chapter7/7.1.3.html) is a beautiful guide that shows you how to draw samples from a probability distribution.
- [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)
## Concentration inequalities
Take a look at [this](https://www.youtube.com/watch?v=Rd8LQbXhWvM) and [this](http://www.stat.rice.edu/~jrojo/PASI/lectures/TyronCMarticle.pdf). Also recall that the Union bound was presented in the section dedicated for probability theory, i.e. Theorem \@ref(thm:unionbound).
### Markov inequality
The Markov inequality is an *upper bound on the probability that a
non-negative function of a random variable is greater than or
equal to a positive constant*. Especially in analysis, people refer to it
as Chebyshev's inequality (sometimes calling it the *first* Chebyshev
inequality, while referring to the "usual" [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality) as the second Chebyshev inequality, or the Bienaymé–Chebyshev inequality).
```{theorem, markov, name="Markov inequality"}
For any non-negative random variable $X$ and $a > 0$, we have that:
- $Pr(X \geq a) \leq \frac{E[X]}{a}$
- $Pr(X \geq aE[X]) \leq \frac{1}{a}$
```
```{proof}
Observe that:
$$E[X] = P(X < a) \cdot E[X|X<a] + P(X \geq a) \cdot E[X|X\geq a]$$
As both of these conditional expectations are non-negative (using the non-negativity hypothesis), we have that
$$E[X] \geq P(X \geq a) \cdot E[X|X\geq a] $$
Now it is easy to observe that $E[X|X\geq a]$ is at least $a$, and by rearranging we obtain:
$$ \frac{E[X]}{a} \geq P(X \geq a) $$
The second statement of the theorem follows by substitution, i.e. setting $b=aE[X]$ and using the previous statement on $Pr(X \geq b)$.
```
A very useful corollary of Markov inequality is the following.
```{theorem, coro-markov, name="Corollary of Markov inequality"}
Let $f$ be a non-negative, monotone non-decreasing function on a space $I$, and let $Y$ be a random variable taking values in $I$. Then, for every $b$ with $f(b)>0$:
$$ P(Y \geq b) \leq \frac{E[f(Y)]}{f(b)} $$
```
### Chebyshev inequality
This inequality tells us that the probability of finding the random
variable $X$ far away from the mean $\mathbb{E}[X]$ is bounded by the variance of
$X$.
```{theorem, chebyshev, name="Chebyshev inequality"}
Let $X$ be a random variable with finite mean $\mu$ and finite variance $\sigma^2$. For $\epsilon > 0$:
$$Pr[|X - \mathbb{E}[X]| \geq \epsilon]\leq \frac{\sigma^2}{\epsilon^2}$$
Moreover, if $k = \epsilon /\sigma$ we can replace $\epsilon$ with $k\sigma$ and obtain:
$$Pr[|X - \mathbb{E}[X]| \geq k\sigma] \leq \frac{1}{k^2}$$
```
```{proof}
Observe that $(X-\mu)^2$ is a non-negative random variable, and that
$$P(|X-\mu| \geq \epsilon) = P((X-\mu)^2 \geq \epsilon^2).$$
Since $(X-\mu)^2$ is non-negative, we can apply the Markov inequality to get:
$$P(|X-\mu| \geq \epsilon) = P((X-\mu)^2 \geq \epsilon^2) \leq \frac{E[(X-\mu)^2]}{\epsilon^2} = \frac{Var(X)}{\epsilon^2}$$
```
<!-- # check if linearly smaller is correct/not confusing here-->
It is very useful to see what happens when we define a new random
variable $Y$ as the sample mean of $n$ independent and identically distributed
(i.i.d.) random variables $X_1 \dots X_n$:
$Y= \frac{1}{n}\sum_i^n X_i$. The expected value of $Y$ is the same as
the expected value of each $X_i$, but the variance is now linearly smaller in
the number of samples:
$$E[Y] = \frac{1}{n} \sum_i^n E[X_i] = \mathbb{E}[X_i] \text{ for any } i$$
$$Var[Y] = \frac{1}{n^2} \sum_i^n \text{Var}[X_i] = \frac{Var[X_i]}{n} \text{ for any } i$$
This allows us to obtain the following bound:
:::{.theorem #chebyshev-mean name="Chebyshev inequality for sample mean"}
Let $Y= \frac{1}{n}\sum_i^n X_i$. Then,
$$Pr[|Y - E[Y]| \geq \epsilon]\leq \frac{\sigma^2}{n\epsilon^2}$$
:::
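A quick simulation, with uniform random variables as an arbitrary example, showing that the variance of the sample mean shrinks as $1/n$, which is what makes the bound of Theorem \@ref(thm:chebyshev-mean) useful:
```{r}
set.seed(11)
single   <- runif(10000)                            # one X_i per experiment
averaged <- replicate(10000, mean(runif(100)))      # Y = sample mean of n = 100 copies

var(single)       # variance of a single uniform variable (about 1/12)
var(averaged)     # variance of the sample mean (about 1/12 divided by 100)
```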
### Weak Law of large numbers
```{theorem, wlln, name="Weak Law of large numbers"}
Let $X_1, X_2, \dots, X_n$ be i.i.d random variables with a finite expected value $\mathbb{E}[X_i]=\mu < \infty$, and let $\overline{X}$ be the average $\frac{1}{n}\sum_i^n X_i$. Then, for any $\epsilon > 0$, we have that:
$$\lim_{n\to +\infty} P\left( |\overline{X} - \mu | \geq \epsilon \right) = 0$$
```
```{proof}
We know that $E[\overline{X}] = \mu$ and, assuming the $X_i$ also have finite variance $\sigma^2$, that $Var(\overline{X}) = \frac{\sigma^2}{n}$. By the Chebyshev inequality for the sample mean (Theorem \@ref(thm:chebyshev-mean)):
$$P(|\overline{X}-\mu| > \epsilon) \leq \frac{Var(\overline{X})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$
Trivially, $\lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0$, concluding the proof.
```
### Strong Law of Large Numbers
```{theorem, slln, name="(Strong) Law of large numbers"}
Let $X_1,X_2,X_3,\dots,X_n$ be i.i.d random variables with mean $\mu$. Let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then, $\overline{X}_n$ converges almost surely to $\mu$:
$$P(\lim_{n \to \infty}\overline{X}_n = \mu) = 1$$
```
:::{.remark}
The SLLN implies WLLN but not vice-versa.
:::
### Chernoff bound
From [here](https://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf) and [here](http://www.stat.cmu.edu/~arinaldo/Teaching/36709/S19/Scribed_Lectures/Jan29_Tudor.pdf), [here](https://polynomiallybounded.wordpress.com/2017/05/23/how-i-remember-the-chernoff-bound/), and [here](https://www.probabilitycourse.com/chapter6/6_2_3_chernoff_bounds.php)
We focus on a restricted class of random variables, i.e. the case when our random variable is obtained as the sum of *independent* random variables. The central limit theorem says that, as $n \to \infty$, the distribution of $\frac{X-\mu}{\sigma}$ approaches the standard normal distribution $N(0,1)$. However, it does not give any information on the rate of convergence.
:::{.theorem #chernoff-bound name="Chernoff bound"}
Let $X=\sum_i^n X_i$ where $X_i =1$ with probability $p_i$ and $X_i=0$ with probability $(1-p_i)$, and all $X_i$ are independent. Let $\mu=E[X] = \sum_i^n p_i$. Then:
- *Upper tail*: $P(X \geq (1+\delta)\mu) \leq e^{-\frac{\delta^2}{2+\delta}\mu}$ for all $\delta > 0$
- *Lower tail*: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for all $0 \leq \delta \leq 1$
:::
You can find a nice proof [here](https://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf).
<!-- ```{proof} -->
<!-- If $X = \sum_i^n X_i$ where $X_1,X_2, \dots, X_n$ are i.i.d variables, then since the MGF (Moment Generating Function) of the (independent) sum equals the product of the MGFs. Taking our general result from above and using this fact, we get: -->
<!-- \begin{equation} -->
<!-- P(X \geq k)\leq\min_{t>0}\frac{M_x(t)}{e^{tk}}=\min_{t>0}\frac{\prod_{i=1}^n M_{x_i}(t)}{e^{tk}} -->
<!-- \end{equation} -->
<!-- Let’s derive a Chernoff bound for $X∼Bin(n, p)$, which has the form $P(X \geq (1+\delta)\mu)$ for $\delta > 0$. For example with $\delta = 4$ you may want to bound $P(X \geq 5*E[X])$. -->
<!-- Recall $X = \sum_i^n X_i$ where $X_i ~ Ber(p)$ are i.i.d, with $\mu = E[X] = np$. -->
<!-- \begin{align} -->
<!-- M_{X_i(t)} =& E[e^tX_i] \\ -->
<!-- =& e^{t \dot 1}px_i(1)+e^{t \dot 0}px_i(0) \\ -->
<!-- =& pe^t + 1(1-p) \\ -->
<!-- =& 1 + p(e^t - 1) \\ -->
<!-- \leq& e^{p(e^t - 1)} -->
<!-- \end{align} -->
<!-- Now using the result from earlier and plugging in the MGF for the $Ber(p)$ distribution, we get: -->
<!-- \begin{align} -->
<!-- P(X \geq k) \leq& \min_{t>0}\frac{\prod_i^n M_{X_i(t)}}{e^{tk}} \\ -->
<!-- =& \min_{t>0}\frac{(e^{p(e^t-1)})^n}{e^{tk}}\\ -->
<!-- =& \min_{t>0}\frac{e^{np(e^t-1)}}{e^{tk}} \\ -->
<!-- =& \min_{t>0}\frac{e^{\mu(e^t-1)}}{e^{tk}} -->
<!-- \end{align} -->
<!-- For our bound, we want something like $P(X \geq (1+\delta)\mu)$, so our $k = (1+\delta)\mu$. To minimize the RHS and get the tightest bound, the best bound we get is by choosing $t=ln(1+\delta)$ after some terrible algebra (take the derivative and set to 0). We simply plug in k and our optimal value of t to the above equation: -->
<!-- \begin{align} -->
<!-- P(X \geq (1+\delta)\mu)\leq &\frac{e^(\mu(e^ln(1+\delta)-1))}{e^(1+\delta)\mu \ln(1+\delta)} = \frac{e^\mu((1+\delta)-1)}{(e^(ln(1+\delta))^(1+\delta)\mu} \\ -->
<!-- = &\frac{e^(\delta\mu)}{(1+\delta)^(1+\delta)\mu} = (\frac{e^\delta}{(1+\delta)^(1+\delta)})^\mu -->
<!-- \end{align} -->
<!-- Again, we wanted to choose t that minimizes our upper bound for the tail probability. Taking the derivative with respect to t tells us we should plug in $t=ln(1+\delta)$ to minimize that quantity. This would actually be pretty annoying to plug into a calculator. We actually can show that the final $RHS$ is $\leq exp(\frac{-\delta^2\mu}{2+\delta})$ with some more messy algebra. Additionally, if we restrict $0 < \delta < 1$, we can simplify this even more to the bound provided earlier: -->
<!-- $$P(X \geq (1+\delta)\mu) \leq exp(\frac{-\delta^2\mu}{3})$$ -->
<!-- The proof of the lower tail is entirely analogous, except optimizing over $t < 0$ when the inequality flips. It proceeds by taking $t=ln(1-\delta).$ -->
<!-- We also get a lower tail bound: -->
<!-- $$P(X \leq (1-\delta)\mu)\leq \left(\frac{e^-\delta}{(1-\delta)^(1-\delta)} \right)\leq \left(\frac{e^-\delta}{e^-(\delta+\frac{\delta^2}{2})}\right)= \exp(\frac{-\delta^2\mu}{2})$$ -->
<!-- ``` -->
```{theorem, chernoff-bound2, name="Chernoff bound"}
Suppose $X_1, \dots, X_t$ are independent random variables taking values in $\{0,1\}$. Let $M_t= (X_1 + \dots + X_t)/t$ denote their average value and let $\mu = \mathbb{E}[M_t]$. Then for any $0 < \epsilon < 1$,
- (Multiplicative) $Pr[M_t - \mu \leq -\epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{2}}$ and $Pr[M_t - \mu \geq \epsilon \mu] \leq e^{-\frac{t\mu\epsilon^2}{3}}$
- (Additive) $Pr[M_t - \mu \leq -\epsilon ] \leq e^{-2t\epsilon^2}$ and $Pr[M_t - \mu \geq \epsilon ] \leq e^{-2t\epsilon^2}$
```
:::{.remark}
*Trick:* if our random variables are not between $0$ and $1$, we can define $Y_i=X_i/max(X_i)$
:::
### Hoeffding inequality
```{lemma, Hoeffding, name="Hoeffding inequality"}
Let $X_1,\ldots,X_k$ be independent random variables bounded by the interval $[a, b]$. Define the empirical mean of these variables by $\overline{X} = \frac{1}{k} (X_1+\cdots+X_k)$, then
$$Pr(|\overline{X} - \mathbb{E}[X]|\leq \epsilon) \geq 1-2
\exp\left(-\frac{2k\epsilon^2}{(b-a)^2} \right).$$
Consequently, if $k\geq \frac{(b-a)^2}{2\epsilon^{2}}\log(2/\eta)$, then
$\overline{X}$ provides an $\epsilon$-approximation of $\mathbb{E}[X]$ with probability at least $1-\eta$.
```
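To see the inequality at work, the short simulation below (with arbitrary parameters) compares the empirical probability that the sample mean of $k$ Bernoulli variables deviates from its expectation by more than $\epsilon$ with the Hoeffding bound $2\exp(-2k\epsilon^2)$ for $a=0$, $b=1$.
```{r}
set.seed(12)
k <- 200; p <- 0.3; eps <- 0.05

deviations <- replicate(20000, abs(mean(rbinom(k, 1, p)) - p) > eps)
mean(deviations)            # empirical probability of an eps-deviation
2 * exp(-2 * k * eps^2)     # Hoeffding bound: larger, as it should be
```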