SVM

  • linear separator (can be made non-linear by using a kernel)
  • not prone to the curse of dimensionality
  • not very sensitive to outliers (the soft margin tolerates them)
  • not very prone to overfitting (maximizing the margin acts as regularization)
  • we search for the hyperplane that maximizes the margin between two classes
  • margin: distance between the two margin hyperplanes through the support vectors $=\frac{2}{||w||}$ (so the distance from the separating hyperplane to a support vector is $\frac{1}{||w||}$)
  • we want to find $w$ (and $b$) such that $y_{i}(w^{T}x_{i}+b)\ge 1$ for every training point
  • we want to maximize the margin, that is minimize $||w||$ or $||w||^{2}$ (convex)
  • soft margin: relax the constraint to $y_{i}(w^{T}x_{i}+b)\ge 1-\xi_{i}$ with $\xi_{i}\ge0$
  • $\xi_{i}$ is a slack variable
  • misclassification when $\xi_{i} > 1$, inside the margin when $0<\xi_{i}\le1$
  • or minimize $\frac{1}{2}||w||^{2} + C\sum_{i}\xi_{i}$
  • $C$ is the (positive) penalty on margin violations (see the sketch after this list):
    • $C$ small: soft-SVM
      • large margin
      • tolerates more errors
      • low variance, high bias
    • $C$ big: hard-SVM
      • small margin
      • tolerates no errors (overfitting)
      • low bias, high variance
  • the support vectors are the training points that define the margin hyperplanes (those with $\alpha_{i} \ne 0$ in the dual)
  • $h(x)=\sum_{i=1}^{N}\alpha_{i}y_{i}\,\vec{x_{i}}\cdot\vec{x} + b$
  • $k(x, x') = \langle\varphi(x),\varphi(x')\rangle$
  • The problem with this scalar product is that it is computed in a high-dimensional feature space, which makes the calculation impractical.
  • The kernel trick is therefore to replace the scalar product in the high-dimensional space with a kernel function that is cheap to evaluate. In this way, a linear classifier can easily be turned into a non-linear one. Another advantage of kernel functions is that it is not necessary to specify the transformation $\varphi$ explicitly.
  • $K(\mathbf{x},\mathbf{y})=\exp\left(- \frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2}\right)$ (RBF/Gaussian kernel)
  • for a function to be a valid kernel, there must exist a mapping into a feature space such that the kernel outputs the same result as the dot product of the projected vectors (checked numerically in the sketch below)
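
A minimal sketch (assuming numpy and scikit-learn; the degree-2 polynomial kernel, the toy data, and the chosen $C$ values are only illustrative) that checks the identity $k(x, x') = \langle\varphi(x),\varphi(x')\rangle$ and shows how $C$ trades margin width against training errors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# 1) Kernel trick check: for the degree-2 polynomial kernel k(x, x') = (x.x' + 1)^2
#    on 2-D inputs, an explicit feature map is
#    phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2],
#    so the kernel equals a dot product in that 6-D space without ever building phi.
def poly_kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))

# 2) Effect of C on the soft margin: small C tolerates violations (wide margin,
#    more support vectors, higher bias); large C punishes them (narrow margin,
#    fewer support vectors, higher variance / risk of overfitting).
X, labels = make_moons(n_samples=200, noise=0.25, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, labels)
    print(f"C={C:<6} support vectors={clf.n_support_.sum():<4} "
          f"train accuracy={clf.score(X, labels):.2f}")
```

With small $C$ the model keeps many support vectors and accepts some training errors; with large $C$ it fits the training set more tightly.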

One class SVM

One-Class SVM (used for novelty/outlier detection) is similar, but:

  • instead of using a hyperplane to separate two classes of instances
  • it uses a hypersphere to encompass all of the instances.
  • Now think of the "margin" as referring to the outside of the hypersphere
  • so by "the largest possible margin", we mean "the smallest possible hypersphere".

Get a confidence score

https://prateekvjoshi.com/2015/12/15/how-to-compute-confidence-measure-for-svm-classifiers/
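
The linked post covers this in more detail; a minimal sketch of two common options in scikit-learn (the synthetic dataset is only for illustration):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: signed distance to the hyperplane -- larger magnitude = more confident.
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print(clf.decision_function(X_test[:5]))

# Option 2: Platt scaling -- probability=True fits a sigmoid on the decision values
# (via internal cross-validation) to output calibrated probabilities.
clf_proba = SVC(kernel="rbf", gamma="scale", probability=True).fit(X_train, y_train)
print(clf_proba.predict_proba(X_test[:5]))
```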

Infographic

More