\chapter{Introduction}
\label{c:introduction}
In a straightforward definition, anomaly detection describes a method that compares expected with observed behavior and, in doing so, tries to find the observations that are most dissimilar from the norm. Quantifying the dissimilarity between expected and observed behavior yields a class label or suspicion score that can be used to mark observations for further investigation, or even to make a decision automatically based on a threshold \citep{bolton_statistical_2002}. Training an anomaly detection system most often yields a categorical classifier that can then process a new observation and assign it such a suspicion score \citep{hayes_contextual_2015}.
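As a minimal illustration of this score-and-threshold mechanism (a sketch, not drawn from the cited literature; all identifiers, the data, and the cutoff value are illustrative assumptions), the following Python snippet scores a new observation by its standardized distance from previously seen ``normal'' behavior and flags it once the score exceeds a threshold:
\begin{verbatim}
import statistics

def suspicion_score(history, observation):
    # Standardized distance of the observation from the historical mean;
    # larger scores indicate a stronger deviation from expected behavior.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(observation - mean) / stdev

history = [102.0, 98.5, 101.2, 99.8, 100.4]  # observed "normal" behavior
THRESHOLD = 3.0  # illustrative cutoff; chosen per application in practice

score = suspicion_score(history, 131.7)
flagged = score > THRESHOLD  # mark for investigation or act automatically
\end{verbatim}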
Anomaly detection methods are often used for fraud detection and fraud prevention. Fraud detection refers to cases in which fraud is identified after the fact (e.g., discovering misuse of someone's credit card), whereas fraud prevention tries to stop fraud from happening at all (e.g., blocking a fraudulent transaction before it goes through). Further use cases for anomaly detection include monitoring parts of factory machines or vehicles such as airplanes to ensure their reliability, or observing computer behavior and networks to detect intrusions \citep{chandola_anomaly_2012}. Additionally, malicious actions (e.g., vandalism) in social networks and public knowledge bases could potentially be addressed with automated anomaly detection systems \citep{heindorf_vandalism_2016}.
Modern systems are often highly complex and composed of many components, making it hard to understand the relationships and connections across components. A large number of components leads to a high potential for failure, and this lack of understanding makes it challenging to find the root cause of any such failure. Traditional multi-class classification (i.e., training a system to detect specific failures based on domain knowledge) is therefore not suitable for anomaly detection on such systems, as the potential failures are not enumerable or not even known \citep{pimentel_review_2014}. Anomaly detection on complex modern systems can therefore be interpreted as a one-class classification problem: the ``normal'' class is very well-sampled, and its distribution can be estimated through training, but the anomalous classes are too undersampled to learn explicitly. In one-class classification, once a model of ``normality'' has been built, new events and patterns are evaluated against that model and scored based on the likelihood of them belonging to the ``normal'' class. If that likelihood is sufficiently small (i.e., the anomaly score is sufficiently large), a pattern can be deemed anomalous and handled accordingly \citep{pimentel_review_2014}.
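To make the one-class setting concrete, the following sketch (a toy under simplifying assumptions, not a method from the cited literature) models ``normality'' as a univariate Gaussian fitted to normal training data and deems an observation anomalous when its likelihood under that model falls below a cutoff; all identifiers, data, and the cutoff are illustrative:
\begin{verbatim}
import math
import statistics

class GaussianNormalityModel:
    # Toy one-class classifier: fit a Gaussian to "normal" training data,
    # then score new observations by their density under that model.

    def fit(self, normal_data):
        self.mean = statistics.mean(normal_data)
        self.stdev = statistics.stdev(normal_data)
        return self

    def likelihood(self, x):
        # Gaussian probability density of x under the fitted model
        z = (x - self.mean) / self.stdev
        return math.exp(-0.5 * z * z) / (self.stdev * math.sqrt(2 * math.pi))

    def is_anomalous(self, x, density_cutoff=1e-4):
        # anomalous if being generated by the "normal" class is too unlikely
        return self.likelihood(x) < density_cutoff

model = GaussianNormalityModel().fit([10.1, 9.8, 10.3, 9.9, 10.0, 10.2])
print(model.is_anomalous(10.1))  # False: close to the normal class
print(model.is_anomalous(14.0))  # True: unlikely under the model
\end{verbatim}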
While anomaly detection is a standard term, outlier detection and novelty detection are commonly used synonyms. All of these terms apply to the one-class classification problem as described here, but they stem from different application domains and are not used universally. Outliers often carry the connotation of being undesired data points in a dataset that can have a substantial effect on statistical analyses and should, therefore, be filtered out before any further processing. Similarly, anomalies refer to data points that are irregular and lie far from the mean or other notions of normality in the dataset, but that are potentially important to know about \citep{pimentel_review_2014}.
Past research on anomaly detection has established several different categories of anomaly detection systems \citep{pimentel_review_2014}:
\begin{itemize}
\item \emph{Probabilistic} approaches learn the distribution of the training data and estimate the probability of new data being generated by that distribution using statistical measures.
\item \emph{Distance-based} approaches use a (metric) distance measure to compute clusters or neighborhoods and classify based on membership or neighbors.
\item \emph{Reconstruction-based} approaches learn to predict expected observations and classify based on the error between prediction and observation.
\item \emph{Domain-based} approaches learn a classification boundary and classify based thereon.
\item \emph{Information-theoretic} approaches evaluate whether the addition of new observations alters the entropy (or another measure) of the dataset.
\end{itemize}
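As an illustration of one of these categories, the following sketch implements a simple distance-based scorer (the choice of $k$, the data, and all identifiers are illustrative assumptions): an observation is scored by its mean distance to its $k$ nearest neighbors in the training set, so observations far from every neighborhood receive high scores:
\begin{verbatim}
def knn_anomaly_score(train, x, k=3):
    # Mean distance from x to its k nearest training observations;
    # larger values indicate that x lies far from any neighborhood.
    distances = sorted(abs(t - x) for t in train)
    return sum(distances[:k]) / k

train = [1.0, 1.1, 0.9, 1.05, 0.95, 1.2]
print(knn_anomaly_score(train, 1.0))  # small: inside the normal cluster
print(knn_anomaly_score(train, 5.0))  # large: far from every neighbor
\end{verbatim}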
The main challenges in anomaly detection include the significant imbalance of classes (i.e., there are many more normal observations than anomalies), potential uncertainties in class membership (i.e., an observation might belong to a class only with a certain probability), the high importance of time to detection (i.e., a fraud should be found as quickly as possible), as well as the commercial trade-offs regarding the cost of misclassification errors (i.e., a false negative might be much more costly than a false positive) and between finding and not finding a fraud \citep{bolton_statistical_2002}.
Many existing approaches for anomaly detection do not scale well to large datasets, as the training time grows superlinearly with the number of observations, or the testing time (i.e., the time to make a prediction) grows as more observations arrive \citep{schneider_expected_2016}. Also, many existing approaches assume that the entire dataset fits into memory and are not capable of processing observations as they arrive (i.e., online), which means that dedicated algorithms are needed for streaming anomaly detection \citep{dos_santos_teixeira_data_2010}.
Furthermore, when we process an evolving data stream instead of working on batch data, the potential evolution of normal behavior becomes one of the most crucial issues. For example, when consumers change their spending behavior, the new normal monthly spending on their credit card might lie at a higher level, which an anomaly detection system would need to recognize and adapt to \citep{bolton_statistical_2002}. Additionally, the computational complexity (e.g., an algorithm's usage of computing time and system resources) of algorithms that are applied in a streaming context is often critical, as streaming systems in practice are flooded with a high-volume and high-velocity inflow of new observations. It is generally impractical to store all of the data arriving in a stream, so algorithms should process each observation only once to improve efficiency \citep{schneider_expected_2016, hayes_contextual_2015}.
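The following sketch shows what such a single-pass detector could look like (the exponential forgetting scheme, parameter values, and identifiers are illustrative assumptions, not an algorithm from the cited literature): it keeps only a constant-size summary, updates it exactly once per arriving observation, and slowly adapts its notion of ``normal'' when behavior drifts:
\begin{verbatim}
class StreamingDetector:
    # Single-pass sketch: constant memory, each observation processed once.
    # Exponentially weighted mean and variance let the model follow drift.

    def __init__(self, alpha=0.05, threshold=3.0, warmup=5):
        self.alpha = alpha          # forgetting factor: larger adapts faster
        self.threshold = threshold  # flag scores above this many std. devs.
        self.warmup = warmup        # do not flag during the first updates
        self.n = 0
        self.mean = None
        self.var = 0.0

    def process(self, x):
        self.n += 1
        if self.mean is None:       # first observation initializes the model
            self.mean = x
            return 0.0
        stdev = self.var ** 0.5
        if stdev > 0 and self.n > self.warmup:
            score = abs(x - self.mean) / stdev
        else:
            score = 0.0
        # update the constant-size summary; the observation is then discarded
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return score

detector = StreamingDetector()
for x in [100, 101, 99, 100, 102, 100, 101, 150, 100]:
    print(x, detector.process(x) > detector.threshold)
\end{verbatim}
After the spike at 150, the weighted mean and variance absorb the new observation, so the detector gradually re-centers on a shifted level of ``normal'' spending rather than flagging every subsequent observation.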
Comparing different anomaly detection methods is challenging in terms of both capabilities and performance, as even the definition of what constitutes an anomaly differs significantly across application domains \citep{hayes_contextual_2015}. To improve on that situation and allow for a more coherent evaluation, \cref{c:features} introduces a foundational set of features along which an anomaly detection approach can be examined. Subsequently, \cref{c:approaches} surveys research on streaming anomaly detection, analyzing each approach in terms of the features introduced in \cref{c:features}.