You can find the descriptions of the Sine1, Sine2, Mixed, Stagger, Circles, and LED data streams below:
- Sine1 (with abrupt concept drift): It consists of two attributes x and y uniformly distributed in [0, 1]. The classification function is y = sin(x). Instances under the curve are classified as positive; otherwise they are classified as negative. At a drift point, the class labels are reversed.
- Sine2 (with abrupt concept drift): It holds two attributes x and y uniformly distributed in [0, 1]. The classification function is y = 0.5 + 0.3 * sin(3 * π * x). Instances under the curve are classified as positive, while the remaining instances are classified as negative. At a drift point, the classification scheme is inverted.
- Mixed (with abrupt concept drift): The dataset has two numeric attributes x and y distributed in the interval [0, 1], together with two boolean attributes v and w. Instances are classified as positive if at least two of the following three conditions are satisfied: v, w, y < 0.5 + 0.3 * sin(3 * π * x). The classification is reversed when drift points occur.
- Stagger (with abrupt concept drift): This dataset contains three nominal attributes, namely size = {small, medium, large}, color = {red, green}, and shape = {circular, non-circular}. Before the first drift point, instances are labeled positive if (color = red) ∧ (size = small). After this point and before the second drift, instances are classified as positive if (color = green) ∨ (shape = circular); finally, after the second drift point, instances are classified as positive only if (size = medium) ∨ (size = large).
- Circles (with gradual concept drift): It has two attributes x and y distributed in [0, 1]. A circle <(x_c, y_c), r_c> is defined by (x - x_c)^2 + (y - y_c)^2 = r_c^2, where (x_c, y_c) is its center and r_c is its radius. Four circles, <(0.2, 0.5), 0.15>, <(0.4, 0.5), 0.2>, <(0.6, 0.5), 0.25>, and <(0.8, 0.5), 0.3>, classify instances in order; instances inside the active circle are classified as positive. A drift happens whenever the classification function, i.e. the circle, changes.
- LED (with gradual concept drift): The objective of this dataset is to predict the digit shown on a seven-segment display, where each digit has a 10% chance of being displayed. The dataset has 7 attributes related to the class and 17 irrelevant ones. Concept drift is simulated by interchanging the relevant attributes.
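As an illustration, a Sine1-style stream can be sketched in a few lines of Python. This is a hypothetical generator written for this description, not the code used to produce the published streams; the function name and parameters are illustrative only:

```python
import numpy as np

def sine1_stream(n=100_000, drift_every=20_000, seed=1):
    """Hypothetical Sine1-style generator: x, y ~ U[0, 1]; an instance is
    positive (1) if it lies under y = sin(x); labels flip at each drift point."""
    rng = np.random.default_rng(seed)
    for i in range(n):
        x, y = rng.random(), rng.random()
        label = int(y < np.sin(x))          # positive if under the curve
        if (i // drift_every) % 2 == 1:     # labels are reversed after odd drifts
            label = 1 - label
        yield x, y, label
```

The other abrupt-drift streams (Sine2, Mixed) follow the same pattern with their respective classification functions swapped in.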
As just mentioned, Sine1, Sine2, Mixed, Stagger, and Circles have only 2 class labels, whereas LED has 10 class labels.
Typically, for experimental purposes, each data stream contains 100,000 or 1,000,000 instances. In the case of 100,000 instances, drift points may be placed at every 20,000 instances in Sine1, Sine2, and Mixed, and at every 33,333 instances in Stagger, with a transition length of w = 50 to simulate abrupt concept drifts. For the Circles and LED data streams, concept drifts happen at every 25,000 instances with a transition length of w = 500 to simulate gradual concept drifts. 10% class noise may also be added to each data stream.
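The transition length w controls how quickly instances switch from the old concept to the new one. One common scheme (the one used by the MOA framework) draws each instance from the new concept with a sigmoid probability centred on the drift point; the sketch below assumes that scheme:

```python
import math
import random

def drift_probability(t, t0, w):
    """Probability of drawing an instance from the NEW concept at time t,
    for a drift centred at t0 with transition length w (MOA-style sigmoid)."""
    return 1.0 / (1.0 + math.exp(-4.0 * (t - t0) / w))

def sample_concept(t, t0, w, rng=random):
    """Pick 'new' with the sigmoid probability, otherwise 'old'."""
    return "new" if rng.random() < drift_probability(t, t0, w) else "old"
```

With w = 50 the transition is nearly a step function (abrupt drift), while w = 500 spreads the change over several hundred instances (gradual drift).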
You may download the zipped files for each group of data streams from DropBox Link. You will find .zip files containing 100 .arff files for each synthetic data stream in data_streams/synthetic/.
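Once extracted, the .arff files can be read with standard tools, for example scipy. The in-memory sample below is a hypothetical stand-in that mirrors the two-attribute layout of the sine streams (the actual attribute and class names in the shipped files may differ); in practice, pass a path to one of the extracted files instead:

```python
import io
from scipy.io import arff
import pandas as pd

# Hypothetical in-memory ARFF sample; in practice pass a file path
# such as one of the extracted files under data_streams/synthetic/.
sample = """@relation sine1
@attribute x numeric
@attribute y numeric
@attribute class {p,n}
@data
0.42,0.11,p
0.73,0.95,n
"""

data, meta = arff.loadarff(io.StringIO(sample))
df = pd.DataFrame(data)
df["class"] = df["class"].str.decode("utf-8")  # nominal values load as bytes
```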
- Electricity contains 45,312 instances with 8 input attributes, recorded every half an hour for two years from the Australian New South Wales Electricity Market. The classification task is to predict a rise (Up) or a fall (Down) in the electricity price. Concept drift may happen because of changes in consumption habits, unexpected events, and seasonality [1].
- Forest CoverType has 54 attributes and 581,012 instances describing 7 forest cover types for 30 × 30 meter cells, obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data for four wilderness areas located in the Roosevelt National Forest of northern Colorado [2].
- Poker hand comprises 1,000,000 instances, where each instance is an example of a hand of five playing cards drawn from a standard deck of 52. Each card is described by two attributes (suit and rank), giving ten predictive attributes. The class indicates the poker hand [3].
You may find pre-processed versions of these data streams (or datasets) at https://moa.cms.waikato.ac.nz/datasets/. (Please note that for these data streams we do not know where/when concept drifts happen, or even whether they contain any concept drift at all [4, 5].)
- Adult: The original dataset has 6 numeric and 8 nominal attributes, two class labels, and 48,842 instances; 32,561 instances are used for building the classification models. The dataset was used to predict whether a person earns an annual income greater than $50,000 [6].
- Nursery: The dataset consists of 5 classes labelled no_recom, recommend, very_recom, priority, and spec_priority. The third and fourth classes occur infrequently, so we removed them from the dataset, resulting in a dataset of 20,000 instances [7].
- Shuttle: The original dataset contains 9 attributes, 7 class labels, and 58,000 instances, and was designed to predict suspicious states during a NASA shuttle mission [8].
I have created three evolving data streams from these three datasets; you can find them in data_streams/other_benchmarks/.
References
- Žliobaitė I (2013) How good is the electricity benchmark for evaluating concept drift adaptation? arXiv preprint arXiv:1301.3524
- Blackard JA, Dean DJ (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24(3):131–151
- Olorunnimbe MK, Viktor HL, Paquet E (2015) Intelligent adaptive ensembles for data stream mining: a high return on investment approach. In: International Workshop on New Frontiers in Mining Complex Patterns, Springer, pp 61–75
- Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 139–148
- Huang DTJ, Koh YS, Dobbie G, Bifet A (2015) Drift detection using stream volatility. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 417–432
- Kohavi R (1996) Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: KDD, vol 96, pp 202–207
- Zupan B, Bohanec M, Bratko I, Demšar J (1997) Machine learning by function decomposition. In: ICML, pp 421–429
- Catlett J (2002) Statlog (Shuttle) data set
Please cite the following papers if you plan to use the data streams:
- Pesaranghader, Ali, et al. "Reservoir of Diverse Adaptive Learners and Stacking Fast Hoeffding Drift Detection Methods for Evolving Data Streams", Machine Learning Journal, 2018.
  Pre-print available at: https://arxiv.org/abs/1709.02457, DOI: https://doi.org/10.1007/s10994-018-5719-z
- Pesaranghader, Ali, et al. "Fast Hoeffding Drift Detection Method for Evolving Data Streams", European Conference on Machine Learning, 2016.
  Pre-print available at: http://iwera.ir/~ali/papers/ecml2016.pdf, DOI: https://doi.org/10.1007/978-3-319-46227-1_7
- Pesaranghader, Ali, et al. "McDiarmid Drift Detection Methods for Evolving Data Streams", International Joint Conference on Neural Networks, 2018.
  Pre-print available at: https://arxiv.org/abs/1710.02030
Ali Pesaranghader © 2020++