There is a large amount of data generated during testing of apps.
## Why this

If our objective is to take action to resolve anomalies in the quickest possible time, simply calculating averages and standard deviations is not enough: it is time-consuming even for an expert to figure out what is wrong among the mass of data being churned out. There is so much going on that it's difficult to understand the relationships among the data, to correlate cause vs. effect, and to follow the cascade of events when a choke-point restricts capacity and causes performance bottlenecks.
The objective is to augment time-consuming and error-prone manual scanning of metrics with a program that recognizes patterns and raises alerts.
- As performance tests run (and in production), proactively identify when a blocking condition is imminent. This is done by a "continuous" (periodic) scan through metrics that calculates leading indicators such as the trend of disk space usage, the ratio of memory consumed per user, etc. This is a step ahead of reactively recognizing when a threshold for action has already been crossed, such as response time degrading or running out of memory, CPU, or disk space.
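The leading-indicator idea above can be sketched as follows. This is a minimal illustration, not the author's implementation: the disk-usage samples, the 100 GB capacity, and the 48-hour alert threshold are all invented. It fits a least-squares line through recent usage samples and projects when the disk would fill.

```python
# Sketch: a leading-indicator check (hypothetical data and thresholds).
# Fits a least-squares line through recent disk-usage samples and
# estimates how many hours remain before the disk fills up.

def hours_until_full(samples, capacity_gb):
    """samples: list of (hour, gb_used) observations, oldest first."""
    n = len(samples)
    mean_x = sum(h for h, _ in samples) / n
    mean_y = sum(g for _, g in samples) / n
    # Least-squares slope: growth in GB per hour.
    num = sum((h - mean_x) * (g - mean_y) for h, g in samples)
    den = sum((h - mean_x) ** 2 for h, _ in samples)
    slope = num / den
    if slope <= 0:
        return None  # usage flat or shrinking; nothing imminent
    latest_hour, latest_gb = samples[-1]
    return (capacity_gb - latest_gb) / slope

usage = [(0, 40), (1, 42), (2, 44), (3, 46), (4, 48)]  # 2 GB/hour growth
remaining = hours_until_full(usage, capacity_gb=100)
if remaining is not None and remaining < 48:
    print(f"ALERT: disk projected full in {remaining:.0f} hours")
```

The same slope calculation applies to any metric with a roughly linear trend, such as memory consumed per user.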
- As Selenium runs, have it output timings for each transaction. When a transaction takes a sudden jump, raise an alert. This requires a constant scan comparing each transaction against its previous history.
-
Also, we want to avoid inundating human reviewers with more alerts than they can handle, so the program also needs to prioritize.
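Both ideas — flagging a sudden jump against a transaction's own history, then ranking alerts so reviewers see the worst first — can be sketched together. The timings, the 3-sigma threshold, and the top-5 cap are illustrative assumptions, not values from the original.

```python
# Sketch: flag a sudden jump in a transaction's timing against its own
# history, then rank alerts so reviewers see the most severe first.
# The 3-sigma threshold and top-5 cap are illustrative, not tuned values.
from statistics import mean, stdev

def check_jump(history, latest, sigmas=3.0):
    """Return severity (sigmas above the historical mean) or None."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return None
    z = (latest - mu) / sd
    return z if z >= sigmas else None

timings = {
    "login":    ([1.1, 1.0, 1.2, 1.1, 1.0], 1.15),  # normal variation
    "checkout": ([2.0, 2.1, 1.9, 2.0, 2.0], 5.0),   # sudden jump
}
alerts = []
for name, (history, latest) in timings.items():
    severity = check_jump(history, latest)
    if severity is not None:
        alerts.append((severity, name, latest))

# Prioritize: most severe first, capped so humans are not inundated.
for severity, name, latest in sorted(alerts, reverse=True)[:5]:
    print(f"ALERT {name}: {latest}s is {severity:.1f} sigma above normal")
```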
Data output from a load test run may be filtered so just the "steady state" or peak values are analyzed.
## Statistics

We start with traditional statistics by calculating the **95th percentile**, not by selecting some data point value that can vary wildly from run to run, but by deriving a statistical formula so that we can compare results in a statistically valid way. For example, when we report a response time of 14 seconds at the 95th percentile, we can also provide the percentage confidence, and how large a reduction would need to be before chance alone could explain the difference. This is done using a linear regression calculation such as http://templates.prediction.io/RAditi/PredictionIO-MLLib-LinReg-Template
BTW, data fed into these calculations covers only the middle "steady state" portion of runs, which excludes ramp-up and ramp-down.
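One way to get a statistically comparable 95th percentile with a confidence interval is a bootstrap — a different technique from the linear-regression template linked above, shown here only as an illustration. The steady-state response times are synthetic, and the percentile/CI helpers are hypothetical names.

```python
# Sketch: an interpolated 95th percentile plus a bootstrap confidence
# interval, so two runs can be compared in a statistically valid way.
# The steady-state data here is synthetic (seconds).
import random

def percentile(data, pct):
    """Linear-interpolation percentile (pct in 0..100)."""
    s = sorted(data)
    k = (len(s) - 1) * pct / 100.0
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def bootstrap_ci(data, pct=95, reps=2000, alpha=0.05, seed=42):
    """Resample the run many times; report the spread of the percentile."""
    rng = random.Random(seed)
    estimates = sorted(
        percentile([rng.choice(data) for _ in data], pct)
        for _ in range(reps)
    )
    return (estimates[int(reps * alpha / 2)],
            estimates[int(reps * (1 - alpha / 2)) - 1])

rng = random.Random(7)
steady_state = [rng.gauss(10, 2) for _ in range(500)]  # ramp-up/down excluded
p95 = percentile(steady_state, 95)
lo, hi = bootstrap_ci(steady_state)
print(f"95th percentile: {p95:.2f}s (95% CI {lo:.2f}-{hi:.2f})")
```

If a later run's confidence interval does not overlap this one, the change is unlikely to be due to chance alone.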
## Additional resources on machine learning

There are several courses on Coursera.org teaching Machine Learning.
A) The course by Andrew Ng makes use of Matlab and Octave for audio processing.
- Videos
- Lecture notes
- Wiki of Discussions, video subtitles maintained by students
- https://isotope11.com/blog/continuous-deployment-at-isotope11-an-update
Code written for Python 2.7 cannot run under Python 3.5; thus python3 is the command instead of python.
- https://docs.python.org/2/index.html contains tutorials and docs.
Python can be used either for interactive “workbench” applications or embedded into other software and reused.
Scikit-learn builds on top of existing Python packages NumPy, SciPy, and matplotlib. Its Regression predicts a continuous-valued attribute associated with an object such as in the stock market.
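As a concrete illustration of scikit-learn predicting a continuous value, the sketch below fits a line through invented load-test measurements (concurrent users vs. response time). This assumes scikit-learn and NumPy are installed; the data is made up.

```python
# Sketch: predicting a continuous-valued output with scikit-learn's
# LinearRegression (response time from concurrent users; data invented).
import numpy as np
from sklearn.linear_model import LinearRegression

users = np.array([[10], [20], [30], [40], [50]])   # input feature
response_s = np.array([1.0, 1.9, 3.1, 4.0, 5.1])   # observed output

model = LinearRegression().fit(users, response_s)
predicted = model.predict(np.array([[60]]))[0]
print(f"Predicted response at 60 users: {predicted:.2f}s")
```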
The ML class from the University of Washington at Coursera.org claims that the Pandas (Python Data Analysis) open-source library does NOT scale, so it is instead based on the comparable open-source SFrame API together with the commercial (1 year free) GraphLab Create from Dato.com, where instructor Carlos is CEO.
It can be run within AWS EC2.
IPython Notebook (now part of the Jupyter project) combines code, plots, and text in a single web page (after the installation described below, at the default http://localhost:8888).
The Jupyter web UI enables several clusters (engines) to be started, which is what provides the scalability.
- Install the SFrame tabular data structure (which provides a "database" stored on disk) by downloading from https://dato.com/learn/gallery/notebooks/introduction_to_sframes.html with source at https://github.com/dato-code/SFrame
- introduction_to_sframes.py - the Python code
- introduction_to_sframes.ipynb - the iPython Notebook
The package includes installation of Anaconda from https://www.continuum.io/downloads which distributes Python with 300+ libraries.
If you're not using the IPython installer, download it, then from the download folder install it using this command line:

```
bash Anaconda2-2.4.1-MacOSX-x86_64.sh
```
- Alternately, install Miniconda for its compactness, though it requires individual packages to be installed separately.
Tasks include:
- Constructing data objects
- Accessing data in a table
- Vector arithmetic
- Saving and loading data tables
- Data table operations
- Manipulating data in a table
- Computing statistics with data tables
To load a csv file into SFrame:
```
import graphlab
sf = graphlab.SFrame('your_data.csv')
```
Some other classes teach R, which is more a declarative language than a programmatic language like Python. R is said to be less scalable and to have fewer deployment tools than Python, so it is less often used for production code in industry.
## Machine Learning (ML) defined

According to https://en.wikipedia.org/wiki/Machine_learning, in 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".
The term refers to the use of computers to recognize patterns (anomalies) and make predictions (as http://how-old.net, Microsoft's age-detecting robot, proves).
KNIME.org takes a code-free approach to predictive analytics. Using a graphical workbench, you wire together workflows from an abundant library of processing nodes, which handle data access, transformation, analysis, and visualization. With KNIME, you can pull data from databases and big data platforms, run ETL transformations, perform data mining with R, and produce custom reports in the end.
There are different approaches to ML.
The unsupervised learning approach tackles problems with little or no idea as to what the results should look like. It derives structure from data by clustering data based on relationships among the variables, calculating the level of association among them. This is not the same as expert systems applying rules.
Supervised learning tasks include "regression" and "classification".
- Regression tries to predict results within a continuous output, mapping input variables to some continuous result function. Predicting the price of a house based solely on square footage is an example of using one input to predict a single output, also known as "univariate linear regression." But in real life, the regression models behind kbb.com provide Kelley Blue Book prices of automobiles based on several variables (miles driven, age of car, accessories, etc.).
- Classification tries to predict results in a discrete output, mapping input variables into categories such as "Yes" or "No".
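A classification example, sketched with scikit-learn's LogisticRegression: mapping two invented per-run metrics to a discrete "Yes"/"No" category. The features, labels, and thresholds are all made up for illustration.

```python
# Sketch: mapping inputs to a discrete "Yes"/"No" category with
# scikit-learn's LogisticRegression (data and labels are invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per test run: [avg response time (s), error rate (%)].
X = np.array([[1.0, 0.1], [1.2, 0.0], [1.1, 0.2],   # healthy runs
              [6.0, 4.0], [7.5, 5.5], [5.8, 3.9]])  # degraded runs
y = ["No", "No", "No", "Yes", "Yes", "Yes"]          # "degraded?"

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.1, 0.1], [7.0, 5.0]]))  # expect ['No' 'Yes']
```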
The supervised learning approach works by providing a training set of known inputs and outputs so the ML program can derive its formula.
- The variable m represents the number of training examples.
- The variable x represents the various input feature variables.
- The variable y represents the various output target variables.
Numbers in superscript slightly above each variable refer to a specific training example.
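The notation above can be made concrete in code. This is a sketch of the standard supervised-learning setup, not from the original: x^(i) and y^(i) are the i-th training example's input and target, and a hypothesis h maps inputs to predicted outputs.

```python
# Sketch of the notation: m training examples, each input x^(i) paired
# with a target y^(i); a hypothesis h maps x to a predicted y.
x = [1.0, 2.0, 3.0]   # x^(1), x^(2), x^(3): input feature values
y = [2.0, 4.0, 6.0]   # y^(1), y^(2), y^(3): output target values
m = len(x)            # m = 3 training examples

def h(theta0, theta1, xi):
    """Hypothesis for univariate linear regression."""
    return theta0 + theta1 * xi

# Mean squared error cost over all m examples:
cost = sum((h(0.0, 2.0, x[i]) - y[i]) ** 2 for i in range(m)) / (2 * m)
print(cost)  # theta1 = 2 fits these examples exactly, so the cost is 0.0
```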
- https://azure.microsoft.com/en-us/services/machine-learning/ is part of the Cortana Analytics Suite: http://gallery.cortanaanalytics.com/
- Google has TensorFlow.
An algorithm such as Gradient Boosted Trees (GBT) has a dozen parameter settings to tweak, such as how to control tree size, the learning rate, the sampling methodology for rows or columns, the loss function, the regularization options, and more.
The linear Support Vector Machine (SVM) algorithm is good at categorizing text.
Dimensionality reduction reduces the number of random variables to consider.
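A common dimensionality-reduction technique is principal component analysis (PCA), sketched here with scikit-learn on synthetic metric columns where two variables are nearly redundant. The data and component count are assumptions for illustration.

```python
# Sketch: collapsing correlated metric columns with scikit-learn's PCA.
# Synthetic data: "load" nearly duplicates "cpu", so two components
# retain almost all of the variance of three columns.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
cpu = rng.rand(100)
load = cpu * 1.5 + rng.normal(0, 0.01, 100)  # nearly redundant with cpu
disk = rng.rand(100)                          # independent metric
X = np.column_stack([cpu, load, disk])

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                  # 3 columns down to 2
print(X_reduced.shape, pca.explained_variance_ratio_.sum().round(3))
```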
## Signal processing

Apache Spark's scalable machine learning library for Java, Scala, and Python (via NumPy) is faster than Hadoop map-reduce programs because it keeps intermediate data in memory rather than writing it to disk between stages. It works in EC2 or on Mesos.
PredictionIO is made of these parts:
- A processing framework written in Scala built on top of Apache Spark with a scalable data collection and analytics layer
- Data storage on top of Apache HBase and Cassandra
- Template gallery of algorithms for customization (Spark MLib, Mahout, etc.)
- Event engine
PredictionIO's DASE (Data Source and Data Preparator, Algorithm, Serving, Evaluator) architecture is the "MVC for Machine Learning" in that it provides web services so predictive engine components can be built with separation of concerns.
A sample call to a rating service is:
```
curl -H "Content-Type: application/json" -d '{ "user": 1, "num": 4 }' http://localhost:8000/queries.json
```

The corresponding query class in Scala:

```
case class Query(
  user: String,
  num: Int
) extends Serializable
```
http://www.mlbase.org/ enables users to obtain results by making queries using a declarative language like SQL.
## Training to develop the algorithm model

To be able to identify anomalies, machine learning algorithms reference what they learned from training activities, such as identifying the expected line (shown as a red line) based on several trials (gray lines). Graphic from http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Thus, there are two sets of data: training and test data.
Making conclusions and recommendations from test run outputs can be triggered from the prediction.io Event Server recognizing runs starting and completing.
To train predictive models ...
- https://docs.prediction.io/support/
- https://github.com/nathanlubchenco/bsw-prediction-io uses data from:
- https://www.kaggle.com/c/otto-group-product-classification-challenge/data
The defaults are:
- Installation path: /Users/wmar/PredictionIO
- Vendor path: /Users/wmar/PredictionIO/vendors
The concepts of linear regression and correlation are not new: Francis Galton described regression in 1885, building on the least-squares method developed by Legendre and Gauss in the early 1800s.
### Zookeeper

Zookeeper (developed at Yahoo! Research; see https://zookeeper.apache.org/doc/trunk/zookeeperOver.html) maintains configuration information, naming, distributed synchronization, and group services. It keeps in memory a hierarchy of znodes containing the state of the configuration of a distributed system, and its data is replicated among an ensemble. It sends heartbeats and watch events using a custom atomic messaging protocol. Changes to znodes (write requests from clients, including ACL changes) are forwarded to a single leader server, which broadcasts them to follower servers; followers receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
## Alternative software

http://jmlr.org/mloss/ provides a list of open-source software for machine learning. Among commercial tools, Skytree's AutoModel automatically determines optimal parameters and algorithms for maximum model accuracy via an easy-to-use interface that guides training, tuning, and testing models while preventing statistical mistakes.
ENCOG for Java and C# is maintained by Heaton Research in St. Louis, Missouri. Samples are at https://github.com/encog
## Additional resources on machine learning

There are several courses on Coursera.org:
- The course by Andrew Ng makes use of Matlab.
- The course from the University of Washington makes use of Python.
- http://colah.github.io/ at http://googleresearch.blogspot.com/
- http://www.infoworld.com/article/2853707/machine-learning/11-open-source-tools-machine-learning.html
- https://www.youtube.com/watch?v=v-91JycaKjc "From the Lab to the Factory: Building a Production Machine Learning" by Josh Wills (@josh_wills), Senior Director of Data Science at Cloudera.
- http://datasciencedojo.com/bootcamp/curriculum/ gives a 5-day overview of Data Science for $2500.