This is a perceptron unit with training algorithm for multiple linear regression analysis of data written in Python.
A perceptron unit with no activation function can be used to carry out multiple linear regression analysis. Click here to view how the perceptron unit with a stepwise activation function can be used to classify linearly separable data.
The perceptron unit will accept two inputs. One is a matrix X (1), where m is the number of data points that will be used to train or test the model and n is the number of independent variables, or features, that will be used by the model. Every element in the first column xi,0 = 1.
The other input is a column vector of weights Ω (2), which contain the weight or coefficient ωi for each feature in X. The first element ω0 = b.
The mathematical model for the perceptron P is simply the product of X and Ω (3), where the first column in X and the first element in Ω give the bias b. Geometrically speaking, the perceptron generates a hyperplane having n slopes in n dimensions, where b is the vertical intercept. This hyperplane can be used to model data that is linearly associated.
The column vector Y (4) represents the dependent variable. Each element yi is an empirical measurement and the output of a hypothetical model after inputing values for xi,1, xi,2, ..., xi,n. If the hypothetical model is approximately linear, the perceptron P can be used to generate an actual model that approximates the hypothetical outputs in Y.
Measuring how close a model P is to approximating the hypothetical outputs can be achieved by calculating the sum of squared errors SSE (5), which is the sum of the squared differences between the model's output and the hypothetical outputs. These differences are called residuals. Equation (5) is also called the sum of squared residuals.
The model is as close as possible to approximating the hypothetical outputs when the SSE reaches a minimum value. When the number of data points m = 2, a plot of the SSE over p1 and p2 is a paraboloid, shown below. The minimum of the function is apparent. This is where the absolute value of the gradient of the SSE with respect to each element in P is precisely zero. Though harder to visualize, the plot is analogous for large values of m.
Before taking the derivative, it is useful to decompose the matrix operation as shown in (6).
Taking a partial derivative with respect to the kth element in Ω gives (7) after good use of the chain rule.
The second order partial derivative with respect to the kth element in Ω gives (8). The second order derivative gives the rate of change of the slope, which in this problem is always a positive constant.
It is easy to see how the decomposed equations (7) and (8) can be packed back into their respective matrix equations for all values of k, as shown in (9) and (10).
In order to find the minimum of the SSE function, the perceptron uses the gradient descent algorithm. This algorithm incrementally updates each coefficient ωi in Ω by a number proportional to the negative of the derivative of the SSE with respect to the coefficient ωi at Ω, shown in (11). The constant of proportionality r is called the learning rate.
For example, if a coefficient ωi in the model P is too large, the derivative of the SSE with respect to ωi at Ω will be positive, and ωi is decreased after the increment. If ωi is too small, the same derivative will be negative, and ωi is increased after the increment. The magnitude of the derivative, and therefore the increment, is proportional to how far off ωi is from the optimal value. As the gradient approaches the minimum of the SSE, the slopes will approach zero and the increments will become smaller and smaller.
The vector of weights Ω is initialized by generating a uniformly distributed random number for each of its elements ωi before training begins.
The training algorithm used in "perceptron.py" is shown in (12), where ⊘ denotes Hadamard division (element-wise division) and the convergence h << 1. This training algoirthm has some additional operations not present in (11). The learning rate is divided by the second derivative of the SSE function. A larger second derivative implies that the curvature of the SSE function is steeper and narrower, which means a smaller learning rate is necessary to avoid grossly overshooting the minimum. Dividing the learning rate by the second derivative reduces the potential for overshooting. This training algorithm also utilizes the Nesterov accelerated gradient algorithm, which is beyond the scope of this page.
Optimal values for the learning rate r and convergence h can be found by trial and error. If an overflow error is encountered, r is likely too large. If the algorithm does not converge, either r or h may be too small.
The "perceptron.py" module contains some basic optimization methods in its learning algorithm. If the data is not too large, it can train most linear data models in a timely manner, after an optimal value for r is found. However, mileage may vary. Refer to additional learning optimization methods if a sufficiently small value for h cannot be achieved.
The variance of the model predictions (13) is given by the mean square error MSE of the model after training is complete. The denominator m - n - 1 implies that the number of data points m used for training the model must be greater than the number of features n plus 1. The number 1 is there to account for b.
The variance in the weights of the model is given by the variance covariance matrix (14).
The total sum of squares TSS (15) is the sum over all squared differences between the measured outputs and their mean.
The coefficient of determination (16) measures the proportion of the variance in the measured outputs that is explained by the model's inputs.
To demonstrate, the "perceptron.py" module will be used on data from a 1978 paper entitled "The Correlation of Animal Response Data with the Yields of Selected Thermal Decomposition Products for Typical Aircraft Interior Materials" to fit a multiple linear regression model.
import pandas as pd, numpy as np
from perceptron import Network
The data is found in the file "air_int_incap.csv." The data contains measurements of time taken to incapacitate animals after being exposed to combustion yields of CO, HCN, H2S, HCL, HBR, NO2 and SO2 gas. Time is measured in minutes and the concentration of the gases is measured in parts per thousand. The perceptron will be used to build a linear model that will predict the approximate length of time it will take to incapacitate an animal when exposed to known concentrations of CO and NO2.
df = pd.read_csv( 'air_int_incap.csv' )
df.head()
category | id | time-to-incapacitation | 1000/tti | CO | HCN | H2S | HCl | HBr | NO2 | SO2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20.0 | 2.36 | 423.7 | 164 | 6.4 | 0.0 | 0.0 | 0.0 | 0.26 | 0.0 |
1 | 1 | 14.0 | 2.38 | 420.2 | 174 | 7.5 | 0.0 | 0.0 | 5.0 | 1.07 | 0.0 |
2 | 1 | 1.0 | 2.61 | 383.1 | 96 | 4.7 | 0.0 | 33.0 | 5.0 | 0.08 | 0.0 |
3 | 1 | 2.0 | 3.07 | 325.7 | 101 | 7.5 | 0.0 | 0.0 | 7.1 | 0.43 | 0.0 |
4 | 1 | 61.0 | 3.07 | 325.7 | 142 | 6.8 | 0.0 | 27.6 | 0.0 | 0.25 | 0.0 |
Plotting concentration data over time shows an approximately exponential trend (which might imply a first order rate reaction). Therefore, the natural log of the concentrations was taken to produce an approximately linear trend.
df[ 'CO' ] = df[ 'CO' ].apply( lambda x : np.log( x ) if x else np.NaN )
df[ 'NO2' ] = df[ 'NO2' ].apply( lambda x : np.log( x ) if x else np.NaN )
The logorithm of zero is not defined. Rows containing undefined values should be removed.
df = df[ [ 'time-to-incapacitation', 'CO', 'NO2' ] ].dropna( axis = 0 )
The input matrix X and vector of hypothetical outputs Y are created.
X = df[ [ 'CO', 'NO2' ] ].to_numpy()
Y = df[ [ 'time-to-incapacitation' ] ].to_numpy()
The perceptron is trained on the data until a near minimum is reached.
network = Network()
network.train( X, Y, r = 0.01, h = 0.0000000001 )
After training, the model and its statistics can be printed to the screen. The R squared value is relatively low because the measured output data has a high amount of variance that was not captured in the model.
network.showModel()
+-------+---------------------------------------------------+
| Model | f( x1, x2 ) = 17.0(3.0) - 2.8(0.5)x1 - 0.7(0.2)x2 |
+-------+---------------------------------------------------+
| b | 17.072911943590032 |
+-------+---------------------------------------------------+
| w1 | -2.837442793250083 |
+-------+---------------------------------------------------+
| w2 | -0.7476215572004509 |
+-------+---------------------------------------------------+
| STDb | 2.5440732699408017 |
+-------+---------------------------------------------------+
| STDw1 | 0.5025604849784758 |
+-------+---------------------------------------------------+
| STDw2 | 0.18505563654052753 |
+-------+---------------------------------------------------+
| MSE | 4.160156822943904 |
+-------+---------------------------------------------------+
| R^2 | 0.6094678120459519 |
+-------+---------------------------------------------------+
The plot below shows the measured data on top of the computed model. The circles are the measured data. The wireframe grid is the model. The red vertical lines are the residuals. A contour plot of the model is projected onto the x-y plane. The animal will die the fastest where the color gradient is most red.
It comes with no surprise that higher concentrations of CO and NO2 lead to a faster death. This model predicts an approximate concentration of gases that will achieve instantaneous death, or at least incapacitation. The residuals show the variance in the measured output not captured by the model. Although this led to a relatively small R squared value, it is clear from the plot that this variance shows no clear trend and a linear model was most likely the best choice to capture the relationship between time to incapacitate an animal and the natural logorithm of CO and NO2 concentrations. You may investigate this plot more thoroughly by running "plot.py."