The Perceptron Unit for Multiple Linear Regression

This is a perceptron unit with a training algorithm for multiple linear regression analysis, written in Python.

Mathematical Model

A perceptron unit with no activation function can be used to carry out multiple linear regression analysis. (A perceptron unit with a stepwise activation function can instead be used to classify linearly separable data.)

The perceptron unit accepts two inputs. One is a matrix X (1), where m is the number of data points used to train or test the model and n is the number of independent variables, or features, used by the model. Every element in the first column is x_i,0 = 1.

$$X=\begin{bmatrix}1&x_{1,1}&\cdots&x_{1,n}\\1&x_{2,1}&\cdots&x_{2,n}\\\vdots&\vdots&\ddots&\vdots\\1&x_{m,1}&\cdots&x_{m,n}\end{bmatrix}\tag{1}$$

The other input is a column vector of weights Ω (2), which contains a weight, or coefficient, ω_i for each feature in X. The first element is ω_0 = b.

$$\Omega=\begin{bmatrix}\omega_0\\\omega_1\\\vdots\\\omega_n\end{bmatrix}\tag{2}$$

The mathematical model for the perceptron P is simply the product of X and Ω (3), where the first column in X and the first element in Ω give the bias b. Geometrically speaking, the perceptron generates a hyperplane having n slopes in n dimensions, where b is the vertical intercept. This hyperplane can be used to model data that is linearly associated.

$$P=X\Omega=\begin{bmatrix}b+\omega_1x_{1,1}+\cdots+\omega_nx_{1,n}\\b+\omega_1x_{2,1}+\cdots+\omega_nx_{2,n}\\\vdots\\b+\omega_1x_{m,1}+\cdots+\omega_nx_{m,n}\end{bmatrix}\tag{3}$$

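As a concrete illustration, the model in (3) is a single matrix product. Below is a minimal numpy sketch with hypothetical numbers; the array names are illustrative and are not part of "perceptron.py".

import numpy as np

# Hypothetical data set: m = 3 data points, n = 2 features.
features = np.array( [ [ 2.0, 3.0 ],
                       [ 1.0, 5.0 ],
                       [ 4.0, 0.5 ] ] )
X = np.hstack( [ np.ones( ( 3, 1 ) ), features ] )  # every x_i,0 = 1
W = np.array( [ [ 0.5 ], [ 1.0 ], [ -2.0 ] ] )      # [ b, w1, w2 ]^T
P = X @ W  # equation (3): one prediction per data point
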
Regression Analysis

The column vector Y (4) represents the dependent variable. Each element y_i is an empirical measurement: the output of a hypothetical model after inputting the values x_i,1, x_i,2, ..., x_i,n. If the hypothetical model is approximately linear, the perceptron P can be used to generate an actual model that approximates the hypothetical outputs in Y.

$$Y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\tag{4}$$

Measuring how close a model P is to approximating the hypothetical outputs can be achieved by calculating the sum of squared errors SSE (5), which is the sum of the squared differences between the model's output and the hypothetical outputs. These differences are called residuals. Equation (5) is also called the sum of squared residuals.

$$SSE=\sum_{i=1}^{m}\left(p_i-y_i\right)^2\tag{5}$$

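Continuing the numpy sketch above with a hypothetical vector of measured outputs, the SSE in (5) is a one-liner:

Y = np.array( [ [ 1.2 ], [ 0.8 ], [ 4.1 ] ] )  # hypothetical measurements
residuals = X @ W - Y                          # model output minus measurement
SSE = float( ( residuals ** 2 ).sum() )        # equation (5)
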
The model is as close as possible to approximating the hypothetical outputs when the SSE reaches its minimum value. When the number of data points m = 2, a plot of the SSE over p_1 and p_2 is a paraboloid, shown below, and the minimum of the function is apparent: it is the point where the gradient of the SSE with respect to each element in P is precisely zero. Though harder to visualize, the picture is analogous for larger values of m.

[Plot: the SSE surface over p_1 and p_2, a paraboloid with a single minimum.]

Before taking the derivative, it is useful to decompose the matrix operation as shown in (6).

$$SSE=\sum_{i=1}^{m}\left(\sum_{j=0}^{n}x_{i,j}\,\omega_j-y_i\right)^2\tag{6}$$

Taking a partial derivative with respect to the kth element in Ω gives (7) after good use of the chain rule.

$$\frac{\partial SSE}{\partial\omega_k}=2\sum_{i=1}^{m}\left(\sum_{j=0}^{n}x_{i,j}\,\omega_j-y_i\right)x_{i,k}\tag{7}$$

The second-order partial derivative with respect to the kth element in Ω gives (8). It measures the rate of change of the slope, which in this problem is always a positive constant.

$$\frac{\partial^{2}SSE}{\partial\omega_k^{2}}=2\sum_{i=1}^{m}x_{i,k}^{2}\tag{8}$$

It is easy to see how the decomposed equations (7) and (8) can be packed back into their respective matrix equations for all values of k, as shown in (9) and (10).

$$\nabla_{\Omega}SSE=2\,X^{\mathsf{T}}\left(X\Omega-Y\right)\tag{9}$$

$$\nabla_{\Omega}^{2}SSE=2\left(X\odot X\right)^{\mathsf{T}}\mathbf{1}_{m}\tag{10}$$

Here ⊙ denotes the Hadamard (element-wise) product and 1_m is a column vector of m ones, so that (10) stacks the second derivative (8) for every k.

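In numpy, (9) and (10) are again one-liners; this continues with the hypothetical arrays from the sketches above:

gradient = 2.0 * X.T @ ( X @ W - Y )                        # equation (9)
second = 2.0 * ( X * X ).sum( axis = 0 ).reshape( -1, 1 )   # equation (10)
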
Gradient Descent Training Algorithm

In order to find the minimum of the SSE function, the perceptron uses the gradient descent algorithm. This algorithm incrementally updates each coefficient ω_i in Ω by an amount proportional to the negative of the derivative of the SSE with respect to ω_i at Ω, shown in (11). The constant of proportionality r is called the learning rate.

$$\Omega\leftarrow\Omega-r\,\nabla_{\Omega}SSE\tag{11}$$

For example, if a coefficient ω_i in the model P is too large, the derivative of the SSE with respect to ω_i at Ω will be positive, and ω_i is decreased after the increment. If ω_i is too small, the same derivative will be negative, and ω_i is increased after the increment. The magnitude of the derivative, and therefore of the increment, is proportional to how far ω_i is from its optimal value. As Ω approaches the minimum of the SSE, the slopes approach zero and the increments become smaller and smaller.

The vector of weights Ω is initialized by generating a uniformly distributed random number for each of its elements ω_i before training begins.
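
Putting (9) and (11) together, plain gradient descent takes only a few lines. This is a minimal sketch with a hypothetical learning rate and iteration cap, not the exact loop in "perceptron.py":

W = np.random.uniform( size = ( 3, 1 ) )   # random initial weights, one per column of X
r = 0.01                                   # hypothetical learning rate
for _ in range( 10000 ):                   # hypothetical iteration cap
    W = W - r * 2.0 * X.T @ ( X @ W - Y )  # update rule (11)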

The training algorithm used in "perceptron.py" is shown in (12), where ⊘ denotes Hadamard division (element-wise division) and h << 1 is the convergence criterion. This training algorithm has some additional operations not present in (11). The learning rate is divided by the second derivative of the SSE function (10). A larger second derivative implies that the curvature of the SSE function is steeper and narrower, which means a smaller learning rate is necessary to avoid grossly overshooting the minimum. Dividing the learning rate by the second derivative reduces the potential for overshooting. This training algorithm also utilizes the Nesterov accelerated gradient method, which is beyond the scope of this page.
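
Equation (12) is not reproduced here. As a stand-in, the sketch below combines the pieces just described (the gradient (9), Hadamard division by the second derivative (10), a Nesterov lookahead, and the convergence criterion h) into a plausible training loop. The momentum constant and iteration cap are hypothetical, and the loop in "perceptron.py" may differ in its details:

def train( X, Y, r = 0.01, h = 1e-10, momentum = 0.9, max_iter = 100000 ):
    # X is assumed to already contain the leading column of ones.
    W = np.random.uniform( size = ( X.shape[ 1 ], 1 ) )
    V = np.zeros_like( W )                            # momentum ("velocity") term
    second = 2.0 * ( X * X ).sum( axis = 0 ).reshape( -1, 1 )
    for _ in range( max_iter ):
        lookahead = W + momentum * V                  # Nesterov: evaluate the gradient ahead
        gradient = 2.0 * X.T @ ( X @ lookahead - Y )  # equation (9)
        step = r * gradient / second                  # learning rate ⊘ second derivative
        V = momentum * V - step
        W = W + V
        if np.abs( step ).max() < h:                  # stop once the increments are tiny
            break
    return W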

Optimal values for the learning rate r and convergence h can be found by trial and error. If an overflow error is encountered, r is likely too large. If the algorithm does not converge, either r or h may be too small.

The "perceptron.py" module contains some basic optimization methods in its learning algorithm. If the data is not too large, it can train most linear data models in a timely manner, after an optimal value for r is found. However, mileage may vary. Refer to additional learning optimization methods if a sufficiently small value for h cannot be achieved.

Model Statistics

The variance of the model predictions (13) is given by the mean square error MSE of the model after training is complete. The denominator m - n - 1 implies that the number of data points m used to train the model must be greater than the number of features n plus 1, where the 1 accounts for b.

$$MSE=\frac{SSE}{m-n-1}\tag{13}$$

The variance in the weights of the model is given by the variance-covariance matrix (14).

$$\operatorname{Var}(\Omega)=MSE\left(X^{\mathsf{T}}X\right)^{-1}\tag{14}$$

The total sum of squares TSS (15) is the sum of the squared differences between the measured outputs and their mean.

$$TSS=\sum_{i=1}^{m}\left(y_i-\bar{y}\right)^2\tag{15}$$

The coefficient of determination (16) measures the proportion of the variance in the measured outputs that is explained by the model's inputs.

$$R^{2}=1-\frac{SSE}{TSS}\tag{16}$$

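Once training is complete, these statistics are straightforward to compute with numpy. A minimal sketch, assuming the X, Y and trained weights W from the sketches above:

m, n_plus_1 = X.shape
SSE = float( ( ( X @ W - Y ) ** 2 ).sum() )
MSE = SSE / ( m - n_plus_1 )                    # equation (13): m - n - 1
covariance = MSE * np.linalg.inv( X.T @ X )     # equation (14)
STD = np.sqrt( np.diag( covariance ) )          # standard deviation of each weight
TSS = float( ( ( Y - Y.mean() ) ** 2 ).sum() )  # equation (15)
R2 = 1.0 - SSE / TSS                            # equation (16)
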
Try It

To demonstrate, the "perceptron.py" module will be used on data from a 1978 paper entitled "The Correlation of Animal Response Data with the Yields of Selected Thermal Decomposition Products for Typical Aircraft Interior Materials" to fit a multiple linear regression model.

import numpy as np
import pandas as pd

from perceptron import Network

The data is found in the file "air_int_incap.csv", which contains measurements of the time taken to incapacitate animals exposed to the combustion yields of CO, HCN, H2S, HCl, HBr, NO2 and SO2 gases. Time is measured in minutes, and the concentration of each gas is measured in parts per thousand. The perceptron will be used to build a linear model that predicts the approximate length of time it will take to incapacitate an animal exposed to known concentrations of CO and NO2.

df = pd.read_csv( 'air_int_incap.csv' )
df.head()
   category    id  time-to-incapacitation  1000/tti   CO  HCN  H2S   HCl  HBr   NO2  SO2
0         1  20.0                    2.36     423.7  164  6.4  0.0   0.0  0.0  0.26  0.0
1         1  14.0                    2.38     420.2  174  7.5  0.0   0.0  5.0  1.07  0.0
2         1   1.0                    2.61     383.1   96  4.7  0.0  33.0  5.0  0.08  0.0
3         1   2.0                    3.07     325.7  101  7.5  0.0   0.0  7.1  0.43  0.0
4         1  61.0                    3.07     325.7  142  6.8  0.0  27.6  0.0  0.25  0.0

Plotting concentration data over time shows an approximately exponential trend (which might imply a first order rate reaction). Therefore, the natural log of the concentrations was taken to produce an approximately linear trend.

df[ 'CO' ] = df[ 'CO' ].apply( lambda x : np.log( x ) if x else np.nan )
df[ 'NO2' ] = df[ 'NO2' ].apply( lambda x : np.log( x ) if x else np.nan )

The logarithm of zero is not defined, so rows containing undefined values should be removed.

df = df[ [ 'time-to-incapacitation', 'CO', 'NO2' ] ].dropna( axis = 0 )

The input matrix X and vector of hypothetical outputs Y are created.

X = df[ [ 'CO', 'NO2' ] ].to_numpy()
Y = df[ [ 'time-to-incapacitation' ] ].to_numpy()

The perceptron is trained on the data until a near minimum is reached.

network = Network()
network.train( X, Y, r = 0.01, h = 0.0000000001 )

After training, the model and its statistics can be printed to the screen. The R squared value is relatively low because the measured output data has a high amount of variance that was not captured in the model.

network.showModel()
+-------+---------------------------------------------------+
| Model | f( x1, x2 ) = 17.0(3.0) - 2.8(0.5)x1 - 0.7(0.2)x2 |
+-------+---------------------------------------------------+
|   b   | 17.072911943590032                                |
+-------+---------------------------------------------------+
|  w1   | -2.837442793250083                                |
+-------+---------------------------------------------------+
|  w2   | -0.7476215572004509                               |
+-------+---------------------------------------------------+
| STDb  | 2.5440732699408017                                |
+-------+---------------------------------------------------+
| STDw1 | 0.5025604849784758                                |
+-------+---------------------------------------------------+
| STDw2 | 0.18505563654052753                               |
+-------+---------------------------------------------------+
|  MSE  | 4.160156822943904                                 |
+-------+---------------------------------------------------+
|  R^2  | 0.6094678120459519                                |
+-------+---------------------------------------------------+

The plot produced by running "plot.py" shows the measured data on top of the computed model. The circles are the measured data, the wireframe grid is the model, and the red vertical lines are the residuals. A contour plot of the model is projected onto the x-y plane; the animal will die the fastest where the color gradient is most red.

It comes as no surprise that higher concentrations of CO and NO2 lead to a faster death. This model predicts an approximate concentration of gases that will achieve instantaneous death, or at least incapacitation. The residuals show the variance in the measured output that is not captured by the model. Although this leads to a relatively small R squared value, it is clear from the plot that the variance shows no clear trend, so a linear model was most likely the best choice to capture the relationship between the time to incapacitate an animal and the natural logarithms of the CO and NO2 concentrations. You may investigate the plot more thoroughly by running "plot.py".
