
Perceptron and Backpropagation

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores over 55 on the Flesch Reading Ease scale, which is pretty impressive for such technical material.

The Working of a Perceptron

A perceptron is a simple algorithm that can be trained as a binary classifier using supervised learning. It was invented in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory.

A very simple example of a perceptron contains three layers: an input layer, a hidden layer, and an output layer. Each layer contains a number of nodes, and each node passes values to every node in the next layer. When there is only a single hidden layer, the perceptron can be called a shallow neural network.

A simple perceptron. Image from https://missinglink.ai/guides/neural-network-concepts/perceptrons-and-multi-layer-perceptrons-the-artificial-neuron-at-the-core-of-deep-learning/

Each input value is multiplied by a unique weight as it is passed to each node in the hidden layer. These weights are contained in a matrix whose number of rows equals the number of input nodes and whose number of columns equals the number of nodes in the hidden layer. Additionally, a bias term is passed to the hidden layer, which allows the output curve to be shifted with respect to the origin without affecting its shape. The values from the nodes of the hidden layer are then passed along to the output layer for summation. Finally, an activation function is usually applied to map the summed values onto the required output values, though for simplicity it is omitted in the example considered here.

If the inputs are y1 and y2, the weights are w[1,1], w[1,2], w[1,3], w[2,1], w[2,2], and w[2,3], and the bias value is b, the perceptron in the simple diagram above would calculate the output (ŷ) as:

ŷ = y1*w[1,1] + y1*w[1,2] + y1*w[1,3] + y2*w[2,1] + y2*w[2,2] + y2*w[2,3] + b

In any perceptron or neural network larger than that, writing out all the terms would be cumbersome, to say the least, so the calculation is usually written in summation notation:

ŷ = Σ_j Σ_i yi*w[i,j] + b

where i runs from 1 to the number of inputs and j runs from 1 to the number of nodes in the hidden layer.
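
To make this concrete, here is a minimal sketch in Python (using NumPy) of the forward pass just described: two inputs, three hidden nodes, the hidden values summed directly into the output, a single bias, and no activation function. The variable names and values are my own choices for illustration, not part of the original diagram.

    import numpy as np

    # Inputs y1 and y2, and a 2x3 weight matrix, matching the simple diagram above.
    y = np.array([0.5, -1.0])          # one value per input node
    W = np.array([[0.1, 0.4, -0.2],    # rows correspond to inputs, columns to hidden nodes
                  [0.3, -0.5, 0.8]])
    b = 0.1                            # the bias term

    hidden = y @ W                     # each hidden node j receives the sum over i of y[i] * w[i,j]
    y_hat = hidden.sum() + b           # the output layer sums the hidden values and adds the bias

    # The same double summation written out directly:
    assert np.isclose(y_hat, sum(y[i] * W[i, j] for i in range(2) for j in range(3)) + b)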

Backpropagation

The weights and bias of a model can be learned through repeated applications of a process called backpropagation. To train a model using backpropagation, the weights are first set to random values. For supervised learning, the training data has been labeled with output values known to be valid. Using this training data as inputs, outputs are calculated with the random weights, and these outputs are compared to the labels using a cost function. The cost function is defined as the sum of the losses over the rows of training data, where the loss for a row is a measure of the difference between its output value and its corresponding label. Loss is measured differently for different applications. The lower the total difference between the outputs and the labels, the lower the value of the cost function.
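
As a rough illustration of such a cost function, the sketch below takes the per-row loss to be the squared difference between an output and its label, which is one common choice; as noted above, the loss is chosen per application, so this is an assumption, and the function name and inputs are illustrative.

    import numpy as np

    def cost(outputs, labels):
        # Cost = sum over the rows of training data of a per-row loss.
        # Here the loss for one row is assumed to be the squared difference
        # between the model's output for that row and the row's label.
        outputs = np.asarray(outputs, dtype=float)
        labels = np.asarray(labels, dtype=float)
        return np.sum((outputs - labels) ** 2)

    # Three training rows: the closer the outputs are to the labels, the lower the cost.
    print(cost([0.9, 0.2, 0.4], [1.0, 0.0, 0.0]))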

To improve the predictive value of the model, the weights must be altered to reduce the value of the cost function. Backpropagation describes going back through the algorithm and figuring out how to reduce the cost function by changing the weights. Training describes going back and forth through the algorithm: calculating outputs based on one set of weights, then going back – backpropagating – to reduce the cost function further by changing the weights, then calculating outputs with those new weights, and so forth.

As the cost function is a continuous, differentiable function of the weights, calculus can be used to compute the partial derivative of the cost function with respect to each weight in the matrix of weights. These partial derivatives, together with a learning rate, are used to calculate a new value for each weight that should lead to a lower value of the cost function (at least in the context of the current set of weights, all of which change once the partial derivatives have been calculated).
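
Here is a sketch of the kind of weight update this paragraph implies, assuming the simplified network from above and the squared-error loss from the earlier sketch (both are my assumptions, not statements from the original text): each weight is nudged in the direction opposite to its partial derivative, scaled by the learning rate.

    import numpy as np

    def backprop_step(y, W, b, label, learning_rate=0.01):
        # Assumes y_hat = sum_j sum_i y[i]*w[i,j] + b and loss = (y_hat - label)**2.
        y_hat = (y[:, None] * W).sum() + b                     # forward pass
        error = y_hat - label

        grad_W = 2.0 * error * y[:, None] * np.ones_like(W)    # dLoss/dw[i,j] = 2*error*y[i], for every j
        grad_b = 2.0 * error                                   # dLoss/db = 2*error

        # Move each weight (and the bias) a small step against its partial derivative.
        W_new = W - learning_rate * grad_W
        b_new = b - learning_rate * grad_b
        return W_new, b_new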

Once the weights have been set to their new values, the training process can begin another epoch by calculating the output again, based on the new weights, and then going through another round of backpropagation. The progress of training as it relates to data not contained in the training set can be monitored using a separate set of labeled data kept aside for cross-validation. Ideally, this process is repeated until the cost function can be reduced no further. At this point, the trained model can be evaluated using test data, which is labeled data that was not used for training or cross-validation. If the trained model’s performance is satisfactory, it can be deployed and used to perform inference, which is to compute meaningful outputs using new unlabeled data as inputs.
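
Putting the pieces together, here is a rough sketch of the epoch loop described in this section. It reuses the cost and backprop_step sketches above, assumes three hidden nodes, and stops when the cost on the cross-validation set stops decreasing, which is just one simple reading of "until the cost function can be reduced no further."

    import numpy as np

    def train(train_rows, train_labels, val_rows, val_labels, epochs=1000, learning_rate=0.01):
        # train_rows and val_rows are arrays of shape (rows, inputs).
        rng = np.random.default_rng(0)
        W = rng.normal(size=(train_rows.shape[1], 3))   # random initial weights, 3 hidden nodes assumed
        b = 0.0

        best_val_cost = np.inf
        for epoch in range(epochs):
            # One backpropagation update per row of training data.
            for y, label in zip(train_rows, train_labels):
                W, b = backprop_step(y, W, b, label, learning_rate)

            # Monitor progress on the held-out cross-validation data.
            val_outputs = [(y[:, None] * W).sum() + b for y in val_rows]
            val_cost = cost(val_outputs, val_labels)
            if val_cost >= best_val_cost:
                break                                   # the cost is no longer decreasing
            best_val_cost = val_cost

        return W, b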