Metaphysics and Machine Learning

Perceptron and Backpropagation

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores over 55 on the Flesch Reading Ease scale, which is pretty impressive for such technical material.

The Working of a Perceptron

A perceptron is a simple algorithm that can be trained as a binary classifier using supervised learning. It was invented in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory.

A very simple example of a perceptron contains three layers: an input layer, a hidden layer, and an output layer. Each layer contains a number of nodes, and each node passes values to every node in the next layer. When there is only a single hidden layer, the perceptron can be called a shallow neural network.

(Figure: a simple perceptron.)

Each input value is multiplied by a unique weight as it is passed to each node in the hidden layer. These weights are contained in a matrix whose numbers of rows and columns equal the number of nodes in the input and hidden layers, respectively. Additionally, a bias value is passed to the hidden layer, which allows the output curve to be shifted with respect to the origin without affecting its shape. The values from the nodes of the hidden layer are then passed along to the output layer for summation. Finally, an activation function is usually applied to map the input values onto the required output values, though for simplicity, not in the example considered here.

If the inputs are y1 and y2, the weights are w[1,1], w[1,2], w[1,3], w[2,1], w[2,2], and w[2,3], and the bias value is b, the perceptron in the simple diagram above would calculate the output (ŷ) as:

ŷ = y1*w[1,1] + y1*w[1,2] + y1*w[1,3] + y2*w[2,1] + y2*w[2,2] + y2*w[2,3] + b

In any perceptron or neural network larger than that, writing out all the terms would be cumbersome, to say the least, and so this is usually done with summation notation:

ŷ = Σᵢ Σⱼ yᵢ*w[i,j] + b

where i ranges over the inputs and j ranges over the nodes in the hidden layer.
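The double summation can be checked numerically. Here is a minimal Python sketch of the two-input, three-hidden-node perceptron above; the input, weight, and bias values are invented purely for illustration:

```python
# Two inputs, three hidden nodes, no activation function (as in the example).
# All numeric values below are invented for illustration.
y = [1.0, 2.0]                        # inputs y1, y2
w = [[0.1, 0.2, 0.3],                 # w[1,1], w[1,2], w[1,3]
     [0.4, 0.5, 0.6]]                 # w[2,1], w[2,2], w[2,3]
b = 0.5                               # bias

# Double summation: each input feeds every hidden node, and all
# hidden values are summed at the output node together with the bias.
y_hat = sum(y[i] * w[i][j]
            for i in range(len(y))
            for j in range(len(w[0]))) + b
print(y_hat)
```

Writing out the six terms by hand, as in the expanded formula above, gives the same result as the nested sum.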


The weights and bias of a trained model can be learned by the algorithm through repeated applications of a process called backpropagation. To train a model using backpropagation, random values are initially used for the weights. For supervised learning, the training data is labeled with output values known to be valid. Using this training data as inputs, the output is calculated using the random weights. The generated output is then compared to the labels in the training data using a cost function. The cost function is defined as the sum of the losses for each row of training data, where the loss is a measure of the difference between an output value and its corresponding label. Loss is measured differently for different applications. The lower the total difference between the outputs and the labels, the lower the value of the cost function.
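As a concrete instance of "loss is measured differently for different applications", here is a sketch using squared-error loss, one common choice; the predictions and labels are invented:

```python
def loss(output, label):
    # Squared error: one common way to measure the difference
    # between a predicted output and its corresponding label.
    return (output - label) ** 2

def cost(outputs, labels):
    # The cost is the sum of the losses over all rows of training data.
    return sum(loss(o, t) for o, t in zip(outputs, labels))

# Invented predictions and labels for illustration; lower cost is better.
print(cost([0.9, 0.2, 0.4], [1.0, 0.0, 0.5]))
```

Other applications might use absolute error or, for classifiers, cross-entropy, but the structure – a per-row loss summed into a cost – stays the same.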

To improve the predictive value of the model, the weights must be altered to reduce the value of the cost function. Backpropagation describes going back through the algorithm and figuring out how to reduce the cost function by changing the weights. Training describes going back and forth through the algorithm: calculating outputs based on one set of weights, then going back – backpropagating – to reduce the cost function by changing the weights, then calculating outputs with the new weights, and so forth.

As the cost function is continuous and differentiable, calculus can be used to calculate the partial derivative of the cost function with respect to each weight in the matrix of weights. These partial derivatives, along with a learning rate, are used to calculate a new value for each weight that should lead to a lower value of the cost function (at least in the context of the current set of weights, all of which change once the partial derivatives have been calculated).
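For the simple perceptron above, with squared-error loss assumed (the original text does not fix a loss), the partial derivatives have a closed form: since ŷ is linear in each weight, dL/dw[i,j] = 2*(ŷ - t)*yᵢ. A sketch of one weight update, with all numeric values invented:

```python
# One gradient-descent step for the two-input, three-hidden-node
# perceptron above, with squared-error loss. All values are invented.
y = [1.0, 2.0]                          # inputs
w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # weights
b = 0.5                                 # bias
t = 3.0                                 # label for this training row
lr = 0.01                               # learning rate

y_hat = sum(y[i] * w[i][j] for i in range(2) for j in range(3)) + b

# Since y_hat is linear in each weight, dL/dw[i][j] = 2*(y_hat - t)*y[i].
for i in range(2):
    for j in range(3):
        w[i][j] -= lr * 2 * (y_hat - t) * y[i]
b -= lr * 2 * (y_hat - t)               # the bias gets its own update

new_y_hat = sum(y[i] * w[i][j] for i in range(2) for j in range(3)) + b
# The updated weights produce an output closer to the label t.
```

For deeper networks with activation functions, these derivatives are obtained by the chain rule rather than in closed form, but the update rule – weight minus learning rate times partial derivative – is the same.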

Once the weights have been set to their new values, the training process can begin another epoch by calculating the output again, based on the new weights, and then going through another round of backpropagation. The progress of training as it relates to data not contained in the training set can be monitored using a separate set of labeled data kept aside for cross-validation. Ideally, this process is repeated until the cost function can be reduced no further. At this point, the trained model can be evaluated using test data, which is labeled data that was not used for training or cross-validation. If the trained model’s performance is satisfactory, it can be deployed and used to perform inference, which is to compute meaningful outputs using new unlabeled data as inputs.
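The whole cycle – forward pass, backpropagation, repeat over epochs while monitoring a held-out set – can be sketched as a small runnable loop. The data, learning rate, and fixed epoch count here are invented, and squared-error loss is assumed:

```python
# A minimal, runnable sketch of the training cycle for the simple
# perceptron above (squared-error loss; all data values are invented).
def forward(y, w, b):
    return sum(y[i] * w[i][j] for i in range(2) for j in range(3)) + b

def cost(rows, w, b):
    return sum((forward(y, w, b) - t) ** 2 for y, t in rows)

train_rows = [([1.0, 2.0], 3.0), ([2.0, 1.0], 2.0)]  # (inputs, label)
val_rows = [([1.5, 1.5], 2.5)]                       # held-out data
w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b, lr = 0.5, 0.01

for epoch in range(100):                 # each pass is one epoch
    for y, t in train_rows:
        err = forward(y, w, b) - t       # forward pass, then backpropagate
        for i in range(2):
            for j in range(3):
                w[i][j] -= lr * 2 * err * y[i]
        b -= lr * 2 * err
    val_cost = cost(val_rows, w, b)      # monitor on cross-validation data

print(cost(train_rows, w, b))            # near zero once training converges
```

In practice the loop would stop when the validation cost can be reduced no further, and the final model would then be scored once on a separate test set before deployment.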

10 Machine Learning Ethics Mini-Essays

Appendix: Data Science Ethics

As a follow-up to the 10 questions about Machine Learning ethics, I tacked on these thoughts about the “Data Science Hippocratic Oath”, addressing the Professor at the end of the document:

I don’t know if you remember, but in the first lecture I made a point of asking about one of the items in the “Data Science Hippocratic Oath”, which said I should not “be overly impressed by Mathematics”. My inner Physicist, who is usually the first to think out loud about Mathematics, was at first baffled by the notion. Without Mathematics, Physics would be pretty lame. Same with Actuarial Science. Most Science I know of wouldn’t even be Science without Math. I’m pretty sure the same is true of Data Science. What’s not to be overly impressed by? You cited some good examples in the lecture, though, that have helped me come to an understanding of what it means. I think it’s not about being impressed by the possibilities that a toolkit like Mathematics opens up. It’s more about one’s impression of individual acts of Mathematics. If something is wrong, or deceptive, or dangerous, or misleading, or seductive, or foolish, or half-baked, or ill-conceived, or malicious, or nonsense, or simply doesn’t lead to insights or solve a problem, it doesn’t get any extra points just because it happens to be Math, too.

My inner Physicist is already pretty comfortable with this idea. Euler’s Identity, for example, is a gobsmackingly impressive act of Mathematics, but only because you can watch, live, in real time, every day, as it helps humans to explain and predict real-world phenomena with blistering precision and reproducibility. The Math that predicted that Bitcoin would be worth $650,000 each by now? Well, sure, it’s Math. It may even be impressively well-conceived and well-executed Math. But it’s clearly nothing to be impressed by. It may be complete, and consistent, and perfect, but if it does not line up with the real world, it’s just noise. Science is the filter through which the impressiveness of an act of Mathematics can be determined, in my book. One of my heroes, Dr. Richard Feynman, has spoken further on this. So I have this as my take-away, and I’d be interested in any feedback you may have: I can be almost alarmingly impressed by Math or Science. But I am pretty much entirely unimpressed by Science that does not stand up under Math, or by Math that does not stand up under Science.