Metaphysics and Machine Learning

Perceptron and Backpropagation

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores over 55 on the Flesch Reading Ease scale, which is pretty impressive for such technical material.

The Working of a Perceptron

A perceptron is a simple algorithm that can be trained as a binary classifier using supervised learning. It was invented in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory.

A very simple example of a perceptron contains 3 layers, an input layer, a hidden layer, and an output layer. Each layer contains a number of nodes. Each node passes values to each node in the successive layer. When only a single hidden layer exists, the perceptron can be called a shallow neural network.

A simple perceptron. Image from

Each input value is multiplied by a unique weight as it is passed to each node in the hidden layer. These weights are contained in a matrix having numbers of rows and columns equal to the number of nodes in the input and hidden layers. Additionally, there is a bias factor which is passed to the hidden layer, which allows the output curve to be moved with respect to the origin, without affecting its shape. The values from the nodes of the hidden layer are then passed along to the output layer for summation. Finally, an activation function is usually applied to map the input values onto the required output values, though for simplicity, not in the example being considered here.

If the inputs are y1 and y2, the weights are w[1,1], w[1,2], w[1,3], w[2,1], w[2,2], and w[2,3], and the bias value is b, the perceptron in the simple diagram above would calculate the output (ŷ) as:

ŷ = y1*w[1,1] + y1*w[1,2] + y1*w[1,3] + y2*w[2,1] + y2*w[2,2] + y2*w[2,3] + b

In any perceptron or neural network larger than that, writing out all the terms would be cumbersome, to say the least, and so this is usually done with summation notation:

(Screenshot from the original .pdf)

Where i is the number of inputs and j is the number of nodes in the hidden layer.


The weights and bias of a trained model can be learned by the algorithm through repeated applications of a process called backpropagation. To train a model using backpropagation, random values are initially used for the weights. For supervised learning, data that has been labeled with output values known to be valid is used. Using this training data as inputs, the output is calculated using the random weights. The output that is generated is compared to the labels in the training data using a cost function. The cost function is defined as the sum of the losses for each row of training data, where the loss can be defined as a measure of the difference between that output value and its corresponding label. Loss is measured differently for different applications. The lower the total difference between the outputs and the labels, the lower the value of the cost function.

To improve the predictive value of the model, the weights must be altered to reduce the value of the cost function. Backpropagation describes going back through the algorithm, and figuring out how to reduce the cost function by changing the weights. Training describes going back and forth through the algorithm, calculating outputs based on one set of weights, and then going back – backpropagating – to further reduce the cost function by changing the weights, then calculating inputs with those weights, and so forth.

As the cost function is continuous, calculus can be used to calculate the partial derivative of the cost function with respect to each weight in the matrix of weights. These partial derivatives, along with a learning rate, are used to calculate a new value for each weight that would lead to a lower value of the cost function (at least, when used in the context of the current set of weights, which are all changing as soon as all the partial derivatives have been calculated).

Once the weights have been set to their new values, the training process can begin another epoch by calculating the output again, based on the new weights, and then going through another round of backpropagation. The progress of training as it relates to data not contained in the training set can be monitored using a separate set of labeled data kept aside for cross-validation. Ideally, this process is repeated until the cost function can be reduced no further. At this point, the trained model can be evaluated using test data, which is labeled data that was not used for training or cross-validation. If the trained model’s performance is satisfactory, it can be deployed and used to perform inference, which is to compute meaningful outputs using new unlabeled data as inputs.

Metaphysics and Machine Learning

Hadoop, MapReduce and Spark

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores 43 on the Flesch Reading Ease scale, which is not bad for such technical material.


Hadoop stands for High Availability Distributed Object-Oriented Platform.

Apache Hadoop is a collection of open-source software utilities that allows software applications to distribute both processing and storage across a network of commodity computers — computers that are easily available and cheap. This allows applications to process greater volumes of data, at much higher speed, than even a very expensive single computer could. Hadoop’s architecture makes it scalable in the extreme. Thousands of computers can be recruited for quick processing of vast volumes of data.

Hadoop Architecture

The essence of the architecture of Hadoop is that, even though data and processing are distributed across many computers, the processing of any given segment of data takes place in the same computer where that data is stored. This eliminates the onerous latency involved with processing data that is stored in different computers on the network.

A Hadoop cluster is made up of nodes. So that data can be processed where it is stored, each node participates in both the storage of data, and in the processing of that data. Every node has two layers, an HDFS (Hadoop Distributed File Storage) layer for storage, and a MapReduce layer for processing.

A node can be either a master node or a slave node. A master node is responsible for distributing processing logic and data to the slave nodes, co-ordinating processing, and executing contingency plans if slave nodes fail during processing. To co-ordinate processing, master nodes have a Job Tracker in their MapReduce layer. The JobTracker co-ordinates processing with TaskTrackers in the MapReduce layers of every node in the cluster. Master nodes also have a NameNode in their HDFS layer, which co-ordinates storage with DataNodes in the HDFS layers of all nodes in the cluster.

The primary interface to a Hadoop cluster is a JobClient, which allows users to submit jobs to the cluster and track their progress. The JobClient creates the data splits, submits the job to the JobTracker, monitors progress, and writes the final output to the Job Output directory.

A number of components have been created to assist developers in creating programs that harness Hadoop’s MapReduce functionality without having to interface with the actual Java code that Hadoop is implemented with. Two examples are Apache Pig and Apache Hive, and there are many more. Pig is a procedural language suited for programming and semi-structured data. Hive uses a declarative language that is familiar to SQL experts and is best suited to structured data and reporting. Architecturally speaking, these components, and others which provide an abstraction layer, are said to “sit on top of” Hadoop.


MapReduce is a framework, or programming model, for parallelizing the processing of very large datasets using large networks of computers. The name MapReduce was once a Google property, thoug hit is now used generically. An important component of Apache Hadoop is a robust, powerful, and popular implementation of MapReduce.

MapReduce splits the input data into independent chunks. At the direction of the JobTracker in the master node, these chunks are processed in parallel by all the TaskTrackers in the cluster. These smaller parallel tasks are known as map tasks. The output from these map tasks serve as input for the reduce tasks, which results in the same outcome as if the entire task had been performed without parallelization.

More simply, MapReduce is a tool that allows data processing tasks to be divided into chunks which are processed in parallel on many computers, and then processes the results of those partial tasks into a unified whole.

In versions of Hadoop prior to Hadoop2.0, MapReduce managed all requirements for resource allocation, job scheduling, and computation. Starting in Hadoop2.0, a system called YARN (Yet Another Resource Negotiator) was introduced to allocate and schedule computational resources for MapReduce jobs.

Apache Spark

Spark is a computing engine for parallel processing of Big Data. It supports a number of programming languages, including Python, Java, and R. It includes libraries for a number of popular technologies, including SQL (Spark SQL), data streaming (Structured Streaming), machine learning (MLlib), and graph analytics (GraphX). It can be run in a single computer, or on a cloud-based cluster with thousands of computers.

While Spark can be run as a standalone application without Hadoop, Spark does not include a storage mechanism. To use Spark in a cluster, it must be used in combination with a distributed storage mechanism. The most common approach is to use Spark with Hadoop YARN, which allows Spark to use Hadoop’s HDFS distributed storage mechanism. Another popular option is to use Spark with the Apache Mesos cluster manager and Amazon S3 storage technology.

One of the key advantages of Apache Spark over Hadoop MapReduce is speed. While MapReduce reads and writes to and from disk, Spark stores intermediate data in active memory. This allows it to process data up to 100 times faster than MapReduce for some tasks.

However, Spark’s use of active computer memory gives rise to its most common shortcoming — it can get bogged down by data when it runs out of active memory and has to go to the disk for storage. For extremely large datasets which can exceed the memory capacities of a cluster, Hadoop may be the better option.

Some of Spark’s other key advantages include the ability to process new data coming in in real time using Spark Streaming, a built-in machine learning library MLlib, and the ability to provide real-time data insights due to its in-memory processing.


Apache Hadoop is an open-source system that enables applications to parallelize the processing of vary large datasets across many computers. HDFS manages data storage, and MapReduce orchestrates the distributed processing. To reduce network latency, any given subset of data is processed in the same node where it is stored. Apache Spark is a computing engine that can run on Hadoop and potentially speed up processing over MapReduce through dynamic resource negotiation and the use of active in-memory storage of intermediate results.

Metaphysics and Machine Learning


Metaphysics and Machine Learning.

Welcome to Metaphysics and Machine Learning!

Brand new site. More coming soon!