Metaphysics and Machine Learning

Perceptron and Backpropagation

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores over 55 on the Flesch Reading Ease scale, which is pretty impressive for such technical material.

The Working of a Perceptron

A perceptron is a simple algorithm that can be trained as a binary classifier using supervised learning. It was invented in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory.

A very simple example of a multilayer perceptron contains 3 layers: an input layer, a hidden layer, and an output layer. Each layer contains a number of nodes. Each node passes values to each node in the successive layer. When only a single hidden layer exists, the perceptron can be called a shallow neural network.

A simple perceptron. Image from

Each input value is multiplied by a unique weight as it is passed to each node in the hidden layer. These weights are contained in a matrix having numbers of rows and columns equal to the number of nodes in the input and hidden layers. Additionally, there is a bias factor which is passed to the hidden layer, which allows the output curve to be moved with respect to the origin, without affecting its shape. The values from the nodes of the hidden layer are then passed along to the output layer for summation. Finally, an activation function is usually applied to map the input values onto the required output values, though for simplicity, not in the example being considered here.

If the inputs are y1 and y2, the weights are w[1,1], w[1,2], w[1,3], w[2,1], w[2,2], and w[2,3], and the bias value is b, the perceptron in the simple diagram above would calculate the output (ŷ) as:

ŷ = y1*w[1,1] + y1*w[1,2] + y1*w[1,3] + y2*w[2,1] + y2*w[2,2] + y2*w[2,3] + b

In any perceptron or neural network larger than that, writing out all the terms would be cumbersome, to say the least, and so this is usually done with summation notation:

$$\hat{y} = \sum_{i} \sum_{j} y_i \, w_{i,j} + b$$

where i ranges over the inputs and j ranges over the nodes in the hidden layer.
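To make the sum concrete, here is a minimal sketch of the calculation in Python. The function name and the example numbers are my own, chosen only for illustration; the structure (2 inputs, 3 hidden nodes, one bias, no activation function) follows the simple example above.

```python
def perceptron_output(inputs, weights, bias):
    """Sum every input-times-weight product, then add the bias.

    inputs  -- list of input values, one per input node
    weights -- weights[i][j] is the weight from input node i to hidden node j
    bias    -- single bias value added to the total
    """
    total = 0.0
    for i, value in enumerate(inputs):
        for j in range(len(weights[i])):
            total += value * weights[i][j]
    return total + bias


# Hypothetical values: 2 inputs and 3 hidden nodes, as in the simple example.
inputs = [1.0, 2.0]
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.0, -0.5]]
bias = 0.25

y_hat = perceptron_output(inputs, weights, bias)
print(y_hat)
```

The two nested loops play the role of the two summation signs, so the same function works unchanged for any number of inputs and hidden nodes.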


The weights and bias of a trained model can be learned by the algorithm through repeated applications of a process called backpropagation. To train a model using backpropagation, random values are initially used for the weights. For supervised learning, data that has been labeled with output values known to be valid is used. Using this training data as inputs, the output is calculated using the random weights. The output that is generated is compared to the labels in the training data using a cost function. The cost function is defined as the sum of the losses for each row of training data, where the loss can be defined as a measure of the difference between that output value and its corresponding label. Loss is measured differently for different applications. The lower the total difference between the outputs and the labels, the lower the value of the cost function.
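As a rough illustration of the cost function described above, the sketch below uses squared error as the loss. That choice is an assumption on my part (loss is measured differently for different applications), and the outputs and labels are made-up numbers.

```python
def squared_error_loss(output, label):
    """Loss for one row: the squared difference between output and label."""
    return (output - label) ** 2


def cost(outputs, labels):
    """Cost: the sum of the per-row losses over all rows of training data."""
    return sum(squared_error_loss(o, t) for o, t in zip(outputs, labels))


# Hypothetical labels and two hypothetical sets of model outputs.
labels = [1.0, 0.0, 1.0]
poor_outputs = [0.0, 1.0, 0.0]     # far from the labels
better_outputs = [0.5, 0.5, 1.0]   # closer to the labels

poor_cost = cost(poor_outputs, labels)
better_cost = cost(better_outputs, labels)
print(poor_cost, better_cost)
```

The outputs that sit closer to the labels produce the lower cost, which is the property training relies on.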

To improve the predictive value of the model, the weights must be altered to reduce the value of the cost function. Backpropagation describes going back through the algorithm, and figuring out how to reduce the cost function by changing the weights. Training describes going back and forth through the algorithm, calculating outputs based on one set of weights, and then going back (backpropagating) to further reduce the cost function by changing the weights, then calculating outputs with those weights, and so forth.

As the cost function is continuous and differentiable, calculus can be used to calculate the partial derivative of the cost function with respect to each weight in the matrix of weights. These partial derivatives, along with a learning rate, are used to calculate a new value for each weight that would lead to a lower value of the cost function (at least, when used in the context of the current set of weights, which all change as soon as all the partial derivatives have been calculated).
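Here is a small sketch of one such update, assuming a squared-error cost and the simple model from earlier (every input times every weight, plus a bias, with no activation function). With those assumptions the partial derivatives can be written by hand. The data, starting weights, and learning rate are invented for the example.

```python
def predict(inputs, weights, bias):
    """Output of the simple model: all input-weight products plus the bias."""
    return sum(x * w for x, row in zip(inputs, weights) for w in row) + bias


def cost(rows, labels, weights, bias):
    """Sum of squared differences between outputs and labels."""
    return sum((predict(x, weights, bias) - t) ** 2
               for x, t in zip(rows, labels))


def gradient_step(rows, labels, weights, bias, learning_rate):
    """Return new (weights, bias) after one update along the negative gradient."""
    grad_w = [[0.0 for _ in row] for row in weights]
    grad_b = 0.0
    for x, t in zip(rows, labels):
        error = predict(x, weights, bias) - t
        for i in range(len(weights)):
            for j in range(len(weights[i])):
                # d(cost)/d(w[i][j]) for squared error: 2 * error * input i
                grad_w[i][j] += 2 * error * x[i]
        grad_b += 2 * error  # d(cost)/d(bias)
    new_weights = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b
    return new_weights, new_bias


# Hypothetical training data and starting point.
rows = [[1.0, 2.0], [2.0, 1.0], [0.0, 1.0]]
labels = [3.0, 3.0, 1.0]
weights = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
bias = 0.0

old_cost = cost(rows, labels, weights, bias)
weights, bias = gradient_step(rows, labels, weights, bias, learning_rate=0.01)
new_cost = cost(rows, labels, weights, bias)
print(old_cost, new_cost)
```

All the partial derivatives are computed from the old weights before any weight is changed, which matches the caveat in the paragraph above. A learning rate that is too large can overshoot and raise the cost instead of lowering it.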

Once the weights have been set to their new values, the training process can begin another epoch by calculating the output again, based on the new weights, and then going through another round of backpropagation. The progress of training as it relates to data not contained in the training set can be monitored using a separate set of labeled data kept aside for cross-validation. Ideally, this process is repeated until the cost function can be reduced no further. At this point, the trained model can be evaluated using test data, which is labeled data that was not used for training or cross-validation. If the trained model’s performance is satisfactory, it can be deployed and used to perform inference, which is to compute meaningful outputs using new unlabeled data as inputs.
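The whole cycle can be sketched as a loop over epochs. To keep it short, this example makes several simplifying assumptions that are mine rather than part of the essay: a model with a single input and a single weight, squared-error cost, weights starting at zero instead of random values so the run is repeatable, and a small tolerance to decide when the cost "can be reduced no further". The three tiny datasets are invented and follow the rule label = 2 * input + 1.

```python
def predict(x, w, b):
    return w * x + b


def cost(data, w, b):
    """Sum of squared errors over a list of (input, label) pairs."""
    return sum((predict(x, w, b) - t) ** 2 for x, t in data)


def train(train_data, val_data, learning_rate=0.01,
          max_epochs=5000, tolerance=1e-9):
    """Repeat forward pass and weight update until the cost stops improving."""
    w, b = 0.0, 0.0  # fixed start for repeatability (an assumption)
    history = []     # (training cost, validation cost) after each epoch
    previous = cost(train_data, w, b)
    for _ in range(max_epochs):
        # Partial derivatives of the cost, both computed from the old weights.
        grad_w = sum(2 * (predict(x, w, b) - t) * x for x, t in train_data)
        grad_b = sum(2 * (predict(x, w, b) - t) for x, t in train_data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = cost(train_data, w, b)
        history.append((current, cost(val_data, w, b)))
        if previous - current < tolerance:
            break  # no meaningful improvement left
        previous = current
    return w, b, history


# Hypothetical labeled data, split three ways.
train_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
val_data = [(4.0, 9.0), (5.0, 11.0)]   # monitored during training
test_data = [(6.0, 13.0)]              # used only once, at the end

w, b, history = train(train_data, val_data)
test_cost = cost(test_data, w, b)
print(w, b, len(history), test_cost)
```

The validation cost is only observed here, never used to update the weights, and the test data is touched once after training ends.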


Hadoop, MapReduce and Spark

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores 43 on the Flesch Reading Ease scale, which is not bad for such technical material.


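Hadoop is sometimes expanded as “High Availability Distributed Object-Oriented Platform”, although the name is not originally an acronym: co-creator Doug Cutting named the project after his son’s toy elephant.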

Apache Hadoop is a collection of open-source software utilities that allows software applications to distribute both processing and storage across a network of commodity computers — computers that are easily available and cheap. This allows applications to process greater volumes of data, at much higher speed, than even a very expensive single computer could. Hadoop’s architecture makes it scalable in the extreme. Thousands of computers can be recruited for quick processing of vast volumes of data.

Hadoop Architecture

The essence of the architecture of Hadoop is that, even though data and processing are distributed across many computers, the processing of any given segment of data takes place in the same computer where that data is stored. This eliminates the onerous latency involved with processing data that is stored in different computers on the network.

A Hadoop cluster is made up of nodes. So that data can be processed where it is stored, each node participates in both the storage of data, and in the processing of that data. Every node has two layers: an HDFS (Hadoop Distributed File System) layer for storage, and a MapReduce layer for processing.

A node can be either a master node or a slave node. A master node is responsible for distributing processing logic and data to the slave nodes, co-ordinating processing, and executing contingency plans if slave nodes fail during processing. To co-ordinate processing, master nodes have a JobTracker in their MapReduce layer. The JobTracker co-ordinates processing with TaskTrackers in the MapReduce layers of every node in the cluster. Master nodes also have a NameNode in their HDFS layer, which co-ordinates storage with DataNodes in the HDFS layers of all nodes in the cluster.

The primary interface to a Hadoop cluster is a JobClient, which allows users to submit jobs to the cluster and track their progress. The JobClient creates the data splits, submits the job to the JobTracker, monitors progress, and writes the final output to the Job Output directory.

A number of components have been created to assist developers in creating programs that harness Hadoop’s MapReduce functionality without having to interface with the actual Java code that Hadoop is implemented with. Two examples are Apache Pig and Apache Hive, and there are many more. Pig is a procedural language suited for programming and semi-structured data. Hive uses a declarative language that is familiar to SQL experts and is best suited to structured data and reporting. Architecturally speaking, these components, and others which provide an abstraction layer, are said to “sit on top of” Hadoop.


MapReduce is a framework, or programming model, for parallelizing the processing of very large datasets using large networks of computers. The name MapReduce was once a Google property, though it is now used generically. An important component of Apache Hadoop is a robust, powerful, and popular implementation of MapReduce.

MapReduce splits the input data into independent chunks. At the direction of the JobTracker in the master node, these chunks are processed in parallel by all the TaskTrackers in the cluster. These smaller parallel tasks are known as map tasks. The output from these map tasks serves as input for the reduce tasks, which produce the same outcome as if the entire task had been performed without parallelization.

More simply, MapReduce is a tool that allows data processing tasks to be divided into chunks which are processed in parallel on many computers, and then processes the results of those partial tasks into a unified whole.
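The classic way to show this is a word count. The sketch below is plain Python that imitates the map, group, and reduce phases on one machine. It is not Hadoop code, and the function names and the three input chunks are made up for the example.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key across the output of every map task."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input, already split into independent chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox jumps"]

mapped = [map_task(chunk) for chunk in chunks]  # on a cluster: run in parallel
grouped = shuffle(mapped)
word_counts = dict(reduce_task(key, values) for key, values in grouped.items())
print(word_counts)
```

Each call to map_task only needs its own chunk, which is why the map phase can be spread across many computers. The counts come out the same as if the whole text had been counted in one pass.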

In versions of Hadoop prior to Hadoop 2.0, MapReduce managed all requirements for resource allocation, job scheduling, and computation. Starting in Hadoop 2.0, a system called YARN (Yet Another Resource Negotiator) was introduced to allocate and schedule computational resources for MapReduce jobs.

Apache Spark

Spark is a computing engine for parallel processing of Big Data. It supports a number of programming languages, including Python, Java, and R. It includes libraries for a number of popular technologies, including SQL (Spark SQL), data streaming (Structured Streaming), machine learning (MLlib), and graph analytics (GraphX). It can be run in a single computer, or on a cloud-based cluster with thousands of computers.

While Spark can be run as a standalone application without Hadoop, Spark does not include a storage mechanism. To use Spark in a cluster, it must be used in combination with a distributed storage mechanism. The most common approach is to use Spark with Hadoop YARN, which allows Spark to use Hadoop’s HDFS distributed storage mechanism. Another popular option is to use Spark with the Apache Mesos cluster manager and Amazon S3 storage technology.

One of the key advantages of Apache Spark over Hadoop MapReduce is speed. While MapReduce reads and writes to and from disk, Spark stores intermediate data in active memory. This allows it to process data up to 100 times faster than MapReduce for some tasks.
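The difference can be imitated on a single machine. The toy sketch below is only an analogy, not Spark or Hadoop code: one pipeline writes its intermediate result to a temporary file and reads it back, the way MapReduce passes data between stages through disk, while the other hands the intermediate result straight to the next stage in memory. All names and numbers are hypothetical.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First processing stage: square every number."""
    return [n * n for n in numbers]


def stage_two(values):
    """Second processing stage: add everything up."""
    return sum(values)


def pipeline_via_disk(numbers):
    """Write the intermediate result to disk, then read it back for stage two."""
    intermediate = stage_one(numbers)
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(intermediate, f)
        path = f.name
    try:
        with open(path) as f:
            reloaded = json.load(f)
    finally:
        os.remove(path)  # clean up the temporary file
    return stage_two(reloaded)


def pipeline_in_memory(numbers):
    """Keep the intermediate result in memory between the two stages."""
    return stage_two(stage_one(numbers))


numbers = [1, 2, 3, 4]
disk_result = pipeline_via_disk(numbers)
memory_result = pipeline_in_memory(numbers)
print(disk_result, memory_result)
```

Both pipelines give the same answer. The in-memory version simply skips the write and the read, and that saved input and output is where the speed advantage comes from when there are many stages.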

However, Spark’s use of active computer memory gives rise to its most common shortcoming — it can get bogged down by data when it runs out of active memory and has to go to the disk for storage. For extremely large datasets which can exceed the memory capacities of a cluster, Hadoop may be the better option.

Some of Spark’s other key advantages include the ability to process incoming data in real time using Spark Streaming, a built-in machine learning library (MLlib), and the ability to provide real-time data insights due to its in-memory processing.


Apache Hadoop is an open-source system that enables applications to parallelize the processing of very large datasets across many computers. HDFS manages data storage, and MapReduce orchestrates the distributed processing. To reduce network latency, any given subset of data is processed in the same node where it is stored. Apache Spark is a computing engine that can run on Hadoop and potentially speed up processing over MapReduce through dynamic resource negotiation and the use of active in-memory storage of intermediate results.



Metaphysics and Machine Learning.

Welcome to Metaphysics and Machine Learning!

Brand new site. More coming soon!

10 Machine Learning Ethics Mini-Essays


10 Mini-Essays on Machine Learning Ethics

The following is an assignment I did at York University, sliced into a series of blog entries. The structure has been adjusted to suit the blog format, but the words are unchanged.

I had way more fun than I should have with this assignment, and even wrote each mini-essay in a different style. (Because some of them are very technical, Flesch reading scores range from 22 to 64. This introduction scores a 68.) This was the only “straight writing” we did in the whole course; the rest of the writing was about our projects.

These 10 mini-essays together were worth 10% of the course mark. I very deliberately spent one hour on each. Despite my irreverent tone, and excessive focus on my own experience, I did score an “A” grade for the assignment, if only barely.


1. Unemployment.

What happens after the end of jobs?

I have mixed feelings about making jobs obsolete. When talk turns to “Look at trucking: it currently employs millions of individuals in the United States alone,” (quoted from original assignment text) I like to invoke the image of millions of individuals who lost their jobs tilling fields with oxen. Seriously, every single person whose survival is not tied to tilling fields with oxen is better off, whether they’re making the most of it or not. We have been automating jobs for a very, very long time. I was on a software development team in the 1990s that probably put many chemical, paper, and printing workers out of their jobs, by creating the first web-based system that allowed remote digital soft proofing of print jobs. A great advancement in efficiency! The advent of consumer digital photography arrived soon after, wiping out even more of those jobs related to the processing of film. I certainly don’t feel bad about all the chemicals that are no longer manufactured and set loose on the World. Or all that paper. But do I feel bad for those workers? Maybe I do. Did they live in a place where their financial security and modern skills re-training were assured by the country to which they had been paying taxes? HA! They probably had to get jobs driving trucks.

The hope is, as we automate jobs, we can move on to bigger and better things. We don’t till fields with oxen, we drive trucks. We don’t drive trucks, we’re YouTubers. The YouTubers of today have no idea what will be a hot job when AIs become the most efficient purveyors of video content. We certainly know that Machine Learning Specialist is going to be pretty hot in the coming years. But how good at preparing people for the future is our society now? In What Happens if Robots Take The Jobs? The Impact of emerging technologies on employment and public policy[1], Darrell West advances a number of ideas that address the need for society to adapt to disruptive technological change. “There needs to be ways for people to live fulfilling lives even if society needs relatively few workers,” West writes. One recommendation is “retraining accounts”, which are publicly funded accounts for funding of re-training. This approach purports to offer the availability of free education, without as much potential for people becoming full-time students and not returning to the workforce.

Also important is Curricular Reform. In an age of constant change, it is important for school boards to rapidly adapt to the changing demands of the job market. I have witnessed this myself – my 12-year-old daughter is making websites in her Grade 7 class. Once the domain of specialists, technology now allows almost anyone with creative spirit to perform this task. This is part of the process by which technology turns into progress. As once we learned to use machines to till our fields and truck our vegetables, we now learn to use machines to publish and distribute our ideas to the world. Finally, West speaks of an “artisanal economy”, in which mundane tasks such as driving and plowing with oxen are performed by machines, while humans participate in the supply and demand of art, culinary delights, music, research, websites, YouTube videos, exploration, and the like. Sounds very Utopian. But I fear we will need more than re-training accounts and modernized curricula to get there.



2. Inequality.

How do we distribute the wealth created by machines?

One simple, effective, but unpalatable solution to this would be to tax the owners of the machines that do the jobs formerly done by humans, and then provide a basic minimum income to all the humans. AI or no AI, wealth inequality and a lack of access to quality education are serious issues already. As our ability to automate human activity accelerates, so too does the impact of these problems. Education is very difficult for people struggling to pay the bills. In an ideal world, or a well-run country, there would be thousands of people in programs just like York’s Machine Learning Certificate, and they would be able to focus on the programs, rather than scrambling full-time to keep their lives together and doing what they can at school in the meantime. And so I believe, in addition to the educational improvements outlined in Question #1, Unemployment, it’s really time for society to disrupt the disrupting power of disruptive technologies by implementing a Basic Minimum Income.

Unfortunately, the political resistance to this idea is very strong. Ontario was going to do experiments with Basic Minimum Income, but our habitual lurching from governing party to governing party has left that experiment on the cutting room floor. People, especially in North America, don’t like the idea of giving people money for free. Or, the idea of “stealing” tax money from the job creators. But at some point, we must recognize that their primary role is not that of job creators. They create jobs if they absolutely have to. If they can automate instead, rest assured, for the benefit of the shareholders, they will. As robots and AIs do more and more of the work, and fewer and fewer people are needed to do these tasks, it seems reasonable that the robots, or their owners, are rewarded a little less for the service they no longer provide to society. Charles Kenny looks at ethical and practical issues surrounding Basic Minimum Income in Give Poor People Cash[2]. Central to his thesis, and that of others who have studied this and other social programs, is that it is the most efficient, flexible, and productive way to deliver a social safety net. Basic Minimum Income completely avoids the inefficiencies and frauds associated with benefits programs that are targeted, controlled, or conditional. It also catalyzes economic growth through the flexibility it brings to how the money is spent, or invested, into the local economy. Including, of course, allowing people to focus more on education and become more productive members of society.



3. Humanity.

How do machines affect our behavior and interaction?

I found this to be the most difficult question of this assignment. Or, perhaps, the one that hit the closest to home. My background is very diverse, but often revolves around affecting behaviour and interaction with machines. I’m a “user experience (UX) designer”, and I have been one since long before there was such a term. I’ve also had a bit of a crush on AI since I was a teenager. My first attempt at re-programming an ELIZA script to mess with people was back in 1986, and I’ve done it numerous times since. But only recreationally. I was heavily involved in Google AdWords some years back, and I had very mixed feelings about what I was doing. On one hand, I may well have saved some very worthwhile companies by using their marketing budget in a powerful new way few people had figured out yet. On the other hand, I felt like all I was doing was getting people to spend their money and behave in a way that I could reasonably predict and take advantage of. I could tell I was really interested in the tools under the hood that were making it possible, but just harnessing those tools to get people to spend money in a particular way, as though they were rats in a maze, was not exactly my idea of a good time. I did a lot of A/B testing in order to manipulate my clients’ customer bases more effectively. One step further down that road, and I’d have been making click-bait and selling democracy out to whoever wanted to pay me to do it. So like I say, this question hits close to home. And I did not know where to start.

And then Andrew Ng’s newsletter, The Batch[3], arrived in my Inbox. I’m a real sucker for Andrew Ng, but when he writes stuff like this, I almost start feeling like a fanboi:

I wrote about ethics last week, and the difficulty of distilling ethical AI engineering into a few actionable principles. Marie Kondo, the famous expert on de-cluttering homes, teaches that if an item doesn’t spark joy, then you should throw it out. When building AI systems, should we think about whether we’re bringing joy to others? This leaves plenty of room for interpretation. I find joy in hard work, helping others, increasing humanity’s efficiency, and learning. I don’t find joy in addictive digital products. I don’t expect everyone to have the same values, but perhaps you will find this a useful heuristic for navigating the complicated decision of what to work on: Is your ML project bringing others joy? This isn’t the whole answer, but I find it a useful initial filter.

Andrew Ng

This definitely works for me. And if I have the power to choose, this is one of the things I will choose. I have often gone in this direction. My greatest personal satisfaction from 25 years of working with the web was to create an interactive website that taught millions of people to play ukulele. How much joy is that? Though, I did sully it by monetizing it with AdSense, treating my own students like rats in a maze. Come to think of it, my very first machine learning project was 100% joy oriented. I had just finished watching Mr. Ng’s entire Deep Learning series. I went over to and executed my very first line of Python ever. I stayed up late, picking through the layers of a pre-trained VGG-19 convnet to get just the style matrices that would let me make what I wanted. My brother-in-law is an artist, and I wanted to bring him joy, by getting a machine to paint him in the style of one of his paintings. I made these[4][5]:

I had used machine learning to bring my family joy. Onward and upward!


[4] [inline]

[5] [inline]


4. Artificial Stupidity.

How can we guard against mistakes?

In order to guard against mistakes, any system must have robust human oversight. This idea will be explored in Question #7. While guarding against mistakes is always critical, it is essential that we learn what Artificial Stupidity is, what conditions or processes lead to its creation, and how we can avoid wasting untold human potential upstream and downstream from its creation. Squandering resources on an Artificial Stupidity would be a mistake. Setting one loose and letting people live with the consequences would be another mistake. In an interview about the book Rebooting AI[6], Gary Marcus says

But right now AI is dangerous, and not in the way that Elon Musk is worried about. But in the way of job interview systems that discriminate against women no matter what the programmers do because the techniques that they use are too unsophisticated. I want us to have better AI. I don’t want us to have an AI winter where people realize this stuff doesn’t work and is dangerous, and they don’t do anything about it.

Gary Marcus

Marcus believes that Classical AI, which is more of a rules-based framework for building cognitive models, can play a role in transcending Artificial Stupidity. “The machine-learning stuff is pretty good at learning from data, but it’s very poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches.” An AI system that was capable of understanding when its own decisions were going off the rails because of a subtle shift in the data would still require human oversight. But it would require less intervention and fewer mistakes would be made.



5. Racist Robots.

How do we eliminate AI bias?

The problem of human bias expressing itself through machine learning algorithms requires deliberate intervention to minimize its impact. Because ML systems are trained using data generated through human activity, the biases expressed in that human activity express themselves automatically in any system that trains on that data. Google recently disbanded its AI Ethics Board, after employees protested the board’s inclusion of Heritage Foundation president Kay Coles James[7]. Ms. James and the Heritage Foundation have a hateful agenda. One can easily imagine Ms. James, among other things, obstructing any guidelines advocating the deliberate removal of human bias from ML systems on the basis of Divine Provenance or Data Sovereignty or Freedom from Censorship or Sick of Political Correctness or some nonsense like that. All of which is to say, bias doesn’t just exist in data, we must also remain watchful for groups and individuals who work towards ensuring that biases remain in important systems. Or, as data scientist Cathy O’Neil says in the TED Talk that is part of our course material[8],

Data laundering. It’s a process whereby technologists hide ugly truths inside black-box algorithms, and call them objective. Call them meritocratic.

I did not understand at first, why Google’s employees wanted this board shut down. The more I read about the characters who had infiltrated it, the more it became clear. As Ms O’Neil concludes,

Data scientists, we should not be the arbiters of the Truth. We should be translators of ethical discussions that happen in larger society.

My best understanding so far of eliminating bias from ML systems comes from the field of Natural Language Processing. Language processing algorithms are trained on vast amounts of real human language usage. Bias that occurs in a corpus of text will be incorporated into a word embedding, and that bias will be inherited by any machine learning algorithm that trains using that word embedding. The most effective way to remove this bias is to modify the word embedding itself, identifying words that suffer from bias, and mathematically eliminating or reducing that bias as much as possible. I saw an excellent description of how to remove gender bias from a word embedding, in the Andrew Ng lectures. Words like male, female, boy, girl, King, and Queen do not need gender bias removed, because they are gender specific. Words like doctor, nurse, engineer, strong, and attractive, on the other hand, will often suffer from a biased gender correlation in any corpus of text. We can “move” these words in our word embedding, to the nearest point that is equidistant from male and female. By manually selecting which words will be treated this way, we can retain the gender distinction between words like King and Queen, while hopefully erasing any gender correlation with words like doctor and nurse. While this may seem like an artificial, heuristic, or hackish approach, it is the kind of intervention that is required in a world where our data will be biased. The same would be true of training an algorithm to predict re-offending criminals, or to process job applications, or to identify human beings walking in traffic. We have to be curious about and aware of the biases in the data we are using, we have to be willing to remove that bias even if we face an opposing agenda, and where it is not possible to de-bias the data itself, we must be aware of the biases that do exist, and then deliberately and mathematically eliminate them from model training, to whatever extent possible.
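The “move” step can be sketched with a little geometry. This is my own literal reading of “the nearest point that is equidistant from male and female”, not necessarily the exact procedure from the lectures: the word vector is shifted along the male-female direction until it sits on the plane halfway between the two. The three-dimensional vectors are invented for the example; real word embeddings have hundreds of dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_equidistant_point(word, anchor_a, anchor_b):
    """Move `word` to the closest point equally far from both anchors.

    The points equidistant from the anchors form the plane through their
    midpoint, perpendicular to the direction between them. Removing the
    component of (word - midpoint) along that direction lands on the plane.
    """
    direction = [a - b for a, b in zip(anchor_a, anchor_b)]
    midpoint = [(a + b) / 2 for a, b in zip(anchor_a, anchor_b)]
    offset = [w - m for w, m in zip(word, midpoint)]
    scale = dot(offset, direction) / dot(direction, direction)
    return [w - scale * d for w, d in zip(word, direction)]


# Hypothetical 3-dimensional embeddings.
male = [1.0, 0.0, 0.5]
female = [-1.0, 0.0, 0.5]
doctor = [0.5, 2.0, 1.0]   # leans towards "male" along the first axis

doctor_debiased = nearest_equidistant_point(doctor, male, female)
print(doctor_debiased)
print(squared_distance(doctor_debiased, male),
      squared_distance(doctor_debiased, female))
```

Only the component along the male-female direction changes, so whatever else the vector encodes is left alone. Words like King and Queen would simply not be passed through this function.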




6. Security.

How do we keep AI safe from adversaries?

The only thing that can stop a bad guy with an AI is a good guy with an AI.

AIs don’t commit crimes – but bad people can use AIs to commit crimes. Law enforcement agencies, including Canada’s own Royal Canadian Mounted Police, are turning to Artificial Intelligence to detect crime[9]. The Canadian Security Intelligence Service works to protect Canada from the emerging threat of state-sponsored espionage powered by AI[10]. We must have strong laws and international treaties that allow our intelligence services, our law enforcement agencies and our justice system to discourage, detect, disrupt, and prosecute the crimes of the future.