Hadoop, MapReduce and Spark

A short essay I prepared for the course “Machine Learning at Scale” at York University. This text scores 43 on the Flesch Reading Ease scale, which is not bad for such technical material.

Hadoop

Hadoop is often said to stand for High Availability Distributed Object-Oriented Platform, although the name is not actually an acronym: it comes from a toy elephant that belonged to the son of co-creator Doug Cutting.

Apache Hadoop is a collection of open-source software utilities that allows software applications to distribute both processing and storage across a network of commodity computers — computers that are easily available and cheap. This allows applications to process greater volumes of data, at much higher speed, than even a very expensive single computer could. Hadoop’s architecture makes it scalable in the extreme. Thousands of computers can be recruited for quick processing of vast volumes of data.

Hadoop Architecture

The essence of Hadoop's architecture is that, even though data and processing are distributed across many computers, any given segment of data is processed on the same computer where it is stored. This avoids the heavy network latency that would be incurred if data had to be moved between computers for processing.

A Hadoop cluster is made up of nodes. So that data can be processed where it is stored, each node participates in both the storage of data and the processing of that data. Every node has two layers: an HDFS (Hadoop Distributed File System) layer for storage, and a MapReduce layer for processing.

A node can be either a master node or a slave node. A master node is responsible for distributing processing logic and data to the slave nodes, co-ordinating processing, and executing contingency plans if slave nodes fail during processing. To co-ordinate processing, master nodes run a JobTracker in their MapReduce layer. The JobTracker co-ordinates processing with the TaskTrackers in the MapReduce layers of the slave nodes. Master nodes also run a NameNode in their HDFS layer, which co-ordinates storage with the DataNodes in the HDFS layers of the slave nodes.

The primary interface to a Hadoop cluster is a JobClient, which allows users to submit jobs to the cluster and track their progress. The JobClient creates the data splits, submits the job to the JobTracker, monitors progress, and writes the final output to the Job Output directory.

A number of components have been created to help developers harness Hadoop’s MapReduce functionality without having to write Java code against Hadoop’s native API. Two examples are Apache Pig and Apache Hive, and there are many more. Pig provides a procedural language that suits programmers working with semi-structured data. Hive provides a declarative, SQL-like language that is familiar to SQL users and is best suited to structured data and reporting. Architecturally speaking, these components, and others that provide an abstraction layer, are said to “sit on top of” Hadoop.

MapReduce

MapReduce is a framework, or programming model, for parallelizing the processing of very large datasets using large networks of computers. The name MapReduce originally referred to Google’s proprietary technology, though it is now used generically. An important component of Apache Hadoop is a robust, powerful, and popular implementation of MapReduce.

MapReduce splits the input data into independent chunks. At the direction of the JobTracker in the master node, these chunks are processed in parallel by the TaskTrackers in the cluster. These smaller parallel tasks are known as map tasks. The output from these map tasks serves as input for the reduce tasks, which produce the same outcome as if the entire task had been performed without parallelization.

More simply, MapReduce is a tool that divides a data processing task into chunks which are processed in parallel on many computers, and then combines the results of those partial tasks into a unified whole.
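
To make the flow concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It only imitates the map, shuffle, and reduce phases with ordinary data structures; the input “chunks” stand in for the data splits a real cluster would distribute across nodes.

    from collections import defaultdict

    # Toy word count illustrating the MapReduce flow on a single machine.
    # In Hadoop, each chunk would be handled by a TaskTracker on the node
    # that stores it; here the "chunks" are just strings in a Python list.
    chunks = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Map phase: each chunk independently emits (key, value) pairs.
    mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

    # Shuffle phase: group values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key into a final result.
    reduced = {word: sum(counts) for word, counts in grouped.items()}
    print(reduced)   # {'the': 3, 'quick': 2, ...}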

In versions of Hadoop prior to Hadoop 2.0, MapReduce managed all requirements for resource allocation, job scheduling, and computation. Starting in Hadoop 2.0, a system called YARN (Yet Another Resource Negotiator) was introduced to allocate and schedule computational resources for MapReduce jobs.

Apache Spark

Spark is a computing engine for parallel processing of Big Data. It supports a number of programming languages, including Python, Java, and R. It includes libraries for a number of popular technologies, including SQL (Spark SQL), data streaming (Structured Streaming), machine learning (MLlib), and graph analytics (GraphX). It can run on a single computer, or on a cloud-based cluster with thousands of computers.
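
As a small illustration, the following PySpark sketch counts words in a text file using the DataFrame API. The file name is a placeholder; any text source readable by the Spark session would do.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    # "input.txt" is a placeholder path; each line of the file becomes a row
    # with a single string column named "value".
    lines = spark.read.text("input.txt")

    counts = (lines
              .select(explode(split(col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count())

    counts.show(10)   # print the first ten (word, count) pairs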

While Spark can be run as a standalone application without Hadoop, Spark does not include a storage mechanism. To use Spark in a cluster, it must be used in combination with a distributed storage mechanism. The most common approach is to use Spark with Hadoop YARN, which allows Spark to use Hadoop’s HDFS distributed storage mechanism. Another popular option is to use Spark with the Apache Mesos cluster manager and Amazon S3 storage technology.
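
From the application side, that combination might look like the sketch below, assuming the job has been submitted to a YARN cluster (for example with spark-submit --master yarn) and that an HDFS path like the one shown actually exists.

    from pyspark.sql import SparkSession

    # Assumes the application was submitted to a YARN cluster, e.g. with
    # `spark-submit --master yarn`; the HDFS path below is illustrative only.
    spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

    # The path is resolved against the cluster's HDFS, so the data is read
    # from DataNodes local to the executors wherever possible.
    logs = spark.read.text("hdfs:///data/logs/")
    print(logs.count())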

One of the key advantages of Apache Spark over Hadoop MapReduce is speed. While MapReduce reads and writes to and from disk, Spark stores intermediate data in active memory. This allows it to process data up to 100 times faster than MapReduce for some tasks.
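
In code, this difference shows up mainly through explicit caching: marking a dataset to be kept in memory lets later actions reuse it instead of re-reading it from disk. A brief sketch, using a hypothetical Parquet file:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    # "events.parquet" is a hypothetical dataset; .cache() asks Spark to keep
    # it in memory after the first action so later queries skip the disk read.
    events = spark.read.parquet("events.parquet").cache()

    events.filter("status = 'error'").count()   # first action materialises the cache
    events.groupBy("status").count().show()     # reuses the in-memory copy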

However, Spark’s use of active memory gives rise to its most common shortcoming: it slows down sharply when a job exceeds the available memory and intermediate data must spill to disk. For extremely large datasets that exceed the memory capacity of a cluster, Hadoop MapReduce may be the better option.

Some of Spark’s other key advantages include the ability to process incoming data in real time using Spark Streaming, a built-in machine learning library (MLlib), and the ability to deliver real-time data insights thanks to its in-memory processing.
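
As a brief taste of MLlib, the sketch below fits a logistic regression model on a hypothetical CSV file with numeric feature columns and a binary label column; the file name and column names are assumptions made for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # Hypothetical training data: numeric columns x1..x3 plus a 0/1 "label" column.
    df = spark.read.csv("training.csv", header=True, inferSchema=True)

    # MLlib estimators expect the features assembled into a single vector column.
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    print(model.summary.areaUnderROC)   # training-set AUC from the fitted summary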

Summary

Apache Hadoop is an open-source system that enables applications to parallelize the processing of very large datasets across many computers. HDFS manages data storage, and MapReduce orchestrates the distributed processing. To reduce network latency, any given subset of data is processed in the same node where it is stored. Apache Spark is a computing engine that can run on Hadoop and can speed up processing considerably over MapReduce by negotiating resources through YARN and keeping intermediate results in active memory.