I’ve been wanting to get more involved with data engineering for a while, so I’ve started reading through Spark: The Definitive Guide. These are my notes on Part I, the introductory overview.
Part I. Gentle Overview of Big Data and Spark
Chapter 1. What is Apache Spark?
- Spark unifies libraries and utilities for the various needs of working with large data, e.g. interactive data science, data engineering, and ETL.
- Purely a computation engine; does not store data itself. Integrates with many data stores. This is a major difference from Hadoop, which had its own file system, HDFS, for storing data.
- Operates on data in memory, moves computations to where the data already is instead of moving the data around.
- Supports third-party libraries.
- Big motivator for Spark’s computation model: CPUs stopped getting faster and the focus moved to parallelism through multiple cores. Data storage, however, kept getting cheaper and faster, so there was more and more data, and the only way to work with it was across multiple machines in parallel.
- The original team at UC Berkeley who worked on Spark founded a company called Databricks.
Chapter 2. A Gentle Introduction to Spark
- Spark runs on a cluster, a group of machines. It manages and coordinates execution of tasks on data across the cluster.
- The cluster can be managed by Spark’s built-in standalone cluster manager, by YARN, or by Mesos.
- Spark applications have driver and executor processes.
- The driver runs the main function from a single node, maintains information about the Spark application, responds to user code, and manages work on executors.
- The executors just carry out work assigned to them by the driver and report results.
- Spark driver programs can be written in many languages, which are translated internally into Spark calls for the executors to run:
- Scala: the implementation language of Spark.
- Java: through JVM interop, you can write Spark code in Java.
- Python: PySpark supports nearly all of the Scala API. Runs in an external process and communicates with the JVM process somehow (not explained in the book).
- SQL: A subset of ANSI SQL 2003.
- R: through SparkR in Spark core, or the community-contributed sparklyr. Also runs externally and communicates with the JVM (somehow).
- Each language makes a Spark session available to run code.
- Spark session: sends user commands to Spark. Each Spark application has one and only one Spark session.
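As a minimal sketch (the app name and local master are placeholder choices), getting hold of the session in Scala looks like this:

```scala
import org.apache.spark.sql.SparkSession

// The SparkSession is the entry point; one session per Spark application
val spark = SparkSession.builder()
  .appName("gentle-overview-notes")  // arbitrary name, shows up in the UI
  .master("local[*]")                // run locally on all available cores
  .getOrCreate()
```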
- Spark has two fundamental sets of APIs: lower-level “unstructured” (based around Resilient Distributed Datasets, AKA RDDs) and higher-level “structured” (based around datasets and dataframes).
- Dataframes: the most common structured API.
- Represents a table with rows and columns.
- Has a schema, which can be declared or inferred.
- One dataframe can span many machines. A section of a dataframe (i.e. a subset of the dataframe’s rows) living on a machine is called a partition.
- Dataframe partitions are not manipulated individually; if you need that level of control, you can use resilient distributed datasets.
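A sketch, assuming the `spark` session from the earlier snippet; the book’s running example is a dataframe of numbers, whose schema comes from `range`:

```scala
// A one-column dataframe of 500 rows, split into partitions across the cluster
val df = spark.range(500).toDF("number")

df.printSchema()           // number: long
df.rdd.getNumPartitions    // how many partitions the rows are split into
```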
- Spark’s core data structures are immutable, so transformations produce new dataframes.
- Transformations are lazily evaluated, triggered only when an action is executed. Until then they’re just recorded, which allows for optimizations such as predicate pushdown, where a filter specified late in the plan is pushed earlier so that less data needs to be transformed.
- Some transformations are narrow-dependency—each input partition contributes to a single output partition. This means the output data can just stay on the same machine as the input partition.
- E.g. a map that multiplies each value by 2 (`map(_ * 2)` in Scala); you can transform each individual row of the input partition into a corresponding row in the output partition without moving the rows between partitions.
- Others are wide-dependency—each input partition contributes to multiple output partitions. This is also called a shuffle since partitions are broken apart and the pieces are exchanged across the nodes of a cluster.
- E.g. a sort; if the input data is randomly ordered, you could end up needing to pull apart a partition and send its rows off to multiple other machines to get everything in the right order.
- Narrow-dependency transformations can be done purely in memory. Wide-dependency transformations (shuffles) write results to disk.
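A sketch of both kinds of transformation on the `df` from the earlier snippet; note that nothing is computed yet, Spark only records a plan:

```scala
import org.apache.spark.sql.functions.desc

// Narrow dependency: each input partition feeds at most one output partition
val evens = df.where("number % 2 = 0")

// Wide dependency (shuffle): a global sort forces rows to move between partitions
val sorted = evens.sort(desc("number"))

// Still lazy: nothing has run; explain() just shows the plan Spark has built
sorted.explain()
```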
- Actions trigger computation of a result from the dataframe by running all the previously applied transformations.
- Three kinds of action:
- View data in a console
- Collect data to a language’s native object types
- Write data to output sinks
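Continuing the sketch above, one action of each kind (the output path is a placeholder):

```scala
sorted.show(5)                   // view data in the console
val top = sorted.take(5)         // collect rows into native Scala objects (an Array of Rows)
sorted.write
  .mode("overwrite")
  .csv("/tmp/sorted-numbers")    // write data to an output sink
```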
- The Spark UI, which shows the state of your Spark jobs, runs on port 4040 of the driver node.
Chapter 3. A Tour of Spark’s Toolset
- `spark-submit` lets you run production applications.
- Datasets: let you bind rows of a dataframe to a class you define instead of a generic row class.
- Datasets are typesafe; thus, they are not available in the dynamically typed Python and R, only in Java and Scala.
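A Scala sketch (e.g. in spark-shell, with the `spark` session available); the Person class and values are made up for illustration:

```scala
case class Person(name: String, age: Long)

import spark.implicits._   // provides the encoders needed for toDS / as[T]

// A Dataset[Person] instead of a dataframe of generic Rows
val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

// Field access is checked at compile time
val adults = people.filter(p => p.age >= 18)

// A dataframe with matching columns can be bound back to the class
val fromDF = people.toDF().as[Person]
```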
- Structured Streaming: lets you use batch-mode operations on streams of input data.
- The data is incrementally processed.
- Use `readStream` to read from a data source as a stream.
- Actions are slightly different; some actions don’t make sense (e.g. `count`: you can never get a final count because new data is constantly coming in). You still specify a write sink, but it might be an in-memory table or some other sink meant for constant incremental updates.
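A sketch of the overall shape, assuming JSON files landing in a directory (the path, schema, and query name are all invented for illustration):

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Streams need their schema declared up front
val eventSchema = StructType(Seq(
  StructField("user", StringType),
  StructField("value", LongType)))

// Read: treat files arriving in the directory as an unbounded stream
val events = spark.readStream
  .schema(eventSchema)
  .option("maxFilesPerTrigger", 1)   // process one file per micro-batch
  .json("/tmp/events/")

// The same batch-style transformation API, applied incrementally
val counts = events.groupBy("user").count()

// Write sink: an in-memory table that is updated as new data arrives
val query = counts.writeStream
  .format("memory")
  .queryName("event_counts")
  .outputMode("complete")
  .start()

// The table can be queried while the stream keeps updating it
spark.sql("SELECT * FROM event_counts").show()
```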
Machine Learning and Advanced Analytics
- MLlib: built-in library of machine learning algorithms and utilities.
- Aids in preprocessing, munging, training of models, and making predictions.
- Every machine learning algorithm has untrained and trained variants (e.g. `KMeans` is untrained, `KMeansModel` is trained). You create the untrained version and then train it to get the trained variant.
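A sketch with k-means on the numeric dataframe from earlier; `VectorAssembler` is one of MLlib’s preprocessing utilities and builds the single vector column the algorithms expect:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Preprocessing: pack raw numeric columns into a "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("number"))
  .setOutputCol("features")
val featureDF = assembler.transform(df)

// KMeans is the untrained algorithm; fit() returns a trained KMeansModel
val kmeans = new KMeans().setK(3)
val model = kmeans.fit(featureDF)

// Use the trained model to make predictions (assign each row to a cluster)
model.transform(featureDF).show(5)
```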
- RDDs (resilient distributed datasets) underlie basically all of Spark.
- Dataframes are built on RDDs.
- RDDs expose details like individual partitions of data. Unlike dataframes, they also differ between Scala and Python.
- You pretty much never need to use them. There are very limited circumstances where you might want to.
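One of those circumstances is parallelizing raw objects held in driver memory; a sketch, using the `spark` session from earlier:

```scala
// SparkContext is the entry point for the lower-level APIs
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

// Work with raw objects directly, no schema involved
val doubled = rdd.map(_ * 2)

// Lift the result back into the structured APIs
import spark.implicits._
val asDF = doubled.toDF("value")
asDF.show()
```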