Spark: The Definitive Guide Notes. Part I.

I’ve been wanting to get more involved with data engineering for a while, so I’ve started reading through Spark: The Definitive Guide. These are my notes on Part I, the introductory overview.

Part I. Gentle Overview of Big Data and Spark

Chapter 1. What is Apache Spark?

  • Spark unifies libraries and utilities for various needs for working with large data, e.g. interactive data science, data engineering and ETL.
  • Purely a computation engine; does not store data itself. Integrates with many data stores. This is a major difference from Hadoop, which had its own file system, HDFS, for storing data.
  • Operates on data in memory, moves computations to where the data already is instead of moving the data around.
  • Supports third-party libraries.
  • Big motivator for Spark’s computation model: CPUs stopped getting faster and focus moved to parallelism through multiple cores. However, data storage did keep getting cheaper and faster, so there was more and more data, and the only way to work with it was across multiple machines in paralle.
  • The original team at UC Berkeley who worked on Spark founded a company called Databricks.

Chapter 2. A Gentle Introduction to Spark

  • Spark runs on a cluster, a group of machines. It manages and coordinates execution of tasks on data across the cluster.
  • The cluster can be managed by Spark’s built-in cluster manager; YARN; or Mesos.
  • Spark applications have driver and executor processes.
    • The driver runs the main function from a single node, maintains information about the Spark application, responds to user code, and manages work on executors.
    • The executors just carry out work assigned to them by the executor and report results.
  • Spark driver programs can be written in many languages, which are translated internally into Spark calls for the executors to run:
    • Scala: the implementation language of Spark.
    • Java: through JVM interop, you can write Spark code in Java.
    • Python: PySpark supports nearly all of the Scala API. Runs in an external process and communicates to the JVM process somehow (not explained in the book).
    • SQL: A subset of ANSI SQL 2003.
    • R: through SparkR in Spark core, or the community-contributed sparklyr. Also runs externally and communicates to JVM (somehow).
  • Each language makes a Spark session available to run code.
  • Spark session: sends user commands to Spark. Each Spark application has one and only one Spark session.
  • Spark has two fundamental sets of APIs: lower-level “unstructured” (based around Resilient Distributed Datasets, AKA RDDs) and higher-level “structured” (based around datasets and dataframes).

Dataframes

  • Most common structured API.
  • Represents a table with rows and columns.
  • Has a schema, which can be declared or inferred.
  • One dataframe can span many machines. A section of a dataframe (i.e. a subset of the dataframe’s rows) living on a machine is called a partition.
  • Dataframe partitions are not manipulated individually. If you need to manipulate a partition individually you can use resilient distributed datasets.

Transformations

  • Spark’s core data structures are immutable, so transformations produce new dataframes.
  • Transformations are lazily evaluated when an action is executed. Until then, they’re just saved—this allows for optimizations, such as predicate pushdown, an optimization where a filter that would be executed later is executed earlier to reduce the amount of data that would need to be transformed.
  • Some transformations are narrow-dependency—each input partition contributes to a single output partition. This means the output data can just stay on the same machine as the input partition.
  • E.g. map(* 2); you can transform each individual row of the input partition into a corresponding row in the output partition without moving the rows between partitions.
  • Others are wide-dependency—each input partition contributes to multiple output partitions. This is also called a shuffle since partitions are broken apart and the pieces are exchanged across the nodes of a cluster.
  • E.g. a sort; if the input data is randomly ordered, you could end up needing to pull apart a partition and send its rows off to multiple other machines to get everything in the right order.
  • Narrow-dependency transformations can be done purely in memory. Wide-dependency writes to disk.

Actions

  • Triggers computation of a result from the dataframe by running all the previously applied transformations.
  • Three kinds of action:
    • View data in a console
    • Collect data to a language’s native object types
    • Write data to output sinks

Spark UI

  • Runs on port 4040 of the driver node.

Chapter 3. A Tour of Spark’s Toolset

  • spark-submit lets you run production applications.

Datasets

  • Lets you bind rows of a dataframe to a class you define instead of a generic row class.
  • Datasets are typesafe; thus, they are not available in the dynamically typed Python and R, only in Java and Scala.

Structured Streaming

  • Can use batch mode operations on streams of input data.
  • The data is incrementally processed.
  • Use readStream to read from a data source as a stream.
  • Actions are slightly different; some actions don’t make sense (e.g. count; you can never get a meaningful count because new data is constantly coming in). You still specify a write sink but it might be an in-memory table or some other source meant for constant incremental updates.

Machine Learning and Advanced Analytics

  • MLLib: built-in library of machine learning algorithms and utilities.
  • Aids in preprocessing, munging, training of models, and making predictions.
  • Every machine learning algorithm has untrained and trained variants (e.g. KMeans is untrained, KMeansModel is trained). You create the untrained version and then train it to get the trained variant.

Lower-Level APIs

  • RDDs (resilient distributed datasets) underlie basically all of Spark.
  • Dataframes are built in RDDs
  • RDDs expose details like individual partitions of data. They are different between Scala and Python, unlike dataframes.
  • You pretty much never need to use them. There are very limited circumstances where you might want to.

Notes on Modern Operating Systems, Chapter 2, Part 3: Scheduling

This is the last part of Chapter 2 of Modern Operating Systems at long last. I never noticed when I was in college how incredibly long the chapters in these textbooks are; most of the chapters in Modern Operating Systems are around a hundred pages. It was less of an issue with Computer Networking because the chapter was like 75% fluff and examples that one could skim over for understanding, but Modern Operating Systems is very text-dense, so breaking the chapter down into parts so I could sustain a sense of momentum and avoid feeling like I was on an endless slog through the same chapter was of great psychological benefit.

2.5 Scheduling

  • Need to choose which process gets the CPU next when it’s free.

2.5.1 Intro to Scheduling

  • Scheduling is much less important on PCs because most of the time is spent waiting for user input, not on CPU.
  • It’s much more important on batch systems, mainframes, and servers.
  • Process switching is expensive so scheduling must also mind the efficiency of CPU use:
    • Must go into kernel mode
    • Then must save all state of the current process
    • And save the memory map (the memory reference bits in the page table)
    • Then select a new process
    • Then reload the MMU with the memory map of the new process
    • Start the new process
    • And also the entire cache will be invalidated, so you have to reload it.
  • Processes alternate bursts of computing with waits on IO (disk, network, user response). Processes with long bursts of CPU usage are called CPU-bound. Those with short bursts of CPU usage are called I/O-bound.
  • These terms are defined in terms of length of CPU burst, not length of I/O wait, because I/O waits don’t really depend on the process; they’re determined by external factors like the speed of the disk or how long it takes a user to respond.
  • As CPUs get faster, processes become more I/O bound.
  • When scheduling happens:
    • When a process starts
    • When a process exits and a new one needs to start
    • When a process blocks
    • When an I/O interrupt happens—can choose to let the formerly blocked process run now that its I/O is available.
  • Clock interrupts provide an opportunity for scheduling to occur. In preemptive systems, the running process will be suspended and another scheduled after some number of clock interrupts. In non-preemptive systems, clock interrupts are not allowed to interrupt the running process.
  • Batch systems will be non-preemptive; there’s no reason to not finish running a job all at once.
  • Interactive systems will be preemptive. E.g. they need to simulate running a word processor and a web browser at the same time, neither of which have a set exit point and both of which might need to accept user input.
  • Real time systems often don’t use preemption; processes are simply written to exit very quickly.
  • Goals of scheduling algorithms:
    • Fairness. Make sure similar processes are treated similarly.
    • Keep CPU and I/O devices busy.
    • In batch systems:
    • Throughput: number of jobs per unit time.
    • Turnaround time: average time to complete jobs
    • In interactive systems:
    • Response time: make sure user gets a response quickly.
    • Proportionality: conform to users’ perception of how long something should take.
    • In real time systems:
    • Hitting deadlines
    • Predictability

2.5.2 Scheduling in Batch Systems

  • I skipped this because I don’t care about batch systems.

2.5.3 Scheduling in Interactive Systems

  • Generally each process gets a quantum, i.e. a unit of time which it is allowed to run for before the clock interrupts it. This must be chosen carefully. Too short means lots of inefficient process switching. Too long means we waste time running processes that are waiting on I/O.
  • Round Robin: just cycle through all the processes.
  • Priority Scheduling: Each process gets a priority. Choose the highest priority process to run. Degrade priorities as processes run to avoid starvation of lower priority processes.
  • Shortest Process Next: Estimate how long each process will take using a weighted average of the time previous runs took. Choose the one predicted to be shortest.
  • Lottery Scheduling: Give each process lottery tickets. At scheduling time, choose a ticket at random and run the process holding that ticket. Good for allocating fixed percentages of the CPU time; to give a process 25% of the CPU, give it 25% of the available tickets.

2.5.4 Scheduling in Real-Time Systems

  • Real-time systems often have physical devices feeding them data which must be processed on a deadline. E.g. a CD player receives bits from the laser that reads the disc and must translate them into music quickly to produce the correct sound. An autopilot system must respond to the readings of gauges and meters and adjust the plane’s movement quickly enough to account for current conditions.
  • Hard real-time: Missing a deadline is a disaster.
  • Soft real-time: Occasionally missing a deadline is tolerable.
  • The scheduler must decide how to schedule processes to not miss deadlines. However, processes in real-time systems are usually written to exit very quickly.

2.5.5 Policy vs. Mechanism

  • Sometimes the user processes have information that can be used to make better scheduling decisions. The scheduling mechanism (the actual algorithm) can be passed this information to set the scheduling policy.

2.5.6 Thread Scheduling

  • Varies a lot between user and kernel threads.
  • User-level:
    • Scheduler picks a process, which can run whatever threads it wants whenever it wants for whatever reason it wants.
    • Can end up wasting time running blocked threads
    • Can use a custom thread scheduler and less worry about separating process from mechanism.
  • Kernel-level:
    • The kernel just picks a thread and runs it.
    • Requires a full context switch to swap threads
    • Can choose the highest priority thread regardless of what process it belongs to. So if a single process has two threads that are both higher priority than anything else currently running, they can be run back to back instead of waiting for that process to get control of the CPU again.

N.B. Question 50 is #problematic.