Spark: The Definitive Guide Notes. Part I.

I’ve been wanting to get more involved with data engineering for a while, so I’ve started reading through Spark: The Definitive Guide. These are my notes on Part I, the introductory overview.

Part I. Gentle Overview of Big Data and Spark

Chapter 1. What is Apache Spark?

  • Spark unifies libraries and utilities for various needs for working with large data, e.g. interactive data science, data engineering and ETL.
  • Purely a computation engine; does not store data itself. Integrates with many data stores. This is a major difference from Hadoop, which had its own file system, HDFS, for storing data.
  • Operates on data in memory, moves computations to where the data already is instead of moving the data around.
  • Supports third-party libraries.
  • Big motivator for Spark’s computation model: CPUs stopped getting faster and focus moved to parallelism through multiple cores. However, data storage did keep getting cheaper and faster, so there was more and more data, and the only way to work with it was across multiple machines in paralle.
  • The original team at UC Berkeley who worked on Spark founded a company called Databricks.

Chapter 2. A Gentle Introduction to Spark

  • Spark runs on a cluster, a group of machines. It manages and coordinates execution of tasks on data across the cluster.
  • The cluster can be managed by Spark’s built-in cluster manager; YARN; or Mesos.
  • Spark applications have driver and executor processes.
    • The driver runs the main function from a single node, maintains information about the Spark application, responds to user code, and manages work on executors.
    • The executors just carry out work assigned to them by the executor and report results.
  • Spark driver programs can be written in many languages, which are translated internally into Spark calls for the executors to run:
    • Scala: the implementation language of Spark.
    • Java: through JVM interop, you can write Spark code in Java.
    • Python: PySpark supports nearly all of the Scala API. Runs in an external process and communicates to the JVM process somehow (not explained in the book).
    • SQL: A subset of ANSI SQL 2003.
    • R: through SparkR in Spark core, or the community-contributed sparklyr. Also runs externally and communicates to JVM (somehow).
  • Each language makes a Spark session available to run code.
  • Spark session: sends user commands to Spark. Each Spark application has one and only one Spark session.
  • Spark has two fundamental sets of APIs: lower-level “unstructured” (based around Resilient Distributed Datasets, AKA RDDs) and higher-level “structured” (based around datasets and dataframes).


  • Most common structured API.
  • Represents a table with rows and columns.
  • Has a schema, which can be declared or inferred.
  • One dataframe can span many machines. A section of a dataframe (i.e. a subset of the dataframe’s rows) living on a machine is called a partition.
  • Dataframe partitions are not manipulated individually. If you need to manipulate a partition individually you can use resilient distributed datasets.


  • Spark’s core data structures are immutable, so transformations produce new dataframes.
  • Transformations are lazily evaluated when an action is executed. Until then, they’re just saved—this allows for optimizations, such as predicate pushdown, an optimization where a filter that would be executed later is executed earlier to reduce the amount of data that would need to be transformed.
  • Some transformations are narrow-dependency—each input partition contributes to a single output partition. This means the output data can just stay on the same machine as the input partition.
  • E.g. map(* 2); you can transform each individual row of the input partition into a corresponding row in the output partition without moving the rows between partitions.
  • Others are wide-dependency—each input partition contributes to multiple output partitions. This is also called a shuffle since partitions are broken apart and the pieces are exchanged across the nodes of a cluster.
  • E.g. a sort; if the input data is randomly ordered, you could end up needing to pull apart a partition and send its rows off to multiple other machines to get everything in the right order.
  • Narrow-dependency transformations can be done purely in memory. Wide-dependency writes to disk.


  • Triggers computation of a result from the dataframe by running all the previously applied transformations.
  • Three kinds of action:
    • View data in a console
    • Collect data to a language’s native object types
    • Write data to output sinks

Spark UI

  • Runs on port 4040 of the driver node.

Chapter 3. A Tour of Spark’s Toolset

  • spark-submit lets you run production applications.


  • Lets you bind rows of a dataframe to a class you define instead of a generic row class.
  • Datasets are typesafe; thus, they are not available in the dynamically typed Python and R, only in Java and Scala.

Structured Streaming

  • Can use batch mode operations on streams of input data.
  • The data is incrementally processed.
  • Use readStream to read from a data source as a stream.
  • Actions are slightly different; some actions don’t make sense (e.g. count; you can never get a meaningful count because new data is constantly coming in). You still specify a write sink but it might be an in-memory table or some other source meant for constant incremental updates.

Machine Learning and Advanced Analytics

  • MLLib: built-in library of machine learning algorithms and utilities.
  • Aids in preprocessing, munging, training of models, and making predictions.
  • Every machine learning algorithm has untrained and trained variants (e.g. KMeans is untrained, KMeansModel is trained). You create the untrained version and then train it to get the trained variant.

Lower-Level APIs

  • RDDs (resilient distributed datasets) underlie basically all of Spark.
  • Dataframes are built in RDDs
  • RDDs expose details like individual partitions of data. They are different between Scala and Python, unlike dataframes.
  • You pretty much never need to use them. There are very limited circumstances where you might want to.

Notes on Modern Operating Systems, Chapter 2, Part 3: Scheduling

This is the last part of Chapter 2 of Modern Operating Systems at long last. I never noticed when I was in college how incredibly long the chapters in these textbooks are; most of the chapters in Modern Operating Systems are around a hundred pages. It was less of an issue with Computer Networking because the chapter was like 75% fluff and examples that one could skim over for understanding, but Modern Operating Systems is very text-dense, so breaking the chapter down into parts so I could sustain a sense of momentum and avoid feeling like I was on an endless slog through the same chapter was of great psychological benefit.

2.5 Scheduling

  • Need to choose which process gets the CPU next when it’s free.

2.5.1 Intro to Scheduling

  • Scheduling is much less important on PCs because most of the time is spent waiting for user input, not on CPU.
  • It’s much more important on batch systems, mainframes, and servers.
  • Process switching is expensive so scheduling must also mind the efficiency of CPU use:
    • Must go into kernel mode
    • Then must save all state of the current process
    • And save the memory map (the memory reference bits in the page table)
    • Then select a new process
    • Then reload the MMU with the memory map of the new process
    • Start the new process
    • And also the entire cache will be invalidated, so you have to reload it.
  • Processes alternate bursts of computing with waits on IO (disk, network, user response). Processes with long bursts of CPU usage are called CPU-bound. Those with short bursts of CPU usage are called I/O-bound.
  • These terms are defined in terms of length of CPU burst, not length of I/O wait, because I/O waits don’t really depend on the process; they’re determined by external factors like the speed of the disk or how long it takes a user to respond.
  • As CPUs get faster, processes become more I/O bound.
  • When scheduling happens:
    • When a process starts
    • When a process exits and a new one needs to start
    • When a process blocks
    • When an I/O interrupt happens—can choose to let the formerly blocked process run now that its I/O is available.
  • Clock interrupts provide an opportunity for scheduling to occur. In preemptive systems, the running process will be suspended and another scheduled after some number of clock interrupts. In non-preemptive systems, clock interrupts are not allowed to interrupt the running process.
  • Batch systems will be non-preemptive; there’s no reason to not finish running a job all at once.
  • Interactive systems will be preemptive. E.g. they need to simulate running a word processor and a web browser at the same time, neither of which have a set exit point and both of which might need to accept user input.
  • Real time systems often don’t use preemption; processes are simply written to exit very quickly.
  • Goals of scheduling algorithms:
    • Fairness. Make sure similar processes are treated similarly.
    • Keep CPU and I/O devices busy.
    • In batch systems:
    • Throughput: number of jobs per unit time.
    • Turnaround time: average time to complete jobs
    • In interactive systems:
    • Response time: make sure user gets a response quickly.
    • Proportionality: conform to users’ perception of how long something should take.
    • In real time systems:
    • Hitting deadlines
    • Predictability

2.5.2 Scheduling in Batch Systems

  • I skipped this because I don’t care about batch systems.

2.5.3 Scheduling in Interactive Systems

  • Generally each process gets a quantum, i.e. a unit of time which it is allowed to run for before the clock interrupts it. This must be chosen carefully. Too short means lots of inefficient process switching. Too long means we waste time running processes that are waiting on I/O.
  • Round Robin: just cycle through all the processes.
  • Priority Scheduling: Each process gets a priority. Choose the highest priority process to run. Degrade priorities as processes run to avoid starvation of lower priority processes.
  • Shortest Process Next: Estimate how long each process will take using a weighted average of the time previous runs took. Choose the one predicted to be shortest.
  • Lottery Scheduling: Give each process lottery tickets. At scheduling time, choose a ticket at random and run the process holding that ticket. Good for allocating fixed percentages of the CPU time; to give a process 25% of the CPU, give it 25% of the available tickets.

2.5.4 Scheduling in Real-Time Systems

  • Real-time systems often have physical devices feeding them data which must be processed on a deadline. E.g. a CD player receives bits from the laser that reads the disc and must translate them into music quickly to produce the correct sound. An autopilot system must respond to the readings of gauges and meters and adjust the plane’s movement quickly enough to account for current conditions.
  • Hard real-time: Missing a deadline is a disaster.
  • Soft real-time: Occasionally missing a deadline is tolerable.
  • The scheduler must decide how to schedule processes to not miss deadlines. However, processes in real-time systems are usually written to exit very quickly.

2.5.5 Policy vs. Mechanism

  • Sometimes the user processes have information that can be used to make better scheduling decisions. The scheduling mechanism (the actual algorithm) can be passed this information to set the scheduling policy.

2.5.6 Thread Scheduling

  • Varies a lot between user and kernel threads.
  • User-level:
    • Scheduler picks a process, which can run whatever threads it wants whenever it wants for whatever reason it wants.
    • Can end up wasting time running blocked threads
    • Can use a custom thread scheduler and less worry about separating process from mechanism.
  • Kernel-level:
    • The kernel just picks a thread and runs it.
    • Requires a full context switch to swap threads
    • Can choose the highest priority thread regardless of what process it belongs to. So if a single process has two threads that are both higher priority than anything else currently running, they can be run back to back instead of waiting for that process to get control of the CPU again.

N.B. Question 50 is #problematic.

Notes on Modern Operating Systems, Chapter 2, Part 2: Interprocess Communication

Notes on the second part of Chapter 2 of Modern Operating Systems, on interprocess communication. These sections cover the basic ideas of mutual exclusion and coordination of concurrent processes. There wasn’t a ton in this section I hadn’t seen before, but it was a good refresher and a good chance to nail down some of the terminology.

2.3 Interprocess Communication

Three issues to consider in interprocess communication:

  1. Passing information between processes
  2. Preventing concurrent processes from interfering with each other
  3. Sequencing of dependency acquisition

2) and 3) also apply to threads. 1) less so because threads can share memory.

2.3.1 Race Conditions

  • A race condition occurs when two processes are reading or writing shared data and the result depends on the order of execution.
  • The shared data can be in memory, a file, a directory, whatever.

2.3.2 Critical Regions

  • Need mutual exclusion to lock other processes out of shared data when it’s being used.
  • The part of the program that uses the shared data, requiring mutual exclusion, is called the critical region.
  • If you ensure that two processes are never executing the critical region at the same time, you can avoid race conditions.

Four conditions must hold for a good solution to this:

  1. No two processes are simultaneously in the critical region.
  2. No assumptions may be made about speed of processes or the number of CPUs.
  3. No process outside its critical region should be able to block other processes.
  4. No process should have to wait forever to enter its critical region.

2.3.3 Mutual Exclusion with Busy Waiting

In practice busy waiting is rarely useful, so I’m not going to write down any notes about it, but the book lists several bad solutions for mutual exclusion using busy waiting and goes over how they introduce race conditions and other badness.

A loop that runs doing nothing to implement busy waiting is called a spin lock.

The one useful thing the section covers is the TSL (test-and-set-lock) instruction, which is a CPU instruction that atomically sets and checks a lock variable. This instruction is useful in other, less terrible kinds of mutual exclusion.

2.3.4 Sleep and Wakeup

Want to get rid of busy waiting.

Producer-Consumer Problem

  • We’ll use this problem as an example for the mutual exclusion techniques we’re about to discuss.
  • Two processes share a fixed-size buffer
  • One process produces things and puts them on the buffer.
  • The other consumes things from the buffer and does something with them.
  • Access to the shared buffer must be carefully managed.
  • Conceptually, we want the producer to sleep when the buffer is full and wake up and add things to it when there is space. Conversely, we want the consumer to sleep when the buffer is empty and wake up and take things from it when it is not empty.

2.3.5 Sempahores

  • One way to achieve sleep and wakeup is with semaphores.
  • A semaphore is an integer variable that supports two operations: up and down.
  • down decrements the semaphore. If the semaphore is zero, the process that called down blocks until the semaphore is no longer zero.
  • up increments the semaphore.
  • Semaphores can represent “tickets” or “passes” to do a specific operation.
  • You can use them like this to synchronize two processes, imposing a certain ordering on their actions.
  • For example, in the producer-consumer problem, you can use empty and full semaphores to track how many spaces are empty and full in the buffer.
  • Then you can force the consumer to do a down on full so that it blocks when there’s on the buffer to take, and force the producer to do a down on empty so that it blocks when there’s no space for it to add to the buffer.
  • This assures that consumption only happens after production and production only proceeds to a certain point before some consumption happens.

2.3.6 Mutexes

  • A mutex is another way of using a semaphore. It only has the values 0 and 1.
  • If the value is 1, processes can enter the critical region. If 0, they block on trying to enter the critical region.
  • If your processor has a TSL instruction, you can implement mutex locking and unlocking in user space.
  • Importantly, this means you can implement mutexes so that when a thread blocks on a mutex, the thread can yield the CPU before blocking.
  • Most modern OSes allow limited sharing of memory between processes, which allows mutex sharing.

2.3.7 Monitors

  • A collection of variables and functions that guarantee only a single thread at once can access them.
  • Monitors are a programming language-specific feature. The compiler handles mutual exclusion, often using a mutex or semaphore under the hood.
  • Monitors still need a way to block processes that can’t proceed anymore so that another process can start using the monitor. They do this with…

Condition Variables

  • Two operations: wait and signal.
  • If a process can’t continue, it waits on a condition variable and blocks.
  • Then another process can enter the monitor and wake up the sleeping process by doing a signal on the condition variable.
  • You can either immediately suspend the signaling process, resuming it later, and let the awakened process run; or you can require a signal to be the last thing in a function.
  • Without the monitor’s guarantee of only one process running the code at once, wait and signal are subject to race conditions where signals cross and wakeup signals disappear into the void just as a process is about to go to sleep, causing them to never wake up.

Java objects can also be used as monitors with the synchronized keyword and the wait and notifyAll methods. This combined with threads is the basis for all Java concurrency.

2.3.8 Message Passing

  • Semaphores and monitors require shared memory, so they can’t be used in distributed systems.
  • Message passing, on the other hand, is more general and works even across machines.
  • Uses two operations: send and receive. These are implemented as system calls.
  • send gets a process ID and a message.
  • receive also gets a process ID and in C implementations it gets passed an empty message to hydrate.
  • A receiver can block until a message arrives, or return immediately with an error code.

Message passing raises several design challenges.

  • Messages can be lost by the network. You can have receivers send back an acknowledgment signal, and have senders retry the message if they don’t get it after a certain timeout.
  • Local message passing is slower than semaphores or monitors.
  • You can implement message passing with a mailbox that buffers received messages, or you can have threads block and wait on each other. The latter is called a rendezvous.

2.3.9 Barriers

  • Block a group of processes all at the same point in the program to coordinate their start on the next phase.

2.4 Classical IPC Problems

These are all well-known and there’s a ton of information out there on them, so I didn’t take detailed notes about them.

2.4.1 Dining Philosophers

Models a situation where there is possible contention for a set of shared resources, which can lead to possible deadlocks or starvation.

2.4.2 Readers and Writers

A simplistic model of a database, where multiple concurrent processes can read but only one can write at a time.

2.4.3 Sleeping Barber

Models a situation where a bounded size queue of tasks is being serviced by a single process but filled by multiple processes.

Notes on Modern Operating Systems, Chapter 2, Part 1: Processes and Threads

These are my notes on Chapter 2, Sections 2.1 and 2.2 of Modern Operating Systems, Second Edition, by Andrew Tanenbaum. The chapter is quite long, and dense with text, so I’m splitting my notes into three parts. Part 1, this part, is the basics of processes and threads. Part 2 will be Sections 2.3 and 2.4 on interprocess communication and concurrency control. Part 3 will be Section 2.5 on scheduling.

As with Computer Networking: A Top-Down Approach, I bought an old edition of this book for cheap about six years ago and happened to have it laying around. The second edition is from the year 2000, so it’s now twenty years old and somewhat out of date. But operating systems haven’t changed as much in twenty years as the Internet has, so although some of the chapters talk as if single-core processors are still the status quo, and the Case Studies section in the back talks about Windows 2000, I’ve yet to stumble on anything that makes me suspicious.

The writing style is clear, but the sections aren’t always the best organized and sometimes make references to material that hasn’t been covered yet. Partly that’s just inherent to the way operating system concepts all connect into a whole, so although it does bother me a little, it was a reasonable choice. If you can find a newer edition for under fifty dollars it might be worth picking up. There are a couple sections towards the end of the book where Tanenbaum talks about his work on MINIX and his opinions on Linux and Linus Torvalds that read like aggrieved Usenet posts cleaned up for publication, and he also constantly dunks on other people’s research, writing in Chapter 3 that research on distributed deadlock detection is not “even remotely practical in real systems” and that its “main function seems to be keeping otherwise unemployed graph theorists off the streets”. I find it highly amusing, not least because Tanenbaum must have known that the undergraduate students who would mostly be reading this book didn’t care.

Section 2.1 Processes

  • Processes pretend to run sequentially, but behind they scenes they run discontinuously as the CPU swaps between them.
  • Each process maintains its own logical program counter which is loaded into the real program counter when the CPU picks it up to run.
  • Run order and run time of processes might be non-deterministic, so you cannot make any assumptions about when processes will run or how long they will run for.

2.1.1 Process Creation

Four events cause process creation:

  1. System initialization
  2. System call by a running process (fork on Unix).
  3. User request to create a process (e.g. running a terminal command, clicking an icon).
  4. Batch job initialization (on mainframes, submitted by users).
  • fork on Unix creates new processes.
    • The child process starts with the same memory, environment strings, and open files as its parent.
    • Then it runs execve or some similar system call to change the memory image and run a new program.
    • The child’s memory image is a copy of the parent’s into a new address space. Children do not share address spaces with their parents, so the parent and child cannot access each other’s memory.

2.1.3 Process Destruction

Processes can terminate for the following reasons:

  1. Normal exit (voluntary), triggered on Unix by the exit system call
  2. Error exit (voluntary), e.g. program cannot finish, such as compiler called on nonexistent file
  3. Fatal error (involuntary), e.g. divide by zero error, attempt to access protected memory (null pointer dereference in C)
  4. Murder by another process (involuntary), e.g. someone calls kill on it.

2.1.4 Process Hierarchies

  • Do not exist on Windows.
  • On Unix a process and its descendants form a group. Process groups share some signals and other things.
  • Since all processes on Unix are children of the init process, there is also a global process group.

2.1.5 Process States

Three basic states:

  1. Running (using the CPU)
  2. Ready (stopped, but could be picked up and run at any time)
  3. Blocked (unable to run until some event happens)
  • Running ⬌ blocked can happen due to a system call made by the running process, or automatically if the process reads from a pipe or a special file like a terminal or socket when there’s no input available.
  • Running ⬌ Ready is managed by the process scheduler, which decides what process to run next.

2.1.6 Implementation of Processes

  • The operating system has a process table that stores the following data for each process:
    • Process state (running, ready, blocked)
    • Program counter (i.e. address of the next instruction to execute)
    • Stack pointer (i.e. address of the top of the runtime stack)
    • Memory allocation
    • Status of open files
    • Accounting and scheduling information
    • Anything else that needs to be saved between process switches

The process table enables interrupt execution. (i.e. hardware interrupts, which “interrupt” the currently running process to deliver something from a hardware device)
1. Hardware pushes the program counter, registers, state, etc. to the stack
2. Hardware loads a new program counter from the interrupt vector, a special location in memory assigned to each class of IO device (e.g. hard disk, timer, terminal devices in mainframes).
3. Assembly language procedure saves the current values of the registers in the process table.
4. Assembly language procedure sets up a new stack.
5. Interrupt service written in C runs. It reads input from the IO device and buffers it into memory.
6. Now that the input from the hardware has been pulled in, the scheduler runs and decides which process to run next.
7. The C procedure returns control to assembly
8. Assembly starts the new current process; it loads up the registers, resets the program counter and stack pointer, etc.

2.2 Threads

2.2.1 The Thread Model

  • A process can have multiple threads of execution
  • Processes group resources together; threads are the thing that actually runs on the CPU.
  • Each thread has its own:
    • Program counter
    • Registers
    • Runtime stack with execution history
  • All the threads in a process share:
    • An address space (so they can access each other’s memory)
    • Global variables
    • Open files
    • Child processes
    • Pending alarms
    • Signals and signal handlers
  • All threads in a process share memory so they can read and write each other’s variables and even stacks. This is part of the benefit of threads; it allows more efficient cooperation between concurrent entities (compared to processes, which can’t read each other’s memory and thus must use other means of communicating).
  • Threads have four states:
    1. Running
    2. Blocked
    3. Ready
    4. Terminated
  • Processes start with one thread and create more using library calls.
  • Threads can exit by calling another library procedure, e.g. thread_exit.
  • In some systems threads can wait for other threads to exit by calling thread_wait.
  • thread_yield gives up the CPU for another thread. The CPU won’t actually stop a thread until a process switch occurs, so threads in the same process have to yield to each other.
  • Threads introduce many complications, including synchronizing their shared memory and coordinating the yields so you don’t waste time running blocked threads.

2.2.2 Thread Usage

  • Threads let a single application do multiple things, modeled as sequential jobs running side by side instead of as a mess of interrupts and signals.
  • On multiprocessor systems or multi-core CPUs, multiple threads can actually run simultaneously. On single-core systems threads will switch on and off the one core.
  • Threads can start up faster than processes since they require fewer resources to get going.
  • Threads can block and cooperatively switch off since they can signal each other using their shared memory.

2.2.3 Threads in User Space

  • Processes manage their own threads in user space.
  • The kernel knows nothing of threads.
  • Each process has a thread table with the threads’ program counters, stack pointers, registers, etc.
  • The threads run on a run-time system, a library of thread-managing procedures


  • Thread switching is very fast since applications don’t need to trap to the kernel to switch.
  • Applications can implement custom thread scheduling.
  • User-space threads scale better since they don’t require space in the kernel.


  • If a user-space thread makes a blocking system call, it will block without giving you a chance to switch threads before handing off control to the kernel, because the kernel knows nothing of threads and cannot give you that chance. This means the whole process is blocked.
  • If a thread causes a page fault, e.g. because it needs to run some code that isn’t loaded into memory yet, it gets blocked, again without allowing any chance to switch threads.

Possible solutions to the drawbacks (that all suck):

  • The select system call can detect when a system call will block. You can use it before making a possibly blocking system call, and if it detects that the system call will block, you can switch to another thread and make the system call later. E.g. if the current thread needs to read some data, but the read call will block on a disk read, you can switch threads until the interrupt executes later and the data is buffered in memory, and then switch back to that thread.
  • Using select sucks because you have to wrap all your blocking system calls to do the select check, so you have to modify the kernel, and it’s also annoying and tedious logic to implement.
  • You can also use a clock interrupt to stop the currently running thread every so often and check if it’s been blocked and should be swapped for another thread.
  • Using a clock interrupt sucks because it’s an ugly hack (you might still let the blocked thread have the CPU for a while before the clock interrupt stops it), and apparently each process only gets one clock so if you use it to manage thread swapping you can’t use it for anything else.
  • System calls can all be rewritten to be non-blocking. E.g. rather than waiting for input to be available, a read call would just return 0 (for 0 bytes read) and you could try again later.
  • Rewriting system calls sucks because you have to rewrite a bunch of system calls around threads and that’s not cool.

2.2.4 Threads in the Kernel

  • You can instead put threads in the kernel. The kernel then manages the thread table and thread swapping instead of the user program.
  • All blocking calls are system calls and when a thread blocks, the kernel can choose another thread to run. If there is no suitable thread in the same process, the kernel can choose to let a different process run for a while.
  • Creating, destroying, waiting, etc. are all more expensive since they are system calls.

2.2.5 Hybrid Implementations

  • The idea is to combine the lightweight nature of user-space threads with the ease of swapping out blocked threads that comes with kernel threads.

2.2.6 Scheduler Activations

  • The kernel assigns each process some virtual CPUs (or they can request and release them). [It’s not actually clear to me from the text how the virtual CPUs play into this process.]
  • The kernel can detect when a thread blocks and notify the process through a mechanism called upcalls.
  • Upcalls work by activating the process’s runtime system at a known starting address, similar to a Unix signal. Once activated, the runtime system can choose to run another thread instead of the blocked one.
  • When a hardware interrupt runs, the process will be resumed and the runtime system can decide which thread to run. It can run the thread that was interrupted, or, if the hardware interrupt completed something that a blocked thread cares about (like reading in some data and buffering it in memory), it can choose to run the blocked thread, or it can just choose to run a completely different thread.
  • This system isn’t great because you now have two-way calls between the kernel and user space, whereas a normal operating system has a nice layered model where user space calls the kernel but not the reverse.

To be continued next time with interprocess communication. And then…in the epic final part of Chapter 2…SCHEDULING!!!

Notes on Computer Networking: A Top-Down Approach Chapter 2

These are my notes on Computer Networking: A Top-Down Approach (4th Edition) Chapter 2, Application Layer.

I don’t particularly recommend this book. It came out in 2008, and a lot’s happened to the internet since then. It’s not that well organized; you’ll see if you read the notes that they take some bizarre leaps from topic to topic. And it tends to be extremely wordy; these notes condense almost 100 pages of information. Because of that, you’ll notice that I didn’t take notes on every section. But I happened to have this book laying around, so I read it. I’m putting up the notes for my own future reference and in case someone else someday finds them useful.

2.1.1 Application Architectures

  • Client-server: one process (client) initiates communication, another (server) responds.
  • P2P: No central server, any process can act as either a client or a server.
    (Note: “process” refers to an operating system process, i.e. a running program.)
  • Sockets are the interface between the transport layer (of the OSI layer model) and the application layer. A process listens to a socket for messages and can put messages into the socket to send them across the network to the other end.

2.1.3 Transport Services Available to Applications

  • Messages pushed into a socket are sent across the network using a transport layer protocol.
  • There are four services transport layer protocols could theoretically provide.
    • Reliable data transfer
    • Throughput guarantees (e.g. guaranteed delivery of bits at 50 kbps)
    • Timing (e.g. guaranteed receipt of messages within 100ms)
    • Security

2.1.4 Transport Services on the Internet

  • TCP
    • Uses handshaking to establish a connection between client and server
    • Guarantees lossless data transfer
    • Implements throttling to reduce congestion over the network; this means it provides no timing or throughput guarantees.
  • UDP
    • Does not use handshaking or establish connections; packets sent over UDP might just disappear and never reach their destination.
    • Does not throttle.
    • Theoretically, UDP would be good for loss-tolerant time-bound applications, since the lack of throttling often means more throughput and faster delivery. In practice, UDP is usually blocked by firewalls, so no one uses it.
  • Neither protocol provides throughput guarantees, timing guarantees, or security, and UDP does not provide reliable data transfer either, which makes it puzzling that we took the time to go over those four services above. Applications built on these protocols must cope with the lack of these services.
  • SSL is an application-layer add-on to TCP that provides encryption; other application layer protocols send their messages through SSL, which encrypts them and sends them into the socket, which uses TCP to send them to the other end, where the encrypted message is read from the socket by SSL on the other end and decrypted before being passed to the application layer protocol on that end.
  • The transport layer uses IP addresses and port numbers to find the correct destination machine and socket on that machine.

2.2 The Web and HTTP

  • HTTP is defined by RFCs 1945 and 2616.
  • An HTTP server stores no information about the clients, so HTTP is a stateless protocol.
  • HTTP uses persistent connections by default, i.e. after opening a TCP connection, it will keep that connection open and reuse it for a period of time for subsequent interactions. This increases performance; establishing a connection requires an extra network round trip to go through the handshaking process, and a single web page might require several requests to fully load (to get the HTML, Javascript, CSS, images, etc.), so reusing a connection reduces that overhead.

HTTP request message format:
– \<HTTP method, e.g. GET, POST> \<URL> \<version, e.g. HTTP/1.1>
– <header name>: <header value>
– <More headers>
– Empty line
– <Entity body, which can contain any data>

HTTP response message format:
– <version> <status code, e.g. 400> <explanation of status code, e.g. “Bad Request”>
– <header name>: <header value>
– <More headers>
– Empty line
– <Entity body>

2.2.4 Cookies

  • Defined in RFC 2965
  • Response header Set-cookie tells client to store something, such as a user id, in a cookie file, keyed by the host name.
  • Request header Cookie contains the value set by the host to which the request was made.
  • Cookies allow servers to maintain information about clients even over the stateless HTTP protocol.

2.2.5 Web Caching

  • Cache make go fast
  • The cache can send requests with an If-Modified-Since header; if the object being requested has not been updated, the server will return status 304 Not Modified, telling the cache it can return the version it has.
  • The book doesn’t mention this, but ETags are another mechanism used for cache validation.

2.3 FTP

  • FTP allows navigating a remote filesystem and transferring files to and from that filesystem.
  • FTP uses two TCP connections
    • A control connection on port 21 sends user id, password, and commands
    • A data connection on port 20 sends actual files.
  • Using the separate connection for control information is called sending control information out of band.
  • FTP is not stateless; it stores username, password, and current directory for clients.
  • Control connections are persistent; they are kept open for a period of time and reused.
  • Data connections are not persistent; they are opened by the server in response to user commands, used to send or receive a single file, and closed.

2.4 Email

  • Uses two types of protocols: mail send (SMTP) and mail read (POP3, IMAP).

2.4.1 SMTP

  • Uses TCP’s reliable data transfer.
  • Has both a client and server mode. Message senders are clients, message receivers are servers.
  • All text in an SMTP message must be 7-bit ASCII because the protocol is extremely old.
  • Uses port 25.
  • Handshakes to establish a connection.
  • Uses a single TCP connection to send all outgoing messages it has.
  • Is a push protocol—clients mainly send data to servers. By contrast, HTTP is mainly a pull protocol; clients typically receive data from servers.
  • RFC 822 defines SMTP and the headers for sending ASCII text.
  • MIME (Multipurpose Internet Mail Extensions) defines headers for sending non-ASCII text.
    • Defined in RFCs 2045 and 2046.
    • Uses the Content-type and Content-Transfer-Encoding headers.
    • Content-type tells the receiver what the data actually is, e.g. image/jpeg, audio/vorbis, text/csv.
    • Content-Transfer-Encoding tells the receiver how the data was converted to 7-bit ASCII text, e.g. base64.

2.4.4 Mail-Access Protocols

  • Since SMTP is push-based, it can’t really be used to retrieve mail by non-server clients like desktop PCs.
  • POP3 and IMAP are pull-based protocols for email clients to retrieve mail.
  • POP3 is very simple; it cannot manage email directories on a remote server. It has “read and delete” mode, where it downloads mail and deletes it from the remote server, as well as “read and keep”, where it leaves the mail on the server.
  • IMAP is much more complex.
    • It can manage mail directories.
    • It can obtain parts of messages, which can be useful when bandwidth is low; you can avoid downloading large video and audio files.
  • Web mail clients like Gmail just use HTTP to retrieve messages. POP3 and IMAP are useful for desktop clients like Thunderbird and Outlook, or for terminal clients like Mutt. Gmail and other web-based email services sometimes provide POP3 or IMAP access so you can use clients like Thunderbird and Mutt to read your mail. Information to access Gmail using SMTP and IMAP from an email client.

2.5 DNS

  • DNS (Domain Name System) translates human-readable domain names to numeric IP addresses.
  • The term “DNS” actually refers to two things:
    • A distributed database implemented in a hierarchy of DNS servers.
    • An application-level protocol for querying this database.
  • DNS servers often run the Berkeley Internet Name Domain (BIND) software.
  • DNS runs on UDP and uses port 53.
  • Clients implementing other protocols (e.g. HTTP, FTP, SMTP) will include a step where they use DNS to translate user-supplied hostnames into IP addresses.
  • DNS can add a delay to requests, but the delay is minimized with caching in nearby servers.
  • DNS also provides:
    • Host aliasing: can translate a short alias into a longer canonical hostname
    • Mail server aliasing: can give a web server and mail server the same alias but different canonical hostnames
    • Load distribution: (proceed with caution; I have doubts that this is still a good idea. I asked a question on SE about this: DNS can map a hostname to a set of IP addresses. You can map a hostname to a set of replica servers all running the same application. The DNS server will (according to the book) rotate the order of the IP addresses it returns, and clients will (according to the book) always take the first one, so requests will be distributed across your servers.

2.5.2 How DNS?

DNS uses four types of servers:
– Root DNS servers. There are 13 of these in the world, controlled by various institutions and entities. They are the first stop in a DNS lookup.
– TLD DNS servers. These each service a top-level domain (.com, .edu, .gov, .jp, .fr, .tv, etc.). Different companies own the different TLD servers.
– Authoritative DNS servers. The owner of a domain controls the authoritative DNS server for the domain.
– Local DNS servers. These act as brokers / caching centers for a local area.

A basic example of a DNS interaction.
1. A client requests an internet connection to an ISP. The ISP provides the address of one of its local DNS servers using the DHCP protocol. The local DNS server will be within a few routers of the client for speed.
2. The client will send requests for hostname lookups to its local DNS server. The local DNS server will forward them to a root DNS server.
3. The root DNS server will send back the identity of a TLD server that can handle the request. E.g. if the hostname is, the root DNS server will send back a TLD server that can handle .com.
4. The local DNS server contacts the provided TLD server. The TLD server looks at the entire domain and returns an authoritative DNS server for that domain. E.g. if the hostname is, the TLD server for .com will send back an authoritative DNS server for
5. The local DNS server contacts the provided authoritative DNS server. The authoritative DNS server will provide the IP address of a machine mapped to the hostname, possibly after communicating with other authoritative DNS servers to resolve subdomains. E.g. the authoritative DNS server for might contact another authoritative DNS server to resolve to an IP address.
6. The local DNS server returns the IP address to the client. The client uses that IP address to send a request.

The local DNS server can cache a lot of the information it receives for later reuse. This allows it to skip a lot of the steps. For example if the local DNS server caches the IP of, it can immediately send that IP address back to any client who requests it. It can also cache the authoritative DNS server for, so if a client requests any subdomain of, it can go straight to that authoritative DNS server instead of going through the root and TLD servers again.

DNS cache is busted after a period of time, typically 24–48 hours.

2.5.3 DNS Messages and Resource Records

  • A resource record is a four-tuple: (Name, Value, Type, TTL)
  • The TTL is the time before the record should be evicted from cache.

Four types:
A: Name is host, Value is host’s IP address. Nowadays we also have AAAA for IPv6 addresses.
NS: Name is domain (e.g., Value is hostname of authoritative DNS server for that domain, e.g.
CNAME: Name is an alias, Value is the canonical hostname.
MX: Name is an alias, Value is the canonical hostname of a mail server. This exists so a mail server and another kind of server (such as a web server) can have the same alias but different canonical hostnames, so you can have point to both the web server and the mail server Web clients will request the CNAME record for and get back the value, while mail clients will request the MX record and get back the value

Note1: There are actually an assload of DNS record types.
Note2: On MacOS and Linux (probably also through WSL on Windows) you can use the terminal command dig to do DNS lookups.

Format of a DNS message:

16 bit ID 12 bits of flags
number of questions number of answers
number of authority records number of additional records
  • The 16-bit ID links requests and responses together.
  • The flags indicate such things as whether the message is a query or reply, if the queried server is authoritative for the queried name, or whether to allow recursive queries (a DNS server receiving a query and querying another DNS server to fulfill it).
  • The numbers indicate the number of DNS records returned in the later sections.
  • Questions are queries with host names and types (A for host address, MX for mail server canonical name, etc.)
  • Answers are only in replies and are the answers for the queries (IP addresses for A queries, canonical names for CNAME and MX queries). You might get multiple answers when a hostname has multiple IPs.
  • Authority is the records of other authoritative servers.
  • Additional depends on the query but will contain other useful information. E.g. for an MX query, the answers section will contain the canonical hostname of the mail server, and the additional section will contain the A records mapping the canonical name to an IP address so you don’t have to query again for them. Similarly, for an NS query, the answers section will have the canonical names of the authoritative DNS servers, and the additional section will have the A records for the IP addresses. You can mess around with dig to see these.

Records are inserted to DNS by paying a registrar, which are authorized by ICANN, to get your hostname mapped to your authoritative DNS server in a TLD server for your top-level domain.

There were two more sections, one about P2P protocols and BitTorrent, another about implementing custom application-layer protocols in Java by using the various socket classes. I read the P2P section but chose not to take extensive notes on it. I skimmed the implementation section.

Aphorisms from The Pragmatic Programmer(‘s first half)

I can’t remember the last time I read more than halfway through a tech book. Tech books tend to peak early. You get more than half the value of the first half of the book. A lot of times the second half is specialized or optional topics and a lot of times those topics are out of date unless you buy the book the second it’s off the presses. Of course, I’m also lazy, and about halfway through a tech book I’m usually starting to lose interest and decide I’m okay with building CRUD web apps for the rest of my life if it means I don’t have to read any more of this.

I still haven’t made it all the way through The Pragmatic Programmer either, but I’ve made it further than I usually do—about two-thirds of the way through. And I actually want to read the rest of it. I don’t always like its metaphors or cutesy terminology like “binary chop”, but it’s memorable and easy to read.

The Pragmatic Programmer is structured as a list of 100 tips. Some of them cover code structure and design (“Tip 47: Avoid global data”). Some cover development practices (“Tip 28: Always use version control”). And some relate to personal development (“Tip 11: English is just another programming language”), philosophy (“Tip 3: You have agency”), and ethics (“Tip 99: Don’t enable scumbags”). Each tip cross-references related tips. The end result feels more like reading a bunch of blog posts than a single coherent book. The tips are grouped broadly by topic, but a single tip can go from explaining a technical concept to coding advice to rats chewing through cables. As “pragmatic” suggests, the book imposes no hard lines on its topics and the commentary on the tips goes wherever it needs to go.

In the rest of this post, I’ll go through some of the tips and my idiotic hot takes on them.

Tips 1–4

I tried to read The Pragmatic Programmer for the first time about six years ago. I looked at the Kindle sample, which only had Tips 1–4. I thought, “Wow, these guys are smug as hell”, deleted the sample, and didn’t come back to the book for another six years.

Tips 1–4 are basically the preamble that some college professors put in their syllabi where they lecture you about how you’re an adult now, you need to take responsibility for your own actions, only you benefit from attending this course and finishing your assignments, so on, yada yada, etc. If you’re a bit of a beaver-cleaver, you’ll read these four tips, pat yourself on the back for being so responsible and never making lame excuses, and a deep contentment will warm you from the heart outward for the rest of the day. Otherwise you can just skip to Tip 5. Go on, be a rebel. I won’t tell.

Tip 5: Don’t Live with Broken Windows

When it comes to software rot (or “technical debt”), there is definitely some truth to this idea. If developers feel the code is badly designed and full of hacks, they will contemptuously pile more hacks on top. I’ve seen it happen firsthand. We spent seven months talking about rewriting the entire codebase because the whole thing was such a cancerous dumpster fire that it could never be improved. Then most of the team left because the code was such a cancerous dumpster fire, and about a year after that, management finally gave the go-ahead to rewrite the code.

On the other hand, The Pragmatic Programmer pushes an absolute zero tolerance policy towards broken windows, which has not been realistic in my experience. All the codebases I’ve worked on were written as quickly as possible by people who are long gone. They’re usually bad in various ways. Sometimes in ways that are easy to fix—bad formatting can be fixed by automated tools, and fancy IDEs like IntelliJ IDEA can stick a big glowing flag on certain kinds of anti-patterns and code smells. Sometimes you can go back and patch up the rot and make sure to do better going forward. But sometimes the broken windows are so fundamental or widespread that you can’t fix them without tearing down huge swathes of code and rebuilding deep-seated parts of it. And sometimes you simply aren’t allowed the time to clean up bad code, because your team is tiny and you can’t afford to put a whole developer on a cleanup job for a month unless customers are complaining about it.

However, sometimes, even if you can’t fix the broken windows, you can at least quarantine them so most of your daily work doesn’t touch that code. One codebase I worked on had an absolutely hideous subsystem where a developer wrote all the code in a single god class of 500-line methods that mutated local variables and then returned them to be passed as arguments into other methods that also mutated them. It was horrid, and so brittle that with the original developer gone there was no chance of fixing it without breaking something, but on the bright side, it was all in a single god class, so we could just never look at that class until we had to.

Tip 9: Invest Regularly in Your Knowledge Portfolio

Steve Yegge points out in several of his blog rants that the half life of technical knowledge can vary a lot. So while it might be tempting to study the hip new language or framework that everyone on Hacker News is talking about, it’s a good idea to invest in long term assets as well.

Algorithms, data structures, math, and statistics rarely, if ever, go out of date. Unix tools are practical, have been around forever, and show no sign of going away, so getting familiar with grep and sed can be a great investment. It’ll pay off when you have some weird bug that only happens on the third Thursday of every month and you can bust out a grep command to find all those logs in your log files without even looking at the man pages. find, tr, cut, awk, curl, nc, jq, and screen are also great commands to look into, as well as the shell language itself. Recently I finished a script in four hours that I thought was going to take two days, because I realized I could replace a bunch of API calls and ugly JSON structure rearrangement that I was going to write out in Python with a short bash script using curl and jq. Editors like Vim and Emacs have also been around forever, and learning some of the more powerful commands can save you a ton of time. For a while, I had to debug data pipelines, which usually meant receiving a CSV of data that someone wasn’t satisfied with, taking the UUIDs out of it, and querying a bunch of tables to find out where in the pipeline data got dropped or corrupted. cut and Emacs saved me a ton of time; I could use cut to extract the UUIDs from the CSV, then paste it in Emacs and use replace-regexp to quote them and reformat them into a comma-separated list that I could just paste into an IN clause of an SQL query. Which brings me to SQL—it’s been around since the 70’s, and unlike disco, it doesn’t seem to be going anywhere, so you could do worse than to study it. So many of the weird bugs ORMs cause only become clear once you understand the SQL it must be generating.

The Pragmatic Programmer also suggests investing time in some of the weirder languages to have a bigger stock of ideas to draw on. I think this idea bears repeating. I see too many people online who know Python, Ruby, and Javascript and want to have a language war over which of those three is better. I also see people who know Python, Ruby, and Javascript complain that Java is ancient, inscrutable gibberish that somehow is also the most widely used language in the world. I was this person. When I first started learning to program, I wrote a huge rant (that I thankfully never published) comparing Visual Basic, Java, and Python. My arguments were that Visual Basic was better than Java because you have to write End If, so you know it ends an If, whereas in Java it’s always just a brace, but Python was the best because whitespace. I was stupid. What I was doing was equivalent to reviewing three books which all cover the same material in the same order and only comparing what fonts the text was printed in. I definitely support spending time with Icon, Prolog, OCaml, Io, or Idris, and learning just how different a language can be, before discussing differences between languages.

Tip 10: Critically Analyze What You Read and Hear

I majored in computer science, but before that, I was an English major, and before that, I spent ten years wanting to be an English major. One of the most important things you learn studying humanities is exactly what this tip says—how to critically analyze what you read and hear.

When you’re evaluating a new technology, be critical about the claims. Don’t just accept flashy benchmarks; probe the methodology behind those flashy benchmarks. Think about the context behind claims: maybe a group at Amazon is building great things with an experimental JIT compiler for Lua, but does that mean your team will be able to do the same? Maybe the software actually isn’t business critical and only has five users. Maybe the creator of the experimental JIT compiler is on the team and can fix any bugs they run into. Maybe Amazon, being Amazon, can make it worth the creator’s while to solve their problems.

Being critical doesn’t mean being constantly and unreservedly cynical. You can be positive about things. You can love Rust if you want. But you should love Rust because you critically analyzed the claims made about it and decided they were reasonably accurate, and you should always be on the lookout for new information that contradicts your current point of view. When you find that new information, critically analyze it and decide if it might be true or not. Then decide if you need to care. Even if your favorite language chokes on 5 HTTP requests per second, you can still enjoy hacking together a hobby project with it. But be honest; don’t convince your company to build something in your favorite language when there’s a strong chance the system will need to handle 1,000 HTTP requests per second.

Tip 14: Good Design is Easier to Change Than Bad Design

This is true, but no design is infinitely flexible. You always have to make some choices that will lock you in. I’ve seen (and written) a ton of code that was easy to extend in a way that turned out to never be necessary. I’ve also seen (and written) a ton of code that was built to be extensible, but not quite extensible enough to capture a later use case. This is where the tips about making your design orthogonal and decoupled come in. If you can rip out a whole module and replace it with something else, then you can change it, even if it’s not as easy to change as it would be if you’d thought of something when you first designed it.

Tip 17: Eliminate Effects between Unrelated Things

This is the larger principal behind a ton of well-known best practices. Don’t use global variables, because they provide a channel for two unrelated things to affect each other. The C preprocessor is a pain because unrelated files can affect each other if they happen to be compiled together. I also find a lot of semi-common practices in hyper-dynamic languages like Ruby, Javascript, and Python to have a lot of potential to create effects between unrelated things. Javascript didn’t used to have real modules, so libraries would shove things in the global namespace. It was a complete nightmare to deal with libraries fighting for symbol real estate in the window object. (JS for some reason allows just $ as a function name, so everybody fought over $ until jQuery came in and definitively staked claim to it.) I’ve had some infuriating bugs due to Ruby’s ability to define a class across several files combined with Rails’s implicit global imports. In Ruby, if you define a class with the same name in different files and load both of those files, they will be combined into a single class. Rails will automatically load every file you put in its blessed directories. So I’ve encountered ridiculous bugs where two programmers made classes with the same name in different contexts, and Rails helpfully imported both files, causing Ruby to combine those two unrelated classes into a single class and breaking code in two places that had nothing to do with each other.

Tip 19: Forgo Following Fads

This tip is pretty useless. The idea is good, but it offers no guidance on figuring out what’s a fad and should be ignored. I’ll offer a pointer back to Tip 10, though: critically analyze what you read and hear.

Tip 20: Use Tracer Bullets to Find the Target

This tip is awesome and the book is worth the cover price just for this.

The basic idea is that when you start a new system, you build an end-to-end skeleton of the entire architecture with some example flows. Stub out anything you have to, just get the end-to-end working as quickly as possible. This will show you what the architecture you have planned will look like in practice. It’s called a “tracer bullet” because it lets you see where you’re aiming, and you can assess how close your shot is to the target.

The book suggests showing your skeleton to customers. This probably depends on your organization’s culture and relationship to customers. Nothing I’ve worked on benefited from being shown to customers. The customers would get hung up on details like the color of the banner on mocked up web pages, or they would seize on random technical details that they happened to know about (“It won’t allow escalation of privilege, right?”, “The JSON responses are small and streamlined, right? That makes it fast.”, “Did you use Angular? Angular makes rich interfaces.”), or they would look at it, grunt, and go back to whatever they were doing. But it can be a great benefit to show it to engineering leadership, or to engineers on other teams, or product managers or UX experts. And if you happen to have customers able to give useful feedback, sure, show it to them.

Tip 32: Read the Damn Error Message

Part of this tip discusses rubber ducking, which is where you explain your problem to a rubber duck because your problems are idiotic and should not be inflicted on other humans. Because your problem is so idiotic, the rubber duck’s silence will make you feel shame, causing you to quit the industry and take up competitive bass fishing.

In case you don’t want to quit the industry yet, I’ve found that a more productive strategy than talking to a duck is writing out what the buggy code is doing as if you’re manually running through something you’ve written for an interview. If you don’t know why something happens, note that and the result and move on. Start a little before the bug, and keep going to a little bit after—the moment where the exception is thrown or the erroneous result returned is usually a good place. For me this usually causes something to click eventually, and I see what I’m missing. It can also help, once you’ve isolated the buggy code and the bad result it outputs, to treat it like a brainstorming session—what are five ways this bad result could have been created? Get completely crazy; “solar wind knocking an electron out of alignment, causing the bits to swap” is on the table. How do those five ways relate to the surrounding code? What else is going on in the program at the same time this code is executing?

Tip 37: Design with Contracts

I’ve come around a little on statically typed languages. Java is still kinda painful to write, but at least it’s also kinda painful to write for that guy who was three months out of university when he was asked to write an entire analytics tool in a week and didn’t have time to write comments or real error handling—his code is probably a mess, but at least Java forced him to leave a few more breadcrumbs about what the hell he was doing than Python or Javascript would have. And newer languages like Go and Rust reduce a lot of the annoying parts of Java by adding better type inference and other goodies.

Statically typed languages force you to write basic contracts for every function. It’s nicer to write contracts that express some real invariant about your data—“x represents an age, so it must be between 0 and 200″—but at least “x is an integer” is more than nothing.

But contracts are most important for system boundaries, where different people’s code comes together, and nowadays that often happens over a network. So that’s where statically typed data transfer formats, like protocol buffers, become useful.


There’s lot of good advice in The Pragmatic Programmer, and most of the tips I didn’t bother to discuss because I just nodded when I read them. But there’s also some stuff implicit in the book’s worldview that I don’t agree with—I have a whole rant about “people are not their code” that I might go into when I finish up the book. I recommend you read it, but that you keep in mind Tip 10. Don’t be a beaver-cleaver. Be a cool rebel who studies humanities. Donne is subversive, man.

Why Putting JSON in a Database Column is a Bad Idea. Or, Why Would You Ever Put JSON in a Database Column?

When I was a beginner at databases, I was tempted to put JSON in a string column. It’s a weirdly common thing for beginners to relational databases to want to do. A lot of tutorials aimed at beginners suggest doing it for some reason. When people show up on Stack Overflow wanting to know why their code isn’t working, and that code happens to be putting JSON in a relational database column, sometimes someone will show up and tell them to quit doing that, but it’s only like a 50/50 shot, whereas pretty much every other questionable beginner practice will attract hordes of smartasses, some of them barely more than beginners themselves, who will strongly and sometimes stridently censure the beginner for contemplating it. I attribute the popularity of Mongo and other NoSQL databases in part to this instinct: Mongo is basically a database with nothing but string columns that have JSON in them, and some search functionality for JSON in those string columns.

If you’re sometimes tempted to put JSON in a string column, I hope I can explain today why it’s a bad idea and you shouldn’t do it, in a way that will make sense. On the other hand, maybe you’re one of those relational database savants who understood third normal form right away, and you don’t understand why anyone would ever want to put JSON in a string column instead of having a real schema. If so, I hope I can explain to you why someone might want to put JSON in a string column.

Strong Static Typing, Relational Data Modeling, and Escape Hatches

In “Is Weak Typing Strong Enough?”, Steve Yegge describes (buried somewhere in the point he was actually making about programming languages) how teams that he worked with at Amazon were too boxed in by the strong statically typed strict schema the relational data models imposed on them, and resorted to tactics like an untyped name / value system and passing an XML parameter as a string to a CORBA interface. (CORBA was a standard for an early style of distributed services that let you treat objects running on remote servers as if they were in the same address space and could call each others’ methods. It was also language agnostic, so you could run Java on one server and APL on another server and they could use their CORBA implementations to call each others methods. At least, that’s what Wikipedia said; CORBA predates my programming experience by some years.)

The point here is that relational database models are a sort of strong static typing. No matter how much you love strong static typing, sometimes it’s not flexible enough, and you need escape hatches. In 2005 when Steve Yegge wrote his piece, it was XML strings through CORBA. On a programming language level, it’s things like downcasting from Object in Java or using the dynamic keyword in C#. Storing JSON in a string column is another one of these escape hatches. What you get is flexibility. Let’s keep this in mind as we go into our big example, which I hope will demonstrate both why you shouldn’t put JSON in string columns and why you might want to sometimes.

Hugs and Warm Fuzzies: An Example

This example is loosely based on a real system I worked on that had JSON in string columns.

Let’s say we’re working on an MMORPG called Crayon Art Online. Unlike most MMORPGs, which are all about killing things, Crayon Art Online is all about love and friendship and positivity.

Every few hours, one of our servers needs to kick off a job that will read a bunch of player actions from the game’s main database and calculate metrics on them, storing them in a separate metrics database. Since the game is all about doing friendly, happy things, the actions will be things like hugs_given, crayon_drawings_gifted, lunches_bought_for_others, pep_talks, and positive_affirmations. There will be some actions that have extra associated metrics, like hug_warmth and affirmation_positivity and pep_talk_peppiness. There will be some conditions where we just check for a Boolean answer, like sculpting_class_taken?, which just has a true or false answer. We need to calculate all these metrics for all the players across different periods of time: the past 30 days, the past 90 days, the past year, and the entire period the player has existed. Then we need to store them in a database so another job that runs later can read them and distribute prizes to the players that reach certain goals, and messages of encouragement and hope to the others.

Let’s go into some aspects of our data that we’ll have to model. We have a ton of different actions–let’s say there are about 25 total. It’s not a fixed set; there will be new metrics if the developers add new actions to the game, and they might also remove actions, so we would lose the metrics associated with those. That means making a separate table for each metric is doable, but a huge pain. Every time we add a new metric, we have to add a new table. Reading all the metrics for a player means querying 25 tables. A simple manual check for data consistency between the main game’s database and ours requires us to write a select command that includes 25 tables. It seems there must be an easier way.

However, one table also clearly won’t work unless we store all the values as strings. We need a value column to be able to store integers (hugs_given), floats (average_positivity), Booleans (is_wearing_floppy_hat?), and strings (floppy_hat_color). So another option would be one table per value type: int_valued_metrics, real_valued_metrics, boolean_valued_metrics, and string_valued_metrics. But that’s pretty tacky. Your data model is supposed to model the data, like it says in the name. The data itself doesn’t naturally divide into these different types; they’re all just metrics, with the same fundamental features: a name, a time period, a user, and a value. Our database engine needing to store the value as an integer or float or whatever is an implementation detail, not an actual part of the domain, and it feels unnatural to spread the data model for the single metric concept across four tables (and counting, if you add types later—maybe at some point you have tuple-valued metrics, or BLOB-valued). This is better than 25 tables, and it’s better than one table, but it’s still not good.

At this point, things look pretty dire. Relational data modeling has failed us. But wait, you realize. Every programming language nowadays, well except for Java anyway, has a powerful JSON parsing engine built in that can turn a string of JSON into a native map, with type conversions and everything! So, why don’t we just store metrics as a string of JSON per user?

Strings of JSON have lots of attractive qualities. They’re easy to understand; JSON is dead simple, by construction, whereas SQL and relational data models are complicated and hard to understand. Strings of JSON are easy to parse into whatever kind of map or hash thingy your language supports (unless you’re using Java or Scala), and once you’ve done that, you know exactly what you can do with them. Reading from a relational database usually requires some annoying library that binds the data to native data types, or else you can make raw SQL calls, get handed back the most useless, primitive thing your language supports, and do the conversion yourself. JSON is dynamically typed, so you don’t have to worry about finding the correct data type; just store the string, and when you read it back the parser will figure out what types everything should be. JSON is extremely flexible, letting you store anything from a simple glob of key-value pairs to a bramble of arbitrarily nested lists and maps, whereas with relational data models you’re stuck with two dimensional tables.

So, let’s just forget all this relational modeling stuff and make a table that maps a user id to a string of JSON that looks like this:

    “hugs_given”: {“30_days”: 5, “90_days”: 10, “1_year”: 85, “all_time”: 345},
    “has_taken_respect_seminar?”: true,
    // ...

Now it’s easy to look up a user’s stats. Just select the_json_string from metrics where user_id = 394. Now you have a JSON string with all the metrics. Now, whenever you add a new metric, you just have your code calculate it and stick it in the JSON object before writing it back as a string. No schema changes, no data model cabal to get through, just code sticking things in JSON. If you decide integer-valued stats should also support a count for the last 42 days, just calculate it and stick it in everyone’s JSON.

But here’s a problem: suppose you want to find all the users with more than 26 hugs given in the last 30 days. How do you query this data? SQL can’t read JSON, so you can’t do select * from metrics where the_json_string.hugs_given.30_days > 26. That JSON type is opaque to the relational database. (Unless you’re using PostgreSQL. We’ll pretend you’re not. The real project I worked on used MySQL, but I believe Oracle and MS SQL Server also lack a JSON type.) So how do you write this query? You can write your own terrible ad hoc JSON parser using regexes and LIKE right there in your database console, or you can write a script in some language like Python or Ruby that reads in the data, parses the JSON string, and does the check for you. Writing this script can also get surprisingly tricky. If the entire table doesn’t fit in memory, you can’t just read all the rows and loop over them. The simple approach would be to get the maximum id of the table and then select each row by id, skipping over misses, until you’re done. But this incurs a disk read on each iteration of the loop, plus possibly a network call if you’re not allowed to run the script right on the database server, so it’ll be agonizingly slow. So you’ll probably read in batches. Now you have to mess with the logic around batch sizes and offsets. It probably won’t take more than an hour or so to write this script. Still, a simple select is something you can dash off about as fast as you can type it in. And even if you write the script generically enough that you now have a general purpose script for querying based on values in the JSON string, you still don’t have all the tools available in one place. If you’re working in the database console and suddenly realize you need to query some data in the JSON string to proceed, you have to stop, go to another window, and run the script. There’s extra friction if you need to feed that data into an SQL query at the database console to move on. There’s this gaping hole where you can’t use SQL to find data that’s stored in your SQL database.

Putting data in strings also kills the validation benefits that a statically typed data model gives you. If someone accidentally writes {"hugs_given": "fish"} to the database, the database doesn’t know that that’s not how you measure hugs. If someone accidentally writes {"hugs_given": {"30_day": 5, "30_day": 10} to the database, the database doesn’t know or care that you repeated a key.

How to Proceed?

So now we know both the strengths and weaknesses of storing JSON in string columns. It’s flexible, easy to change, easy to write, easy to understand, easy to work with in code. On the other hand, it’s meaningless to the SQL database (unless you’re fortunate enough to be working in Postgres). It’s just this string of characters that might as well be the complete text of the classic novel Lady Chatterley’s Lover by DH Lawrence. You can’t query it from SQL in any meaningful way. You can’t put indexes on it. (Well, you could put an index, but it would be massive and not very helpful.) You lose the data validation benefits that a relational schema gives you. What do we do?

It’s possible that you feel you can live with the limitations of JSON strings in database columns. Maybe you can. I have a feeling they’ll eventually get in your way, though. They certainly got in mine.

If you have the option of switching data stores, you could just use Postgres, or some JSON-oriented document store like MongoDB, although I have reservations about that in general. In Crayon Art Online, as in the real application that inspired it, metrics are calculated by a separate process and stored in a different database than the main application, so even if your main application is on MySQL, it might be doable to switch your metrics application to Postgres. (It wasn’t for me, for a variety of boring non-technical reasons.)

But suppose you’re convinced that storing an opaque string of JSON won’t work for you, and you don’t have the option of switching data stores. You’re left with trying to find a better data model to fit this into a relational schema. In this particular case, where the nesting isn’t that deep, it’s not actually that hard to come up with an acceptable solution that’s better than JSON string in a column.

To start with, I’d represent each individual calculated metric as a row with a user id, type tag (hugs_given, positivity_quotient, etc.), a date range tag (30_day, 90_day, etc.), a value, and a created_at date to record when we calculated it.

In my experience, rolling date ranges are a lot simpler than actual dates, so even though it might be tempting to get rid of the string keys for 30_day, 60_day etc. and come up with some way of storing the date range based on actual dates, it could also complicate things a lot. If you need to know the actual date range this covers, you can count backwards from the created_at. If you expect to need this a lot, it might make sense to store the date keys as integers that represent the number of days covered by this metric, so we’d have a column days_covered, and then we can find the start date of this period by subtracting that from the created_at.

The main hurdle with representing this in a schema is the fact that the value type can be an integer, real number, Boolean, string, or possibly other things. Making separate tables for each value type is tacky in my opinion, but you could go this route if you want. Then you could use some kind of wrapper at the data access layer to hide the four classes from your code and just present one consistent interface for them. An even more tacky variant would be to make your value column store a foreign key into another table, and make four value tables that hold your values. So you’d have the five tables metric, int_value, boolean_value, real_value, and string_value, and metric would have a type column that tells it which table to look for its values in, and a value_id column that points into one of the value tables. Not only does this have the same disadvantages as the four metrics tables, it’s also weird and complicated to understand. Another option: store the value, and only the value, as a string, and implement some kind of conversion to the correct type at the data access layer. You can still run queries on the numeric values by casting the string to a number, so you’re not as out of luck as you were with opaque JSON strings. None of these are great solutions. This was the original hurdle that led us to putting JSON in a text column, and I don’t know of any really good way out of it. But I do think there are better ways than JSON in text columns.

This is not a perfect, beautiful schema. It’s not a work of art. It’s not going to get written up in textbooks or research papers as amazing. But it is a real relational database schema which will work better for you in the long run than shoving a string of JSON into a column. (Or a string of CSV, or XML, or YAML, or any other kind of structured text format that isn’t supported natively by the database type system.) You can do queries on it, you can analyze it, you can put indexes on it, you can implement certain data consistency standards on it (e.g. a unique key on type, days_covered, and created_at that prevents you from accidentally inserting the same start for the same date range more than once). There are also opportunities to make things better as you discover problems. With text in a column, there’s not a lot else you can do with it when you discover problems.

This schema is also more flexible and manageable than separate tables for each metric. If you need to add a new stat, just start calculating it and add a new tag for it. If you want to support a new period of time, just make a tag for it and start storing it. No need for schema changes.

The Takeaway

If you came into this as an expert who loves relational databases and strongly typed schemas, I hope you can understand a little more the kind of problems that seem too hard to fit into the strictures of such a schema, driving beginners to want to put JSON in string columns.

If you came into this as someone who’s occasionally—or more than occasionally—thought the answer to all your problems was shoving JSON in a text column, I hope you now understand why that’s not really the answer to your problems. Or you learned that you need to switch to another data store that supports shoving JSON into text columns better, and now every company you work at, you’ll lobby to switch to Postgres or Mongo. Someday that might even lead you to a job as a MongoDB ambassador who gets to travel around to conferences telling everyone why shoving JSON in text columns is terrible, and they should instead shove JSON in MongoDB to solve all their problems. Just think; I could be starting careers here today.