Spark: The Definitive Guide Notes. Part I.

I’ve been wanting to get more involved with data engineering for a while, so I’ve started reading through Spark: The Definitive Guide. These are my notes on Part I, the introductory overview.

Part I. Gentle Overview of Big Data and Spark

Chapter 1. What is Apache Spark?

  • Spark unifies libraries and utilities for the various needs of working with large data, e.g. interactive data science, data engineering, and ETL.
  • Purely a computation engine; does not store data itself. Integrates with many data stores. This is a major difference from Hadoop, which had its own file system, HDFS, for storing data.
  • Operates on data in memory, moves computations to where the data already is instead of moving the data around.
  • Supports third-party libraries.
  • Big motivator for Spark’s computation model: CPUs stopped getting faster and focus moved to parallelism through multiple cores. However, data storage did keep getting cheaper and faster, so there was more and more data, and the only way to work with it all was across multiple machines in parallel.
  • The original team at UC Berkeley who worked on Spark founded a company called Databricks.

Chapter 2. A Gentle Introduction to Spark

  • Spark runs on a cluster, a group of machines. It manages and coordinates execution of tasks on data across the cluster.
  • The cluster can be managed by Spark’s built-in standalone cluster manager, YARN, or Mesos.
  • Spark applications have driver and executor processes.
    • The driver runs the main function from a single node, maintains information about the Spark application, responds to user code, and manages work on executors.
    • The executors just carry out work assigned to them by the driver and report results.
  • Spark driver programs can be written in many languages, which are translated internally into Spark calls for the executors to run:
    • Scala: the implementation language of Spark.
    • Java: through JVM interop, you can write Spark code in Java.
    • Python: PySpark supports nearly all of the Scala API. Runs in an external process and communicates to the JVM process somehow (not explained in the book).
    • SQL: A subset of ANSI SQL 2003.
    • R: through SparkR in Spark core, or the community-contributed sparklyr. Also runs externally and communicates to JVM (somehow).
  • Each language makes a Spark session available to run code.
  • Spark session: sends user commands to Spark. Each Spark application has one and only one Spark session.
  • Spark has two fundamental sets of APIs: lower-level “unstructured” (based around Resilient Distributed Datasets, AKA RDDs) and higher-level “structured” (based around datasets and dataframes).


Dataframes

  • Most common structured API.
  • Represents a table with rows and columns.
  • Has a schema, which can be declared or inferred.
  • One dataframe can span many machines. A section of a dataframe (i.e. a subset of the dataframe’s rows) living on a machine is called a partition.
  • Dataframe partitions are not manipulated individually. If you need to manipulate a partition individually you can use resilient distributed datasets.


Transformations

  • Spark’s core data structures are immutable, so transformations produce new dataframes.
  • Transformations are lazily evaluated when an action is executed. Until then, they’re just saved—this allows for optimizations, such as predicate pushdown, an optimization where a filter that would be executed later is executed earlier to reduce the amount of data that would need to be transformed.
  • Some transformations are narrow-dependency—each input partition contributes to a single output partition. This means the output data can stay on the same machine as the input partition.
    • E.g. a map that doubles each value; you can transform each individual row of the input partition into a corresponding row in the output partition without moving rows between partitions.
  • Others are wide-dependency—each input partition contributes to multiple output partitions. This is also called a shuffle, since partitions are broken apart and the pieces are exchanged across the nodes of the cluster.
    • E.g. a sort; if the input data is randomly ordered, you could end up needing to pull apart a partition and send its rows off to multiple other machines to get everything in the right order.
  • Narrow-dependency transformations can be done purely in memory. Wide-dependency transformations write their results to disk.
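A toy sketch of the distinction in plain Python (not the Spark API), where two lists stand in for partitions living on separate machines:

```python
# Toy model of partitioned data: each inner list is a partition on one machine.
partitions = [[3, 1, 4], [1, 5, 9]]

# Narrow dependency: a map touches each row in place; every output
# partition depends on exactly one input partition, so no data moves.
mapped = [[x * 2 for x in part] for part in partitions]

# Wide dependency: a global sort needs rows from every input partition,
# so the partitions are broken apart and re-dealt -- a shuffle.
def shuffle_sort(parts, n_out=2):
    all_rows = sorted(row for part in parts for row in part)
    # Range-partition the sorted rows across the output partitions.
    size = -(-len(all_rows) // n_out)  # ceiling division
    return [all_rows[i:i + size] for i in range(0, len(all_rows), size)]

print(mapped)                    # [[6, 2, 8], [2, 10, 18]]
print(shuffle_sort(partitions))  # [[1, 1, 3], [4, 5, 9]]
```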


Actions

  • Triggers computation of a result from the dataframe by running all the previously applied transformations.
  • Three kinds of action:
    • View data in a console
    • Collect data to a language’s native object types
    • Write data to output sinks
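The laziness can be sketched in plain Python (a hypothetical `LazyFrame` class, not the real Spark API): transformations only append to a recorded plan, and no work happens until the `collect` action walks the plan:

```python
# Minimal sketch of lazy evaluation: transformations record a plan;
# an action walks the plan and produces a result.
class LazyFrame:
    def __init__(self, rows, plan=()):
        self.rows, self.plan = rows, tuple(plan)

    # "Transformations" return a new immutable frame with a longer plan.
    def where(self, pred):
        return LazyFrame(self.rows, self.plan + (("filter", pred),))

    def select(self, fn):
        return LazyFrame(self.rows, self.plan + (("map", fn),))

    # "Action": only here is any work actually done.
    def collect(self):
        out = self.rows
        for kind, fn in self.plan:
            out = [r for r in out if fn(r)] if kind == "filter" else [fn(r) for r in out]
        return list(out)

df = LazyFrame(range(6)).where(lambda r: r % 2 == 0).select(lambda r: r * 10)
print(len(df.plan))  # two recorded steps, nothing computed yet
print(df.collect())  # [0, 20, 40]
```

A real optimizer would rewrite the plan before running it (e.g. pushing filters earlier), which is exactly what the recorded-plan representation makes possible.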

Spark UI

  • Runs on port 4040 of the driver node.

Chapter 3. A Tour of Spark’s Toolset

  • spark-submit lets you run production applications.


Datasets

  • Lets you bind rows of a dataframe to a class you define instead of a generic row class.
  • Datasets are typesafe; thus, they are not available in the dynamically typed Python and R, only in Java and Scala.

Structured Streaming

  • Can use batch mode operations on streams of input data.
  • The data is incrementally processed.
  • Use readStream to read from a data source as a stream.
  • Actions are slightly different; some actions don’t make sense (e.g. count; you can never get a final count because new data is constantly coming in). You still specify a write sink, but it might be an in-memory table or some other sink meant for constant incremental updates.

Machine Learning and Advanced Analytics

  • MLlib: built-in library of machine learning algorithms and utilities.
  • Aids in preprocessing, munging, training of models, and making predictions.
  • Every machine learning algorithm has untrained and trained variants (e.g. KMeans is untrained, KMeansModel is trained). You create the untrained version and then train it to get the trained variant.

Lower-Level APIs

  • RDDs (resilient distributed datasets) underlie basically all of Spark.
  • Dataframes are built on RDDs.
  • RDDs expose details like individual partitions of data. They are different between Scala and Python, unlike dataframes.
  • You pretty much never need to use them. There are very limited circumstances where you might want to.

Notes on Modern Operating Systems, Chapter 2, Part 3: Scheduling

This is the last part of Chapter 2 of Modern Operating Systems at long last. I never noticed when I was in college how incredibly long the chapters in these textbooks are; most of the chapters in Modern Operating Systems are around a hundred pages. It was less of an issue with Computer Networking because the chapter was like 75% fluff and examples that one could skim over for understanding, but Modern Operating Systems is very text-dense, so breaking the chapter down into parts so I could sustain a sense of momentum and avoid feeling like I was on an endless slog through the same chapter was of great psychological benefit.

2.5 Scheduling

  • Need to choose which process gets the CPU next when it’s free.

2.5.1 Intro to Scheduling

  • Scheduling is much less important on PCs because most of the time is spent waiting for user input, not on CPU.
  • It’s much more important on batch systems, mainframes, and servers.
  • Process switching is expensive so scheduling must also mind the efficiency of CPU use:
    • Must go into kernel mode
    • Then must save all state of the current process
    • And save the memory map (the memory reference bits in the page table)
    • Then select a new process
    • Then reload the MMU with the memory map of the new process
    • Start the new process
    • And also the entire cache will be invalidated, so you have to reload it.
  • Processes alternate bursts of computing with waits on IO (disk, network, user response). Processes with long bursts of CPU usage are called CPU-bound. Those with short bursts of CPU usage are called I/O-bound.
  • These terms are defined in terms of length of CPU burst, not length of I/O wait, because I/O waits don’t really depend on the process; they’re determined by external factors like the speed of the disk or how long it takes a user to respond.
  • As CPUs get faster, processes become more I/O bound.
  • When scheduling happens:
    • When a process starts
    • When a process exits and a new one needs to start
    • When a process blocks
    • When an I/O interrupt happens—can choose to let the formerly blocked process run now that its I/O is available.
  • Clock interrupts provide an opportunity for scheduling to occur. In preemptive systems, the running process will be suspended and another scheduled after some number of clock interrupts. In non-preemptive systems, clock interrupts are not allowed to interrupt the running process.
  • Batch systems will be non-preemptive; there’s no reason to not finish running a job all at once.
  • Interactive systems will be preemptive. E.g. they need to simulate running a word processor and a web browser at the same time, neither of which have a set exit point and both of which might need to accept user input.
  • Real time systems often don’t use preemption; processes are simply written to exit very quickly.
  • Goals of scheduling algorithms:
    • Fairness: make sure similar processes are treated similarly.
    • Keep CPU and I/O devices busy.
    • In batch systems:
      • Throughput: number of jobs per unit time.
      • Turnaround time: average time to complete jobs.
    • In interactive systems:
      • Response time: make sure the user gets a response quickly.
      • Proportionality: conform to users’ perception of how long something should take.
    • In real-time systems:
      • Hitting deadlines
      • Predictability

2.5.2 Scheduling in Batch Systems

  • I skipped this because I don’t care about batch systems.

2.5.3 Scheduling in Interactive Systems

  • Generally each process gets a quantum, i.e. a unit of time which it is allowed to run for before the clock interrupts it. This must be chosen carefully. Too short means lots of inefficient process switching. Too long means we waste time running processes that are waiting on I/O.
  • Round Robin: just cycle through all the processes.
  • Priority Scheduling: Each process gets a priority. Choose the highest priority process to run. Degrade priorities as processes run to avoid starvation of lower priority processes.
  • Shortest Process Next: Estimate how long each process will take using a weighted average of the time previous runs took. Choose the one predicted to be shortest.
  • Lottery Scheduling: Give each process lottery tickets. At scheduling time, choose a ticket at random and run the process holding that ticket. Good for allocating fixed percentages of the CPU time; to give a process 25% of the CPU, give it 25% of the available tickets.
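The lottery idea is easy to sketch (a minimal simulation, not production scheduler code):

```python
import random
from collections import Counter

# Sketch of lottery scheduling: each process holds some tickets, and each
# scheduling decision draws one ticket at random.
def lottery_pick(tickets, rng):
    # tickets: {process_name: ticket_count}
    pool = [name for name, n in tickets.items() for _ in range(n)]
    return rng.choice(pool)

tickets = {"A": 25, "B": 75}     # A should get roughly 25% of the CPU
rng = random.Random(42)          # seeded for reproducibility
wins = Counter(lottery_pick(tickets, rng) for _ in range(10_000))
print(wins["A"] / 10_000)        # close to 0.25
```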

2.5.4 Scheduling in Real-Time Systems

  • Real-time systems often have physical devices feeding them data which must be processed on a deadline. E.g. a CD player receives bits from the laser that reads the disc and must translate them into music quickly to produce the correct sound. An autopilot system must respond to the readings of gauges and meters and adjust the plane’s movement quickly enough to account for current conditions.
  • Hard real-time: Missing a deadline is a disaster.
  • Soft real-time: Occasionally missing a deadline is tolerable.
  • The scheduler must decide how to schedule processes to not miss deadlines. However, processes in real-time systems are usually written to exit very quickly.

2.5.5 Policy vs. Mechanism

  • Sometimes the user processes have information that can be used to make better scheduling decisions. The scheduling mechanism (the actual algorithm) can be passed this information to set the scheduling policy.

2.5.6 Thread Scheduling

  • Varies a lot between user and kernel threads.
  • User-level:
    • Scheduler picks a process, which can run whatever threads it wants whenever it wants for whatever reason it wants.
    • Can end up wasting time running blocked threads
    • Can use a custom thread scheduler and worry less about separating policy from mechanism.
  • Kernel-level:
    • The kernel just picks a thread and runs it.
    • Requires a full context switch to swap threads
    • Can choose the highest priority thread regardless of what process it belongs to. So if a single process has two threads that are both higher priority than anything else currently running, they can be run back to back instead of waiting for that process to get control of the CPU again.

N.B. Question 50 is #problematic.

Notes on Modern Operating Systems, Chapter 2, Part 2: Interprocess Communication

Notes on the second part of Chapter 2 of Modern Operating Systems, on interprocess communication. These sections cover the basic ideas of mutual exclusion and coordination of concurrent processes. There wasn’t a ton in this section I hadn’t seen before, but it was a good refresher and a good chance to nail down some of the terminology.

2.3 Interprocess Communication

Three issues to consider in interprocess communication:

  1. Passing information between processes
  2. Preventing concurrent processes from interfering with each other
  3. Ensuring proper sequencing when there are dependencies (e.g. a consumer can only read data after the producer has written it)

2) and 3) also apply to threads. 1) less so because threads can share memory.

2.3.1 Race Conditions

  • A race condition occurs when two processes are reading or writing shared data and the result depends on the order of execution.
  • The shared data can be in memory, a file, a directory, whatever.

2.3.2 Critical Regions

  • Need mutual exclusion to lock other processes out of shared data when it’s being used.
  • The part of the program that uses the shared data, requiring mutual exclusion, is called the critical region.
  • If you ensure that two processes are never executing the critical region at the same time, you can avoid race conditions.

Four conditions must hold for a good solution to this:

  1. No two processes are simultaneously in the critical region.
  2. No assumptions may be made about speed of processes or the number of CPUs.
  3. No process outside its critical region should be able to block other processes.
  4. No process should have to wait forever to enter its critical region.

2.3.3 Mutual Exclusion with Busy Waiting

In practice busy waiting is rarely useful, so I’m not going to write down any notes about it, but the book lists several bad solutions for mutual exclusion using busy waiting and goes over how they introduce race conditions and other badness.

A loop that runs doing nothing to implement busy waiting is called a spin lock.

The one useful thing the section covers is the TSL (test-and-set-lock) instruction, which is a CPU instruction that atomically sets and checks a lock variable. This instruction is useful in other, less terrible kinds of mutual exclusion.
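The enter/leave logic built on test-and-set can be sketched in Python. A real TSL is a single atomic CPU instruction; here a `threading.Lock` stands in for that atomicity (which is cheating), so this only illustrates the control flow:

```python
import threading

class SpinLock:
    def __init__(self):
        self._flag = 0
        self._atomic = threading.Lock()  # emulates the instruction's atomicity

    def _test_and_set(self):
        # Atomically read the old value and set the flag to 1.
        with self._atomic:
            old, self._flag = self._flag, 1
            return old

    def enter_region(self):
        while self._test_and_set() == 1:
            pass  # busy wait: this loop is the spin lock

    def leave_region(self):
        self._flag = 0

counter = 0
lock = SpinLock()

def work():
    global counter
    for _ in range(5_000):
        lock.enter_region()
        counter += 1           # critical region: a read-modify-write
        lock.leave_region()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 20000 -- no increments lost
```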

2.3.4 Sleep and Wakeup

Want to get rid of busy waiting.

Producer-Consumer Problem

  • We’ll use this problem as an example for the mutual exclusion techniques we’re about to discuss.
  • Two processes share a fixed-size buffer
  • One process produces things and puts them on the buffer.
  • The other consumes things from the buffer and does something with them.
  • Access to the shared buffer must be carefully managed.
  • Conceptually, we want the producer to sleep when the buffer is full and wake up and add things to it when there is space. Conversely, we want the consumer to sleep when the buffer is empty and wake up and take things from it when it is not empty.

2.3.5 Semaphores

  • One way to achieve sleep and wakeup is with semaphores.
  • A semaphore is an integer variable that supports two operations: up and down.
  • down decrements the semaphore. If the semaphore is zero, the process that called down blocks until the semaphore is no longer zero.
  • up increments the semaphore.
  • Semaphores can represent “tickets” or “passes” to do a specific operation.
  • You can use them like this to synchronize two processes, imposing a certain ordering on their actions.
  • For example, in the producer-consumer problem, you can use empty and full semaphores to track how many spaces are empty and full in the buffer.
  • Then you can force the consumer to do a down on full so that it blocks when there’s nothing on the buffer to take, and force the producer to do a down on empty so that it blocks when there’s no space for it to add to the buffer.
  • This assures that consumption only happens after production and production only proceeds to a certain point before some consumption happens.
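This scheme maps directly onto Python’s `threading.Semaphore` (`acquire` is down, `release` is up). A minimal sketch with one producer and one consumer:

```python
import threading
from collections import deque

# Producer-consumer with semaphores: `empty` counts free slots, `full`
# counts filled slots, and a mutex guards the buffer itself.
BUFFER_SIZE, N_ITEMS = 4, 20
buffer = deque()
empty = threading.Semaphore(BUFFER_SIZE)  # free slots
full = threading.Semaphore(0)             # filled slots
mutex = threading.Lock()
consumed = []

def producer():
    for item in range(N_ITEMS):
        empty.acquire()          # "down" on empty: block if buffer is full
        with mutex:
            buffer.append(item)
        full.release()           # "up" on full: wake a waiting consumer

def consumer():
    for _ in range(N_ITEMS):
        full.acquire()           # "down" on full: block if buffer is empty
        with mutex:
            consumed.append(buffer.popleft())
        empty.release()          # "up" on empty: a slot just freed

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()
print(consumed == list(range(N_ITEMS)))  # True: all items, in order
```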

2.3.6 Mutexes

  • A mutex is another way of using a semaphore. It only has the values 0 and 1.
  • If the value is 1, processes can enter the critical region. If 0, they block on trying to enter the critical region.
  • If your processor has a TSL instruction, you can implement mutex locking and unlocking in user space.
  • Importantly, this means you can implement mutexes so that a thread that fails to acquire the mutex yields the CPU instead of busy waiting.
  • Most modern OSes allow limited sharing of memory between processes, which allows mutex sharing.

2.3.7 Monitors

  • A collection of variables and functions that guarantee only a single thread at once can access them.
  • Monitors are a programming language-specific feature. The compiler handles mutual exclusion, often using a mutex or semaphore under the hood.
  • Monitors still need a way to block processes that can’t proceed anymore so that another process can start using the monitor. They do this with…

Condition Variables

  • Two operations: wait and signal.
  • If a process can’t continue, it waits on a condition variable and blocks.
  • Then another process can enter the monitor and wake up the sleeping process by doing a signal on the condition variable.
  • You can either immediately suspend the signaling process, resuming it later, and let the awakened process run; or you can require a signal to be the last thing in a function.
  • Without the monitor’s guarantee of only one process running the code at once, wait and signal are subject to race conditions where signals cross and wakeup signals disappear into the void just as a process is about to go to sleep, causing them to never wake up.

Java objects can also be used as monitors with the synchronized keyword and the wait and notifyAll methods. This combined with threads is the basis for all Java concurrency.
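Python’s `threading.Condition` plays the same role as wait/notifyAll on a synchronized Java object. A monitor-style bounded buffer sketch:

```python
import threading

# The Condition object bundles a lock (the monitor's mutual exclusion)
# with wait/notify operations (the condition variable).
class BoundedBuffer:
    def __init__(self, capacity):
        self.items, self.capacity = [], capacity
        self.cond = threading.Condition()  # one lock guards all the state

    def put(self, item):
        with self.cond:                    # enter the monitor
            while len(self.items) >= self.capacity:
                self.cond.wait()           # block until there is room
            self.items.append(item)
            self.cond.notify_all()         # wake anyone waiting to take

    def take(self):
        with self.cond:
            while not self.items:
                self.cond.wait()           # block until something arrives
            item = self.items.pop(0)
            self.cond.notify_all()         # wake anyone waiting to put
            return item

buf = BoundedBuffer(2)
results = []
consumer = threading.Thread(target=lambda: results.extend(buf.take() for _ in range(5)))
consumer.start()
for i in range(5):
    buf.put(i)                 # blocks whenever the 2-slot buffer is full
consumer.join()
print(results)  # [0, 1, 2, 3, 4]
```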

2.3.8 Message Passing

  • Semaphores and monitors require shared memory, so they can’t be used in distributed systems.
  • Message passing, on the other hand, is more general and works even across machines.
  • Uses two operations: send and receive. These are implemented as system calls.
  • send gets a process ID and a message.
  • receive also gets a process ID and in C implementations it gets passed an empty message to hydrate.
  • A receiver can block until a message arrives, or return immediately with an error code.

Message passing raises several design challenges.

  • Messages can be lost by the network. You can have receivers send back an acknowledgment signal, and have senders retry the message if they don’t get it after a certain timeout.
  • Local message passing is slower than semaphores or monitors.
  • You can implement message passing with a mailbox that buffers received messages, or you can have threads block and wait on each other. The latter is called a rendezvous.
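A minimal mailbox sketch in Python, using `queue.Queue` as the per-process buffer (here the “processes” are threads, and the mailbox names are made up for illustration):

```python
import queue
import threading

# Mailbox-style message passing: each process ID maps to a queue that
# buffers received messages; send and receive never share state directly.
mailboxes = {"worker": queue.Queue(), "main": queue.Queue()}

def send(dest, message):
    mailboxes[dest].put(message)

def receive(me):
    return mailboxes[me].get()   # blocks until a message arrives

def worker():
    while True:
        msg = receive("worker")
        if msg == "stop":
            break
        send("main", msg * 2)    # reply with a derived message

t = threading.Thread(target=worker)
t.start()
send("worker", 21)
reply = receive("main")
print(reply)  # 42
send("worker", "stop")
t.join()
```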

2.3.9 Barriers

  • Block a group of processes all at the same point in the program to coordinate their start on the next phase.
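Python’s `threading.Barrier` implements exactly this. A small sketch with three threads and two phases:

```python
import threading

# No thread proceeds to "phase 2" until all three have finished "phase 1".
barrier = threading.Barrier(3)
log = []
log_lock = threading.Lock()

def phased_worker(name):
    with log_lock:
        log.append(("phase1", name))
    barrier.wait()               # block here until all 3 threads arrive
    with log_lock:
        log.append(("phase2", name))

threads = [threading.Thread(target=phased_worker, args=(n,)) for n in "ABC"]
for t in threads: t.start()
for t in threads: t.join()

# Every phase-1 entry precedes every phase-2 entry.
print(all(p == "phase1" for p, _ in log[:3]))  # True
```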

2.4 Classical IPC Problems

These are all well-known and there’s a ton of information out there on them, so I didn’t take detailed notes about them.

2.4.1 Dining Philosophers

Models a situation where there is possible contention for a set of shared resources, which can lead to possible deadlocks or starvation.

2.4.2 Readers and Writers

A simplistic model of a database, where multiple concurrent processes can read but only one can write at a time.

2.4.3 Sleeping Barber

Models a situation where a bounded size queue of tasks is being serviced by a single process but filled by multiple processes.

Notes on Modern Operating Systems, Chapter 2, Part 1: Processes and Threads

These are my notes on Chapter 2, Sections 2.1 and 2.2 of Modern Operating Systems, Second Edition, by Andrew Tanenbaum. The chapter is quite long, and dense with text, so I’m splitting my notes into three parts. Part 1, this part, is the basics of processes and threads. Part 2 will be Sections 2.3 and 2.4 on interprocess communication and concurrency control. Part 3 will be Section 2.5 on scheduling.

As with Computer Networking: A Top-Down Approach, I bought an old edition of this book for cheap about six years ago and happened to have it laying around. The second edition is from the year 2000, so it’s now twenty years old and somewhat out of date. But operating systems haven’t changed as much in twenty years as the Internet has, so although some of the chapters talk as if single-core processors are still the status quo, and the Case Studies section in the back talks about Windows 2000, I’ve yet to stumble on anything that makes me suspicious.

The writing style is clear, but the sections aren’t always the best organized and sometimes make references to material that hasn’t been covered yet. Partly that’s just inherent to the way operating system concepts all connect into a whole, so although it does bother me a little, it was a reasonable choice. If you can find a newer edition for under fifty dollars it might be worth picking up. There are a couple sections towards the end of the book where Tanenbaum talks about his work on MINIX and his opinions on Linux and Linus Torvalds that read like aggrieved Usenet posts cleaned up for publication, and he also constantly dunks on other people’s research, writing in Chapter 3 that research on distributed deadlock detection is not “even remotely practical in real systems” and that its “main function seems to be keeping otherwise unemployed graph theorists off the streets”. I find it highly amusing, not least because Tanenbaum must have known that the undergraduate students who would mostly be reading this book didn’t care.

Section 2.1 Processes

  • Processes pretend to run sequentially, but behind the scenes they run discontinuously as the CPU swaps between them.
  • Each process maintains its own logical program counter which is loaded into the real program counter when the CPU picks it up to run.
  • Run order and run time of processes might be non-deterministic, so you cannot make any assumptions about when processes will run or how long they will run for.

2.1.1 Process Creation

Four events cause process creation:

  1. System initialization
  2. System call by a running process (fork on Unix).
  3. User request to create a process (e.g. running a terminal command, clicking an icon).
  4. Batch job initialization (on mainframes, submitted by users).
  • fork on Unix creates new processes.
    • The child process starts with the same memory, environment strings, and open files as its parent.
    • Then it runs execve or some similar system call to change the memory image and run a new program.
    • The child’s memory image is a copy of the parent’s into a new address space. Children do not share address spaces with their parents, so the parent and child cannot access each other’s memory.
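A quick Unix-only demonstration of the copy semantics using `os.fork` (this mirrors the book’s point, not its code):

```python
import os

# The child starts as a copy of the parent; writes in the child's copy
# of memory do not appear in the parent.
value = 1
pid = os.fork()
if pid == 0:
    # Child process: sees a copy of `value`; changing it changes only the copy.
    value = 999
    os._exit(7)          # child exits immediately with a status code
else:
    # Parent process: wait for the child and collect its exit status.
    _, status = os.waitpid(pid, 0)
    print(os.WEXITSTATUS(status))  # 7
    print(value)                   # still 1 in the parent
```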

2.1.3 Process Destruction

Processes can terminate for the following reasons:

  1. Normal exit (voluntary), triggered on Unix by the exit system call
  2. Error exit (voluntary), e.g. program cannot finish, such as compiler called on nonexistent file
  3. Fatal error (involuntary), e.g. divide by zero error, attempt to access protected memory (null pointer dereference in C)
  4. Murder by another process (involuntary), e.g. someone calls kill on it.

2.1.4 Process Hierarchies

  • Do not exist on Windows.
  • On Unix a process and its descendants form a group. Process groups share some signals and other things.
  • Since all processes on Unix are children of the init process, there is also a global process group.

2.1.5 Process States

Three basic states:

  1. Running (using the CPU)
  2. Ready (stopped, but could be picked up and run at any time)
  3. Blocked (unable to run until some event happens)
  • Running ⬌ Blocked can happen due to a system call made by the running process, or automatically if the process reads from a pipe or a special file like a terminal or socket when there’s no input available.
  • Running ⬌ Ready is managed by the process scheduler, which decides which process to run next.

2.1.6 Implementation of Processes

  • The operating system has a process table that stores the following data for each process:
    • Process state (running, ready, blocked)
    • Program counter (i.e. address of the next instruction to execute)
    • Stack pointer (i.e. address of the top of the runtime stack)
    • Memory allocation
    • Status of open files
    • Accounting and scheduling information
    • Anything else that needs to be saved between process switches

The process table enables interrupt handling (i.e. hardware interrupts, which “interrupt” the currently running process to deliver something from a hardware device):

  1. Hardware pushes the program counter, registers, state, etc. to the stack.
  2. Hardware loads a new program counter from the interrupt vector, a special location in memory assigned to each class of I/O device (e.g. hard disk, timer, terminal devices on mainframes).
  3. An assembly language procedure saves the current values of the registers in the process table.
  4. The assembly language procedure sets up a new stack.
  5. An interrupt service routine written in C runs. It reads input from the I/O device and buffers it into memory.
  6. Now that the input from the hardware has been pulled in, the scheduler runs and decides which process to run next.
  7. The C procedure returns control to assembly.
  8. Assembly starts the new current process; it loads up the registers, resets the program counter and stack pointer, etc.

2.2 Threads

2.2.1 The Thread Model

  • A process can have multiple threads of execution
  • Processes group resources together; threads are the thing that actually runs on the CPU.
  • Each thread has its own:
    • Program counter
    • Registers
    • Runtime stack with execution history
  • All the threads in a process share:
    • An address space (so they can access each other’s memory)
    • Global variables
    • Open files
    • Child processes
    • Pending alarms
    • Signals and signal handlers
  • All threads in a process share memory so they can read and write each other’s variables and even stacks. This is part of the benefit of threads; it allows more efficient cooperation between concurrent entities (compared to processes, which can’t read each other’s memory and thus must use other means of communicating).
  • Threads have four states:
    1. Running
    2. Blocked
    3. Ready
    4. Terminated
  • Processes start with one thread and create more using library calls.
  • Threads can exit by calling another library procedure, e.g. thread_exit.
  • In some systems threads can wait for other threads to exit by calling thread_wait.
  • thread_yield gives up the CPU for another thread. The CPU won’t actually stop a thread until a process switch occurs, so threads in the same process have to yield to each other.
  • Threads introduce many complications, including synchronizing their shared memory and coordinating the yields so you don’t waste time running blocked threads.
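A small sketch of the shared-memory point using Python threads (the lock is there because, as noted above, that shared memory needs synchronizing):

```python
import threading

# Threads in one process share globals, so worker threads can all mutate
# state that the main thread later reads.
shared = {"hits": 0}
lock = threading.Lock()

def worker(n):
    for _ in range(n):
        with lock:                 # without this, increments can be lost
            shared["hits"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # the thread_wait analogue
print(shared["hits"])  # 4000
```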

2.2.2 Thread Usage

  • Threads let a single application do multiple things, modeled as sequential jobs running side by side instead of as a mess of interrupts and signals.
  • On multiprocessor systems or multi-core CPUs, multiple threads can actually run simultaneously. On single-core systems threads will switch on and off the one core.
  • Threads can start up faster than processes since they require fewer resources to get going.
  • Threads can block and cooperatively switch off since they can signal each other using their shared memory.

2.2.3 Threads in User Space

  • Processes manage their own threads in user space.
  • The kernel knows nothing of threads.
  • Each process has a thread table with the threads’ program counters, stack pointers, registers, etc.
  • The threads run on a run-time system, a library of thread-managing procedures


Advantages

  • Thread switching is very fast since applications don’t need to trap to the kernel to switch.
  • Applications can implement custom thread scheduling.
  • User-space threads scale better since they don’t require space in the kernel.


Drawbacks

  • If a user-space thread makes a blocking system call, it will block without giving you a chance to switch threads before handing off control to the kernel, because the kernel knows nothing of threads and cannot give you that chance. This means the whole process is blocked.
  • If a thread causes a page fault, e.g. because it needs to run some code that isn’t loaded into memory yet, it gets blocked, again without allowing any chance to switch threads.

Possible solutions to the drawbacks (that all suck):

  • The select system call can detect when a system call will block. You can use it before making a possibly blocking system call, and if it detects that the system call will block, you can switch to another thread and make the system call later. E.g. if the current thread needs to read some data, but the read call will block on a disk read, you can switch threads until the interrupt executes later and the data is buffered in memory, and then switch back to that thread.
  • Using select sucks because you have to wrap all your blocking system calls to do the select check, so you have to modify the kernel, and it’s also annoying and tedious logic to implement.
  • You can also use a clock interrupt to stop the currently running thread every so often and check if it’s been blocked and should be swapped for another thread.
  • Using a clock interrupt sucks because it’s an ugly hack (you might still let the blocked thread have the CPU for a while before the clock interrupt stops it), and apparently each process only gets one clock so if you use it to manage thread swapping you can’t use it for anything else.
  • System calls can all be rewritten to be non-blocking. E.g. rather than waiting for input to be available, a read call would just return 0 (for 0 bytes read) and you could try again later.
  • Rewriting system calls sucks because you have to rewrite a bunch of system calls around threads and that’s not cool.
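A sketch of the `select` check in Python, using a pipe: a zero-timeout `select.select` reports whether a read would block (Unix-flavored; `select` on pipes doesn’t work on Windows):

```python
import os
import select

# Before a read that might block, ask select (with a zero timeout)
# whether the descriptor is readable.
r, w = os.pipe()

ready_before, _, _ = select.select([r], [], [], 0)
# Nothing written yet: ready_before is empty, so a read would block
# and we would switch to another thread instead.

os.write(w, b"hello")
ready_after, _, _ = select.select([r], [], [], 0)
# Data is buffered now: ready_after contains r, so reading is safe.
data = os.read(r, 5)
os.close(r)
os.close(w)

print(ready_before)  # []
print(data)          # b'hello'
```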

2.2.4 Threads in the Kernel

  • You can instead put threads in the kernel. The kernel then manages the thread table and thread swapping instead of the user program.
  • All blocking calls are system calls and when a thread blocks, the kernel can choose another thread to run. If there is no suitable thread in the same process, the kernel can choose to let a different process run for a while.
  • Creating, destroying, waiting, etc. are all more expensive since they are system calls.

2.2.5 Hybrid Implementations

  • The idea is to combine the lightweight nature of user-space threads with the ease of swapping out blocked threads that comes with kernel threads.

2.2.6 Scheduler Activations

  • The kernel assigns each process some virtual CPUs (or they can request and release them). [It’s not actually clear to me from the text how the virtual CPUs play into this process.]
  • The kernel can detect when a thread blocks and notify the process through a mechanism called upcalls.
  • Upcalls work by activating the process’s runtime system at a known starting address, similar to a Unix signal. Once activated, the runtime system can choose to run another thread instead of the blocked one.
  • When a hardware interrupt runs, the process will be resumed and the runtime system can decide which thread to run. It can run the thread that was interrupted, or, if the hardware interrupt completed something that a blocked thread cares about (like reading in some data and buffering it in memory), it can choose to run the blocked thread, or it can just choose to run a completely different thread.
  • This system isn’t great because you now have two-way calls between the kernel and user space, whereas a normal operating system has a nice layered model where user space calls the kernel but not the reverse.

To be continued next time with interprocess communication. And then…in the epic final part of Chapter 2…SCHEDULING!!!

Notes on Computer Networking: A Top-Down Approach Chapter 2

These are my notes on Computer Networking: A Top-Down Approach (4th Edition) Chapter 2, Application Layer.

I don’t particularly recommend this book. It came out in 2008, and a lot’s happened to the internet since then. It’s not that well organized; you’ll see if you read the notes that they take some bizarre leaps from topic to topic. And it tends to be extremely wordy; these notes condense almost 100 pages of information. Because of that, you’ll notice that I didn’t take notes on every section. But I happened to have this book lying around, so I read it. I’m putting up the notes for my own future reference and in case someone else someday finds them useful.

2.1.1 Application Architectures

  • Client-server: one process (client) initiates communication, another (server) responds.
  • P2P: No central server, any process can act as either a client or a server.
    (Note: “process” refers to an operating system process, i.e. a running program.)
  • Sockets are the interface between the transport layer (of the OSI layer model) and the application layer. A process listens to a socket for messages and can put messages into the socket to send them across the network to the other end.
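Python’s socket module can sketch this relationship without touching a network: socketpair gives two already-connected sockets standing in for the two ends of a connection (a minimal sketch; real applications would connect over TCP or UDP instead).

```python
import socket

# A connected pair of sockets, standing in for the two ends of a
# network connection. Each process reads from and writes to its end.
client, server = socket.socketpair()

client.sendall(b"hello")   # "put messages into the socket"
data = server.recv(1024)   # "listen to a socket for messages"
print(data)                # b'hello'
```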

2.1.3 Transport Services Available to Applications

  • Messages pushed into a socket are sent across the network using a transport layer protocol.
  • There are four services transport layer protocols could theoretically provide.
    • Reliable data transfer
    • Throughput guarantees (e.g. guaranteed delivery of bits at 50 kbps)
    • Timing (e.g. guaranteed receipt of messages within 100ms)
    • Security

2.1.4 Transport Services on the Internet

  • TCP
    • Uses handshaking to establish a connection between client and server
    • Guarantees lossless data transfer
    • Implements throttling to reduce congestion over the network; this means it provides no timing or throughput guarantees.
  • UDP
    • Does not use handshaking or establish connections; packets sent over UDP might just disappear and never reach their destination.
    • Does not throttle.
    • Theoretically, UDP would be good for loss-tolerant time-bound applications, since the lack of throttling often means more throughput and faster delivery. In practice, UDP is often blocked by firewalls, so it sees less use than it otherwise would.
  • Neither protocol provides throughput guarantees, timing guarantees, or security, and UDP does not provide reliable data transfer either, which makes it puzzling that we took the time to go over those four services above. Applications built on these protocols must cope with the lack of these services.
  • SSL is an application-layer add-on to TCP that provides encryption. Other application-layer protocols send their messages through SSL, which encrypts them and writes them into the socket; TCP carries them to the other end, where SSL reads the encrypted message from the socket and decrypts it before passing it up to the application-layer protocol on that side.
  • The transport layer uses IP addresses and port numbers to find the correct destination machine and socket on that machine.

2.2 The Web and HTTP

  • HTTP is defined by RFCs 1945 and 2616.
  • An HTTP server stores no information about the clients, so HTTP is a stateless protocol.
  • HTTP uses persistent connections by default, i.e. after opening a TCP connection, it will keep that connection open and reuse it for a period of time for subsequent interactions. This increases performance; establishing a connection requires an extra network round trip to go through the handshaking process, and a single web page might require several requests to fully load (to get the HTML, Javascript, CSS, images, etc.), so reusing a connection reduces that overhead.

HTTP request message format:
– <HTTP method, e.g. GET, POST> <URL> <version, e.g. HTTP/1.1>
– <header name>: <header value>
– <More headers>
– Empty line
– <Entity body, which can contain any data>

HTTP response message format:
– <version> <status code, e.g. 400> <explanation of status code, e.g. “Bad Request”>
– <header name>: <header value>
– <More headers>
– Empty line
– <Entity body>
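The two formats above can be sketched as plain strings (the hostname, paths, and body here are invented for illustration):

```python
# Hypothetical request to a host named; any HTTP/1.1
# server would accept a message shaped like this.
request = (
    "GET /index.html HTTP/1.1\r\n"   # method, URL, version
    "Host:\r\n"           # headers, one per line
    "Connection: keep-alive\r\n"
    "\r\n"                           # empty line ends the headers
)                                    # (a GET request has no entity body)

response = (
    "HTTP/1.1 200 OK\r\n"            # version, status code, explanation
    "Content-Type: text/html\r\n"
    "\r\n"                           # empty line
    "<html>...</html>"               # entity body
)

status_line = response.split("\r\n")[0]
print(status_line)  # HTTP/1.1 200 OK
```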

2.2.4 Cookies

  • Defined in RFC 2965
  • Response header Set-cookie tells client to store something, such as a user id, in a cookie file, keyed by the host name.
  • Request header Cookie contains the value set by the host to which the request was made.
  • Cookies allow servers to maintain information about clients even over the stateless HTTP protocol.
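Python’s http.cookies module can sketch both sides of this exchange (the user id value is an invented example):

```python
from http.cookies import SimpleCookie

# Server side: a Set-Cookie header asking the client to remember a user id.
cookie = SimpleCookie()
cookie["user_id"] = "1678"
set_cookie_header = cookie.output()
print(set_cookie_header)  # Set-Cookie: user_id=1678

# Client side: parse the stored value and echo it back on later requests.
parsed = SimpleCookie()
parsed.load("user_id=1678")
print(parsed["user_id"].value)  # 1678
```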

2.2.5 Web Caching

  • Cache make go fast
  • The cache can send requests with an If-Modified-Since header; if the object being requested has not been updated, the server will return status 304 Not Modified, telling the cache it can return the version it has.
  • The book doesn’t mention this, but ETags are another mechanism used for cache validation.
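A toy sketch of the server side of this check, using Python’s standard email date parser (the function name and the dates are invented):

```python
from email.utils import parsedate_to_datetime

def respond(if_modified_since: str, last_modified: str) -> int:
    """Toy version of a server's conditional-GET check: return 304 if the
    object hasn't changed since the date the cache sent, else 200."""
    if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since):
        return 304
    return 200

print(respond("Wed, 09 Apr 2008 22:00:00 GMT", "Tue, 08 Apr 2008 10:00:00 GMT"))  # 304
print(respond("Wed, 09 Apr 2008 22:00:00 GMT", "Thu, 10 Apr 2008 10:00:00 GMT"))  # 200
```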

2.3 FTP

  • FTP allows navigating a remote filesystem and transferring files to and from that filesystem.
  • FTP uses two TCP connections
    • A control connection on port 21 sends user id, password, and commands
    • A data connection on port 20 sends actual files.
  • Using the separate connection for control information is called sending control information out of band.
  • FTP is not stateless; it stores username, password, and current directory for clients.
  • Control connections are persistent; they are kept open for a period of time and reused.
  • Data connections are not persistent; they are opened by the server in response to user commands, used to send or receive a single file, and closed.

2.4 Email

  • Uses two types of protocols: mail send (SMTP) and mail read (POP3, IMAP).

2.4.1 SMTP

  • Uses TCP’s reliable data transfer.
  • Has both a client and server mode. Message senders are clients, message receivers are servers.
  • All text in an SMTP message must be 7-bit ASCII because the protocol is extremely old.
  • Uses port 25.
  • Handshakes to establish a connection.
  • Uses a single TCP connection to send all outgoing messages it has.
  • Is a push protocol—clients mainly send data to servers. By contrast, HTTP is mainly a pull protocol; clients typically receive data from servers.
  • RFC 822 defines SMTP and the headers for sending ASCII text.
  • MIME (Multipurpose Internet Mail Extensions) defines headers for sending non-ASCII text.
    • Defined in RFCs 2045 and 2046.
    • Uses the Content-type and Content-Transfer-Encoding headers.
    • Content-type tells the receiver what the data actually is, e.g. image/jpeg, audio/vorbis, text/csv.
    • Content-Transfer-Encoding tells the receiver how the data was converted to 7-bit ASCII text, e.g. base64.
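For example, base64 is how arbitrary bytes get squeezed into 7-bit ASCII (the bytes here are an invented stand-in for the start of a JPEG):

```python
import base64

# A few bytes that aren't valid 7-bit ASCII.
binary = bytes([0xFF, 0xD8, 0xFF, 0xE0])

# Content-Transfer-Encoding: base64 turns them into 7-bit-safe text...
encoded = base64.b64encode(binary)
print(encoded)  # b'/9j/4A=='

# ...and the receiver reverses it to recover the original bytes.
assert base64.b64decode(encoded) == binary
```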

2.4.4 Mail-Access Protocols

  • Since SMTP is push-based, it can’t really be used to retrieve mail by non-server clients like desktop PCs.
  • POP3 and IMAP are pull-based protocols for email clients to retrieve mail.
  • POP3 is very simple; it cannot manage email directories on a remote server. It has “read and delete” mode, where it downloads mail and deletes it from the remote server, as well as “read and keep”, where it leaves the mail on the server.
  • IMAP is much more complex.
    • It can manage mail directories.
    • It can obtain parts of messages, which can be useful when bandwidth is low; you can avoid downloading large video and audio files.
  • Web mail clients like Gmail just use HTTP to retrieve messages. POP3 and IMAP are useful for desktop clients like Thunderbird and Outlook, or for terminal clients like Mutt. Gmail and other web-based email services sometimes provide POP3 or IMAP access so you can use clients like Thunderbird and Mutt to read your mail.

2.5 DNS

  • DNS (Domain Name System) translates human-readable domain names to numeric IP addresses.
  • The term “DNS” actually refers to two things:
    • A distributed database implemented in a hierarchy of DNS servers.
    • An application-level protocol for querying this database.
  • DNS servers often run the Berkeley Internet Name Domain (BIND) software.
  • DNS runs on UDP and uses port 53.
  • Clients implementing other protocols (e.g. HTTP, FTP, SMTP) will include a step where they use DNS to translate user-supplied hostnames into IP addresses.
  • DNS can add a delay to requests, but the delay is minimized with caching in nearby servers.
  • DNS also provides:
    • Host aliasing: can translate a short alias into a longer canonical hostname
    • Mail server aliasing: can give a web server and mail server the same alias but different canonical hostnames
    • Load distribution (proceed with caution; I have doubts that this is still a good idea, and I asked a question on SE about it): DNS can map a hostname to a set of IP addresses. You can map a hostname to a set of replica servers all running the same application. The DNS server will (according to the book) rotate the order of the IP addresses it returns, and clients will (according to the book) always take the first one, so requests will be distributed across your servers.
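A sketch of the rotation behavior the book describes (the replica IPs are invented):

```python
from collections import deque

# Hypothetical replica IPs all serving the same application.
ips = deque(["", "", ""])

def dns_reply():
    """Return the address list, then rotate it so the next query
    sees a different IP first (the book's round-robin behavior)."""
    reply = list(ips)
    ips.rotate(-1)
    return reply

print(dns_reply()[0])  # -> first client
print(dns_reply()[0])  # -> next client
```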

2.5.2 How DNS?

DNS uses four types of servers:
– Root DNS servers. There are 13 of these in the world, controlled by various institutions and entities. They are the first stop in a DNS lookup.
– TLD DNS servers. These each service a top-level domain (.com, .edu, .gov, .jp, .fr, .tv, etc.). Different companies own the different TLD servers.
– Authoritative DNS servers. The owner of a domain controls the authoritative DNS server for the domain.
– Local DNS servers. These act as brokers / caching centers for a local area.

A basic example of a DNS interaction.
1. A client requests an internet connection from an ISP. The ISP provides the address of one of its local DNS servers using the DHCP protocol. The local DNS server will be within a few routers of the client for speed.
2. The client will send requests for hostname lookups to its local DNS server. The local DNS server will forward them to a root DNS server.
3. The root DNS server will send back the identity of a TLD server that can handle the request. E.g. if the hostname is, the root DNS server will send back a TLD server that can handle .com.
4. The local DNS server contacts the provided TLD server. The TLD server looks at the entire domain and returns an authoritative DNS server for that domain. E.g. if the hostname is, the TLD server for .com will send back an authoritative DNS server for
5. The local DNS server contacts the provided authoritative DNS server. The authoritative DNS server will provide the IP address of a machine mapped to the hostname, possibly after communicating with other authoritative DNS servers to resolve subdomains. E.g. the authoritative DNS server for might contact another authoritative DNS server to resolve to an IP address.
6. The local DNS server returns the IP address to the client. The client uses that IP address to send a request.

The local DNS server can cache a lot of the information it receives for later reuse. This allows it to skip a lot of the steps. For example, if the local DNS server caches the IP of, it can immediately send that IP address back to any client who requests it. It can also cache the identity of the authoritative DNS server for, so if a client requests any subdomain of, it can go straight to that authoritative DNS server instead of going through the root and TLD servers again.

DNS cache is busted after a period of time, typically 24–48 hours.
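A minimal sketch of TTL-based cache busting (the hostname, IP, and deliberately tiny TTL are invented for the demo):

```python
import time

cache = {}  # hostname -> (ip, expiry timestamp)

def put(hostname, ip, ttl_seconds):
    cache[hostname] = (ip, time.time() + ttl_seconds)

def get(hostname):
    """Return a cached IP, or None if missing or past its TTL."""
    entry = cache.get(hostname)
    if entry is None or time.time() >= entry[1]:
        cache.pop(hostname, None)  # bust the stale record
        return None
    return entry[0]

put("", "", ttl_seconds=0.05)
print(get(""))  #
time.sleep(0.1)
print(get(""))  # None
```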

2.5.3 DNS Messages and Resource Records

  • A resource record is a four-tuple: (Name, Value, Type, TTL)
  • The TTL is the time before the record should be evicted from cache.

Four types:
A: Name is host, Value is host’s IP address. Nowadays we also have AAAA for IPv6 addresses.
NS: Name is domain (e.g., Value is hostname of authoritative DNS server for that domain, e.g.
CNAME: Name is an alias, Value is the canonical hostname.
MX: Name is an alias, Value is the canonical hostname of a mail server. This exists so a mail server and another kind of server (such as a web server) can have the same alias but different canonical hostnames, so you can have point to both the web server and the mail server Web clients will request the CNAME record for and get back the value, while mail clients will request the MX record and get back the value

Note 1: There are actually an assload of DNS record types.
Note 2: On macOS and Linux (probably also through WSL on Windows) you can use the terminal command dig to do DNS lookups.
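As a sketch, the four-tuple maps naturally onto a namedtuple (the hostnames and TTL values here are invented examples):

```python
from collections import namedtuple

# The four-tuple from the text; TTL in seconds.
ResourceRecord = namedtuple("ResourceRecord", ["name", "value", "type", "ttl"])

records = [
    ResourceRecord("", "", "A", 3600),
    ResourceRecord("", "", "NS", 86400),
    ResourceRecord("", "", "CNAME", 3600),
    ResourceRecord("", "", "MX", 3600),
]

a_records = [r for r in records if r.type == "A"]
print(a_records[0].value)  #
```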

Format of a DNS message:

– 16-bit ID | 12 bits of flags
– number of questions | number of answers
– number of authority records | number of additional records
  • The 16-bit ID links requests and responses together.
  • The flags indicate such things as whether the message is a query or reply, if the queried server is authoritative for the queried name, or whether to allow recursive queries (a DNS server receiving a query and querying another DNS server to fulfill it).
  • The numbers indicate the number of DNS records returned in the later sections.
  • Questions are queries with host names and types (A for host address, MX for mail server canonical name, etc.)
  • Answers are only in replies and are the answers for the queries (IP addresses for A queries, canonical names for CNAME and MX queries). You might get multiple answers when a hostname has multiple IPs.
  • Authority is the records of other authoritative servers.
  • Additional depends on the query but will contain other useful information. E.g. for an MX query, the answers section will contain the canonical hostname of the mail server, and the additional section will contain the A records mapping the canonical name to an IP address so you don’t have to query again for them. Similarly, for an NS query, the answers section will have the canonical names of the authoritative DNS servers, and the additional section will have the A records for the IP addresses. You can mess around with dig to see these.
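The six 16-bit header fields in the format above can be packed with Python’s struct module (the ID and flag values are arbitrary examples):

```python
import struct

# The six 16-bit fields of a DNS header, packed big-endian ("network order").
header = struct.pack(
    ">HHHHHH",
    0x1A2B,  # ID linking request and response
    0x0100,  # flags (here: a standard query with recursion desired)
    1,       # number of questions
    0,       # number of answers
    0,       # number of authority records
    0,       # number of additional records
)
print(len(header))  # 12
```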

Records are inserted into DNS by paying a registrar (registrars are accredited by ICANN) to get your hostname mapped to your authoritative DNS server in a TLD server for your top-level domain.

There were two more sections, one about P2P protocols and BitTorrent, another about implementing custom application-layer protocols in Java by using the various socket classes. I read the P2P section but chose not to take extensive notes on it. I skimmed the implementation section.

Aphorisms from The Pragmatic Programmer(‘s first half)

I can’t remember the last time I read more than halfway through a tech book. Tech books tend to peak early: you get more than half of the value from the first half of the book. A lot of times the second half is specialized or optional topics, and a lot of times those topics are out of date unless you buy the book the second it’s off the presses. Of course, I’m also lazy, and about halfway through a tech book I’m usually starting to lose interest and decide I’m okay with building CRUD web apps for the rest of my life if it means I don’t have to read any more of this.

I still haven’t made it all the way through The Pragmatic Programmer either, but I’ve made it further than I usually do—about two-thirds of the way through. And I actually want to read the rest of it. I don’t always like its metaphors or cutesy terminology like “binary chop”, but it’s memorable and easy to read.

The Pragmatic Programmer is structured as a list of 100 tips. Some of them cover code structure and design (“Tip 47: Avoid global data”). Some cover development practices (“Tip 28: Always use version control”). And some relate to personal development (“Tip 11: English is just another programming language”), philosophy (“Tip 3: You have agency”), and ethics (“Tip 99: Don’t enable scumbags”). Each tip cross-references related tips. The end result feels more like reading a bunch of blog posts than a single coherent book. The tips are grouped broadly by topic, but a single tip can go from explaining a technical concept to coding advice to rats chewing through cables. As “pragmatic” suggests, the book imposes no hard lines on its topics and the commentary on the tips goes wherever it needs to go.

In the rest of this post, I’ll go through some of the tips and my idiotic hot takes on them.

Tips 1–4

I tried to read The Pragmatic Programmer for the first time about six years ago. I looked at the Kindle sample, which only had Tips 1–4. I thought, “Wow, these guys are smug as hell”, deleted the sample, and didn’t come back to the book for another six years.

Tips 1–4 are basically the preamble that some college professors put in their syllabi where they lecture you about how you’re an adult now, you need to take responsibility for your own actions, only you benefit from attending this course and finishing your assignments, so on, yada yada, etc. If you’re a bit of a beaver-cleaver, you’ll read these four tips, pat yourself on the back for being so responsible and never making lame excuses, and a deep contentment will warm you from the heart outward for the rest of the day. Otherwise you can just skip to Tip 5. Go on, be a rebel. I won’t tell.

Tip 5: Don’t Live with Broken Windows

When it comes to software rot (or “technical debt”), there is definitely some truth to this idea. If developers feel the code is badly designed and full of hacks, they will contemptuously pile more hacks on top. I’ve seen it happen firsthand. We spent seven months talking about rewriting the entire codebase because the whole thing was such a cancerous dumpster fire that it could never be improved. Then most of the team left because the code was such a cancerous dumpster fire, and about a year after that, management finally gave the go-ahead to rewrite the code.

On the other hand, The Pragmatic Programmer pushes an absolute zero tolerance policy towards broken windows, which has not been realistic in my experience. All the codebases I’ve worked on were written as quickly as possible by people who are long gone. They’re usually bad in various ways. Sometimes in ways that are easy to fix—bad formatting can be fixed by automated tools, and fancy IDEs like IntelliJ IDEA can stick a big glowing flag on certain kinds of anti-patterns and code smells. Sometimes you can go back and patch up the rot and make sure to do better going forward. But sometimes the broken windows are so fundamental or widespread that you can’t fix them without tearing down huge swathes of code and rebuilding deep-seated parts of it. And sometimes you simply aren’t allowed the time to clean up bad code, because your team is tiny and you can’t afford to put a whole developer on a cleanup job for a month unless customers are complaining about it.

However, sometimes, even if you can’t fix the broken windows, you can at least quarantine them so most of your daily work doesn’t touch that code. One codebase I worked on had an absolutely hideous subsystem where a developer wrote all the code in a single god class of 500-line methods that mutated local variables and then returned them to be passed as arguments into other methods that also mutated them. It was horrid, and so brittle that with the original developer gone there was no chance of fixing it without breaking something, but on the bright side, it was all in a single god class, so we could just never look at that class until we had to.

Tip 9: Invest Regularly in Your Knowledge Portfolio

Steve Yegge points out in several of his blog rants that the half-life of technical knowledge can vary a lot. So while it might be tempting to study the hip new language or framework that everyone on Hacker News is talking about, it’s a good idea to invest in long-term assets as well.

Algorithms, data structures, math, and statistics rarely, if ever, go out of date. Unix tools are practical, have been around forever, and show no sign of going away, so getting familiar with grep and sed can be a great investment. It’ll pay off when you have some weird bug that only happens on the third Thursday of every month and you can bust out a grep command to find all those logs in your log files without even looking at the man pages. find, tr, cut, awk, curl, nc, jq, and screen are also great commands to look into, as well as the shell language itself. Recently I finished a script in four hours that I thought was going to take two days, because I realized I could replace a bunch of API calls and ugly JSON structure rearrangement that I was going to write out in Python with a short bash script using curl and jq.

Editors like Vim and Emacs have also been around forever, and learning some of the more powerful commands can save you a ton of time. For a while, I had to debug data pipelines, which usually meant receiving a CSV of data that someone wasn’t satisfied with, taking the UUIDs out of it, and querying a bunch of tables to find out where in the pipeline data got dropped or corrupted. cut and Emacs saved me a ton of time; I could use cut to extract the UUIDs from the CSV, then paste them into Emacs and use replace-regexp to quote them and reformat them into a comma-separated list that I could paste straight into an IN clause of an SQL query.

Which brings me to SQL—it’s been around since the 70’s, and unlike disco, it doesn’t seem to be going anywhere, so you could do worse than to study it. So many of the weird bugs ORMs cause only become clear once you understand the SQL they must be generating.
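As a sketch, the same extract-and-reformat dance can be done in a few lines of Python (the CSV contents and the shortened stand-in “UUIDs” are made up):

```python
import csv, io

# A stand-in for the CSV someone sends you, with a uuid column.
report = io.StringIO(
    "uuid,amount\n"
    "9fd2,100\n"
    "8ac1,250\n"
)

# Extract the uuid column, quote the values, and build an SQL IN clause.
uuids = [row["uuid"] for row in csv.DictReader(report)]
in_clause = "IN (" + ", ".join(f"'{u}'" for u in uuids) + ")"
print(in_clause)  # IN ('9fd2', '8ac1')
```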

The Pragmatic Programmer also suggests investing time in some of the weirder languages to have a bigger stock of ideas to draw on. I think this idea bears repeating. I see too many people online who know Python, Ruby, and Javascript and want to have a language war over which of those three is better. I also see people who know Python, Ruby, and Javascript complain that Java is ancient, inscrutable gibberish that somehow is also the most widely used language in the world. I was this person. When I first started learning to program, I wrote a huge rant (that I thankfully never published) comparing Visual Basic, Java, and Python. My arguments were that Visual Basic was better than Java because you have to write End If, so you know it ends an If, whereas in Java it’s always just a brace, but Python was the best because whitespace. I was stupid. What I was doing was equivalent to reviewing three books which all cover the same material in the same order and only comparing what fonts the text was printed in. I definitely support spending time with Icon, Prolog, OCaml, Io, or Idris, and learning just how different a language can be, before discussing differences between languages.

Tip 10: Critically Analyze What You Read and Hear

I majored in computer science, but before that, I was an English major, and before that, I spent ten years wanting to be an English major. One of the most important things you learn studying humanities is exactly what this tip says—how to critically analyze what you read and hear.

When you’re evaluating a new technology, be critical about the claims. Don’t just accept flashy benchmarks; probe the methodology behind those flashy benchmarks. Think about the context behind claims: maybe a group at Amazon is building great things with an experimental JIT compiler for Lua, but does that mean your team will be able to do the same? Maybe the software actually isn’t business critical and only has five users. Maybe the creator of the experimental JIT compiler is on the team and can fix any bugs they run into. Maybe Amazon, being Amazon, can make it worth the creator’s while to solve their problems.

Being critical doesn’t mean being constantly and unreservedly cynical. You can be positive about things. You can love Rust if you want. But you should love Rust because you critically analyzed the claims made about it and decided they were reasonably accurate, and you should always be on the lookout for new information that contradicts your current point of view. When you find that new information, critically analyze it and decide if it might be true or not. Then decide if you need to care. Even if your favorite language chokes on 5 HTTP requests per second, you can still enjoy hacking together a hobby project with it. But be honest; don’t convince your company to build something in your favorite language when there’s a strong chance the system will need to handle 1,000 HTTP requests per second.

Tip 14: Good Design is Easier to Change Than Bad Design

This is true, but no design is infinitely flexible. You always have to make some choices that will lock you in. I’ve seen (and written) a ton of code that was easy to extend in a way that turned out to never be necessary. I’ve also seen (and written) a ton of code that was built to be extensible, but not quite extensible enough to capture a later use case. This is where the tips about making your design orthogonal and decoupled come in. If you can rip out a whole module and replace it with something else, then you can change it, even if it’s not as easy to change as it would be if you’d thought of something when you first designed it.

Tip 17: Eliminate Effects between Unrelated Things

This is the larger principle behind a ton of well-known best practices. Don’t use global variables, because they provide a channel for two unrelated things to affect each other. The C preprocessor is a pain because unrelated files can affect each other if they happen to be compiled together. I also find a lot of semi-common practices in hyper-dynamic languages like Ruby, Javascript, and Python to have a lot of potential to create effects between unrelated things. Javascript didn’t use to have real modules, so libraries would shove things in the global namespace. It was a complete nightmare to deal with libraries fighting for symbol real estate in the window object. (JS for some reason allows just $ as a function name, so everybody fought over $ until jQuery came in and definitively staked claim to it.) I’ve had some infuriating bugs due to Ruby’s ability to define a class across several files combined with Rails’s implicit global imports. In Ruby, if you define a class with the same name in different files and load both of those files, they will be combined into a single class. Rails will automatically load every file you put in its blessed directories. So I’ve encountered ridiculous bugs where two programmers made classes with the same name in different contexts, and Rails helpfully imported both files, causing Ruby to combine those two unrelated classes into a single class and breaking code in two places that had nothing to do with each other.

Tip 19: Forgo Following Fads

This tip is pretty useless. The idea is good, but it offers no guidance on figuring out what’s a fad and should be ignored. I’ll offer a pointer back to Tip 10, though: critically analyze what you read and hear.

Tip 20: Use Tracer Bullets to Find the Target

This tip is awesome and the book is worth the cover price just for this.

The basic idea is that when you start a new system, you build an end-to-end skeleton of the entire architecture with some example flows. Stub out anything you have to, just get the end-to-end working as quickly as possible. This will show you what the architecture you have planned will look like in practice. It’s called a “tracer bullet” because it lets you see where you’re aiming, and you can assess how close your shot is to the target.

The book suggests showing your skeleton to customers. This probably depends on your organization’s culture and relationship to customers. Nothing I’ve worked on benefited from being shown to customers. The customers would get hung up on details like the color of the banner on mocked up web pages, or they would seize on random technical details that they happened to know about (“It won’t allow escalation of privilege, right?”, “The JSON responses are small and streamlined, right? That makes it fast.”, “Did you use Angular? Angular makes rich interfaces.”), or they would look at it, grunt, and go back to whatever they were doing. But it can be a great benefit to show it to engineering leadership, or to engineers on other teams, or product managers or UX experts. And if you happen to have customers able to give useful feedback, sure, show it to them.

Tip 32: Read the Damn Error Message

Part of this tip discusses rubber ducking, which is where you explain your problem to a rubber duck because your problems are idiotic and should not be inflicted on other humans. Because your problem is so idiotic, the rubber duck’s silence will make you feel shame, causing you to quit the industry and take up competitive bass fishing.

In case you don’t want to quit the industry yet, I’ve found that a more productive strategy than talking to a duck is writing out what the buggy code is doing as if you’re manually running through something you’ve written for an interview. If you don’t know why something happens, note that and the result and move on. Start a little before the bug, and keep going to a little bit after—the moment where the exception is thrown or the erroneous result returned is usually a good place. For me this usually causes something to click eventually, and I see what I’m missing. It can also help, once you’ve isolated the buggy code and the bad result it outputs, to treat it like a brainstorming session—what are five ways this bad result could have been created? Get completely crazy; “solar wind knocking an electron out of alignment, causing the bits to swap” is on the table. How do those five ways relate to the surrounding code? What else is going on in the program at the same time this code is executing?

Tip 37: Design with Contracts

I’ve come around a little on statically typed languages. Java is still kinda painful to write, but at least it’s also kinda painful to write for that guy who was three months out of university when he was asked to write an entire analytics tool in a week and didn’t have time to write comments or real error handling—his code is probably a mess, but at least Java forced him to leave a few more breadcrumbs about what the hell he was doing than Python or Javascript would have. And newer languages like Go and Rust reduce a lot of the annoying parts of Java by adding better type inference and other goodies.

Statically typed languages force you to write basic contracts for every function. It’s nicer to write contracts that express some real invariant about your data—“x represents an age, so it must be between 0 and 200”—but at least “x is an integer” is more than nothing.
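As a sketch of the difference, you can layer the real invariant on top of the basic type contract (the function name and bounds just follow the age example in the text):

```python
def set_age(age: int) -> int:
    """The type annotation says "age is an integer"; the asserts carry
    the real invariant: an age must be between 0 and 200."""
    assert isinstance(age, int), "age must be an integer"
    assert 0 <= age <= 200, "an age must be between 0 and 200"
    return age

print(set_age(34))  # 34
try:
    set_age(999)
except AssertionError as e:
    print(e)  # an age must be between 0 and 200
```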

But contracts are most important for system boundaries, where different people’s code comes together, and nowadays that often happens over a network. So that’s where statically typed data transfer formats, like protocol buffers, become useful.


There’s a lot of good advice in The Pragmatic Programmer; I didn’t bother to discuss most of the tips because I just nodded along when I read them. But there’s also some stuff implicit in the book’s worldview that I don’t agree with—I have a whole rant about “people are not their code” that I might go into when I finish up the book. I recommend you read it, but keep Tip 10 in mind while you do. Don’t be a beaver-cleaver. Be a cool rebel who studies humanities. Donne is subversive, man.

Why Putting JSON in a Database Column is a Bad Idea. Or, Why Would You Ever Put JSON in a Database Column?

When I was a beginner at databases, I was tempted to put JSON in a string column. It’s a weirdly common thing for newcomers to relational databases to want to do, and a lot of tutorials aimed at beginners suggest doing it for some reason. When people show up on Stack Overflow wanting to know why their code isn’t working, and that code happens to be putting JSON in a relational database column, sometimes someone will show up and tell them to quit doing that, but it’s only about a 50/50 shot. Pretty much every other questionable beginner practice will attract hordes of smartasses, some of them barely more than beginners themselves, who will strongly and sometimes stridently censure the beginner for contemplating it. I attribute the popularity of Mongo and other NoSQL databases in part to this instinct: Mongo is basically a database with nothing but string columns that have JSON in them, plus some search functionality for the JSON in those string columns.

If you’re sometimes tempted to put JSON in a string column, I hope I can explain today why it’s a bad idea and you shouldn’t do it, in a way that will make sense. On the other hand, maybe you’re one of those relational database savants who understood third normal form right away, and you don’t understand why anyone would ever want to put JSON in a string column instead of having a real schema. If so, I hope I can explain to you why someone might want to put JSON in a string column.

Strong Static Typing, Relational Data Modeling, and Escape Hatches

In “Is Weak Typing Strong Enough?”, Steve Yegge describes (buried somewhere in the point he was actually making about programming languages) how teams that he worked with at Amazon were too boxed in by the strong statically typed strict schema the relational data models imposed on them, and resorted to tactics like an untyped name / value system and passing an XML parameter as a string to a CORBA interface. (CORBA was a standard for an early style of distributed services that let you treat objects running on remote servers as if they were in the same address space and could call each other’s methods. It was also language agnostic, so you could run Java on one server and APL on another server and they could use their CORBA implementations to call each other’s methods. At least, that’s what Wikipedia said; CORBA predates my programming experience by some years.)

The point here is that relational database models are a sort of strong static typing. No matter how much you love strong static typing, sometimes it’s not flexible enough, and you need escape hatches. In 2005 when Steve Yegge wrote his piece, it was XML strings through CORBA. On a programming language level, it’s things like downcasting from Object in Java or using the dynamic keyword in C#. Storing JSON in a string column is another one of these escape hatches. What you get is flexibility. Let’s keep this in mind as we go into our big example, which I hope will demonstrate both why you shouldn’t put JSON in string columns and why you might want to sometimes.

Hugs and Warm Fuzzies: An Example

This example is loosely based on a real system I worked on that had JSON in string columns.

Let’s say we’re working on an MMORPG called Crayon Art Online. Unlike most MMORPGs, which are all about killing things, Crayon Art Online is all about love and friendship and positivity.

Every few hours, one of our servers needs to kick off a job that will read a bunch of player actions from the game’s main database and calculate metrics on them, storing them in a separate metrics database. Since the game is all about doing friendly, happy things, the actions will be things like hugs_given, crayon_drawings_gifted, lunches_bought_for_others, pep_talks, and positive_affirmations. There will be some actions that have extra associated metrics, like hug_warmth and affirmation_positivity and pep_talk_peppiness. There will be some conditions where we just check for a Boolean answer, like sculpting_class_taken?, which just has a true or false answer. We need to calculate all these metrics for all the players across different periods of time: the past 30 days, the past 90 days, the past year, and the entire period the player has existed. Then we need to store them in a database so another job that runs later can read them and distribute prizes to the players that reach certain goals, and messages of encouragement and hope to the others.

Let’s go into some aspects of our data that we’ll have to model. We have a ton of different actions–let’s say there are about 25 total. It’s not a fixed set; there will be new metrics if the developers add new actions to the game, and they might also remove actions, so we would lose the metrics associated with those. That means making a separate table for each metric is doable, but a huge pain. Every time we add a new metric, we have to add a new table. Reading all the metrics for a player means querying 25 tables. A simple manual check for data consistency between the main game’s database and ours requires us to write a select command that includes 25 tables. It seems there must be an easier way.

However, one table also clearly won’t work unless we store all the values as strings. We need a value column to be able to store integers (hugs_given), floats (average_positivity), Booleans (is_wearing_floppy_hat?), and strings (floppy_hat_color). So another option would be one table per value type: int_valued_metrics, real_valued_metrics, boolean_valued_metrics, and string_valued_metrics. But that’s pretty tacky. Your data model is supposed to model the data, like it says in the name. The data itself doesn’t naturally divide into these different types; they’re all just metrics, with the same fundamental features: a name, a time period, a user, and a value. Our database engine needing to store the value as an integer or float or whatever is an implementation detail, not an actual part of the domain, and it feels unnatural to spread the data model for the single metric concept across four tables (and counting, if you add types later—maybe at some point you have tuple-valued metrics, or BLOB-valued). This is better than 25 tables, and it’s better than one table, but it’s still not good.

At this point, things look pretty dire. Relational data modeling has failed us. But wait, you realize. Every programming language nowadays, well except for Java anyway, has a powerful JSON parsing engine built in that can turn a string of JSON into a native map, with type conversions and everything! So, why don’t we just store metrics as a string of JSON per user?

Strings of JSON have lots of attractive qualities. They’re easy to understand; JSON is dead simple, by construction, whereas SQL and relational data models are complicated and hard to understand. Strings of JSON are easy to parse into whatever kind of map or hash thingy your language supports (unless you’re using Java or Scala), and once you’ve done that, you know exactly what you can do with them. Reading from a relational database usually requires some annoying library that binds the data to native data types, or else you can make raw SQL calls, get handed back the most useless, primitive thing your language supports, and do the conversion yourself. JSON is dynamically typed, so you don’t have to worry about finding the correct data type; just store the string, and when you read it back the parser will figure out what types everything should be. JSON is extremely flexible, letting you store anything from a simple glob of key-value pairs to a bramble of arbitrarily nested lists and maps, whereas with relational data models you’re stuck with two dimensional tables.
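Just to make the convenience concrete, here’s a tiny Python sketch of that parse-and-go workflow; the keys are made up for illustration:

```python
import json

# One call turns a JSON string into a native dict, with ints, floats,
# booleans, and strings already converted. Keys here are hypothetical.
doc = json.loads('{"hugs_given": 5, "hug_warmth": 9.7, "wears_floppy_hat": true}')

# Native types just work: integer arithmetic, float comparisons, etc.
print(doc["hugs_given"] + 1)
print(doc["hug_warmth"] > 9.0)
```

No schema, no binding library, no type declarations; that’s the whole appeal.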

So, let’s just forget all this relational modeling stuff and make a table that maps a user id to a string of JSON that looks like this:

    {
        "hugs_given": {"30_days": 5, "90_days": 10, "1_year": 85, "all_time": 345},
        "has_taken_respect_seminar?": true,
        // ...
    }

Now it’s easy to look up a user’s stats: just select the_json_string from metrics where user_id = 394, and you have a JSON string with all the metrics. Whenever you add a new metric, you just have your code calculate it and stick it in the JSON object before writing it back as a string. No schema changes, no data model cabal to get through, just code sticking things in JSON. If you decide integer-valued stats should also support a count for the last 42 days, just calculate it and stick it in everyone’s JSON.
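A minimal sketch of that whole flow in Python, using sqlite3 only so the example is self-contained (the real project was on MySQL); the metrics table and the_json_string column are the hypothetical names from above:

```python
import json
import sqlite3

# Hypothetical one-JSON-string-per-user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (user_id INTEGER PRIMARY KEY,"
             " the_json_string TEXT)")

stats = {"hugs_given": {"30_days": 5, "90_days": 10, "1_year": 85, "all_time": 345},
         "has_taken_respect_seminar?": True}
conn.execute("INSERT INTO metrics VALUES (?, ?)", (394, json.dumps(stats)))

# Looking up a user's stats is one select plus one parse.
row = conn.execute("SELECT the_json_string FROM metrics"
                   " WHERE user_id = 394").fetchone()
user_stats = json.loads(row[0])

# "Adding a metric" is just code sticking things in the JSON; no schema change.
user_stats["hugs_given"]["42_days"] = 7
conn.execute("UPDATE metrics SET the_json_string = ? WHERE user_id = ?",
             (json.dumps(user_stats), 394))
```

Note that the database never sees a schema change here; only the code knows the blob grew a new key.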

But here’s a problem: suppose you want to find all the users with more than 26 hugs given in the last 30 days. How do you query this data? SQL can’t read JSON, so you can’t do select * from metrics where the_json_string.hugs_given.30_days > 26. The JSON is opaque to the relational database. (Unless you’re using PostgreSQL. We’ll pretend you’re not. The real project I worked on used MySQL, but I believe Oracle and MS SQL Server also lack a JSON type.)

So how do you write this query? You can write your own terrible ad hoc JSON parser using regexes and LIKE right there in your database console, or you can write a script in some language like Python or Ruby that reads in the data, parses the JSON string, and does the check for you. Writing this script can get surprisingly tricky. If the entire table doesn’t fit in memory, you can’t just read all the rows and loop over them. The simple approach would be to get the maximum id of the table and then select each row by id, skipping over misses, until you’re done. But this incurs a disk read on each iteration of the loop, plus possibly a network call if you’re not allowed to run the script right on the database server, so it’ll be agonizingly slow. So you’ll probably read in batches, and now you have to mess with the logic around batch sizes and offsets. It probably won’t take more than an hour or so to write this script. Still, a simple select is something you can dash off about as fast as you can type it.

And even if you write the script generically enough that you now have a general-purpose script for querying based on values in the JSON string, you still don’t have all the tools available in one place. If you’re working in the database console and suddenly realize you need to query some data in the JSON string to proceed, you have to stop, go to another window, and run the script. There’s extra friction if you need to feed that data into an SQL query at the database console to move on. There’s this gaping hole where you can’t use SQL to find data that’s stored in your SQL database.
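Here’s roughly what that script looks like in Python. sqlite3 stands in for the real database, the names come from the hypothetical table above, and the batching walks the primary key (“everything greater than the last id seen”) rather than using offsets, which sidesteps the skipping-over-misses problem:

```python
import json
import sqlite3

def users_with_min_hugs(conn, min_hugs, window="30_days", batch_size=1000):
    """Scan the metrics table in batches (the whole table may not fit in
    memory) and filter on a value buried inside the JSON string."""
    matches = []
    last_id = 0
    while True:
        # Keyset pagination: grab the next batch of rows past the last id seen.
        rows = conn.execute(
            "SELECT user_id, the_json_string FROM metrics"
            " WHERE user_id > ? ORDER BY user_id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            return matches
        for user_id, blob in rows:
            # The check SQL can't do: parse the blob and dig out the value.
            stats = json.loads(blob)
            if stats.get("hugs_given", {}).get(window, 0) > min_hugs:
                matches.append(user_id)
        last_id = rows[-1][0]
```

An hour of scripting (plus a round trip out of the database console) to replace what a where clause would do in one line.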

Putting data in strings also kills the validation benefits that a statically typed data model gives you. If someone accidentally writes {"hugs_given": "fish"} to the database, the database doesn’t know that that’s not how you measure hugs. If someone accidentally writes {"hugs_given": {"30_day": 5, "30_day": 10}} to the database, the database doesn’t know or care that you repeated a key.
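And it’s not just the database shrugging; the JSON parser shrugs too. A quick Python illustration (Python’s json module silently keeps the last of a repeated key):

```python
import json

# A repeated key is silently collapsed rather than rejected...
print(json.loads('{"30_day": 5, "30_day": 10}'))   # {'30_day': 10}

# ...and a value of the wrong type round-trips without complaint.
print(json.loads('{"hugs_given": "fish"}'))        # no error
```

Nothing between your code and the column will ever tell you the data is nonsense.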

How to Proceed?

So now we know both the strengths and weaknesses of storing JSON in string columns. It’s flexible, easy to change, easy to write, easy to understand, easy to work with in code. On the other hand, it’s meaningless to the SQL database (unless you’re fortunate enough to be working in Postgres). It’s just this string of characters that might as well be the complete text of the classic novel Lady Chatterley’s Lover by DH Lawrence. You can’t query it from SQL in any meaningful way. You can’t put indexes on it. (Well, you could put an index, but it would be massive and not very helpful.) You lose the data validation benefits that a relational schema gives you. What do we do?

It’s possible that you feel you can live with the limitations of JSON strings in database columns. Maybe you can. I have a feeling they’ll eventually get in your way, though. They certainly got in mine.

If you have the option of switching data stores, you could just use Postgres, or some JSON-oriented document store like MongoDB, although I have reservations about that in general. In Crayon Art Online, as in the real application that inspired it, metrics are calculated by a separate process and stored in a different database than the main application, so even if your main application is on MySQL, it might be doable to switch your metrics application to Postgres. (It wasn’t for me, for a variety of boring non-technical reasons.)

But suppose you’re convinced that storing an opaque string of JSON won’t work for you, and you don’t have the option of switching data stores. You’re left with trying to find a better data model to fit this into a relational schema. In this particular case, where the nesting isn’t that deep, it’s not actually that hard to come up with an acceptable solution that’s better than JSON string in a column.

To start with, I’d represent each individual calculated metric as a row with a user id, type tag (hugs_given, positivity_quotient, etc.), a date range tag (30_day, 90_day, etc.), a value, and a created_at date to record when we calculated it.

In my experience, rolling date ranges are a lot simpler than actual dates, so even though it might be tempting to get rid of the string keys for 30_day, 60_day etc. and come up with some way of storing the date range based on actual dates, it could also complicate things a lot. If you need to know the actual date range this covers, you can count backwards from the created_at. If you expect to need this a lot, it might make sense to store the date keys as integers that represent the number of days covered by this metric, so we’d have a column days_covered, and then we can find the start date of this period by subtracting that from the created_at.
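Sketched as an actual table (sqlite3 again, purely to keep the example runnable; every name here is illustrative), with the hugs query that was impossible against an opaque JSON string now expressible in plain SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per calculated metric: user, type tag, rolling window in days,
# value, and when we calculated it.
conn.execute("""
    CREATE TABLE metric (
        user_id      INTEGER NOT NULL,
        type         TEXT    NOT NULL,  -- 'hugs_given', 'positivity_quotient', ...
        days_covered INTEGER NOT NULL,  -- 30, 90, 365, ...
        value        INTEGER,           -- the value-type question is its own hurdle
        created_at   TEXT    NOT NULL
    )""")
conn.executemany("INSERT INTO metric VALUES (?, ?, ?, ?, ?)",
                 [(394, 'hugs_given', 30, 27, '2017-06-01'),
                  (395, 'hugs_given', 30, 12, '2017-06-01')])

# "More than 26 hugs given in the last 30 days" is now a plain select.
rows = conn.execute("""
    SELECT user_id FROM metric
    WHERE type = 'hugs_given' AND days_covered = 30 AND value > 26""").fetchall()
```

No batch-scanning script, no regex parser in the console; the database can also index and constrain these columns.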

The main hurdle with representing this in a schema is the fact that the value type can be an integer, real number, Boolean, string, or possibly other things. Making separate tables for each value type is tacky in my opinion, but you could go this route if you want. Then you could use some kind of wrapper at the data access layer to hide the four classes from your code and just present one consistent interface for them.

An even more tacky variant would be to make your value column store a foreign key into another table, and make four value tables that hold your values. So you’d have the five tables metric, int_value, boolean_value, real_value, and string_value, and metric would have a type column that tells it which table to look for its values in, and a value_id column that points into one of the value tables. Not only does this have the same disadvantages as the four metrics tables, it’s also weird and complicated to understand.

Another option: store the value, and only the value, as a string, and implement some kind of conversion to the correct type at the data access layer. You can still run queries on the numeric values by casting the string to a number, so you’re not as out of luck as you were with opaque JSON strings. None of these are great solutions. This was the original hurdle that led us to putting JSON in a text column, and I don’t know of any really good way out of it. But I do think there are better ways than JSON in text columns.
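The value-as-string variant can be sketched the same way; the cast keeps numeric queries possible, though the database still won’t validate what goes into the column. As before, sqlite3 and all the names are just for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every value stored as a string; the data access layer owns the conversions.
conn.execute("CREATE TABLE metric (user_id INTEGER, type TEXT,"
             " days_covered INTEGER, value TEXT)")
conn.executemany("INSERT INTO metric VALUES (?, ?, ?, ?)",
                 [(394, 'hugs_given', 30, '27'),
                  (395, 'hugs_given', 30, '12'),
                  (394, 'floppy_hat_color', 30, 'red')])

# Numeric comparisons still work by casting the string inside SQL.
rows = conn.execute("""
    SELECT user_id FROM metric
    WHERE type = 'hugs_given' AND CAST(value AS INTEGER) > 26""").fetchall()
```

You lose type checking on write, but unlike the JSON blob, the value is still individually addressable from SQL.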

This is not a perfect, beautiful schema. It’s not a work of art. It’s not going to get written up in textbooks or research papers as amazing. But it is a real relational database schema which will work better for you in the long run than shoving a string of JSON into a column. (Or a string of CSV, or XML, or YAML, or any other kind of structured text format that isn’t supported natively by the database type system.) You can do queries on it, you can analyze it, you can put indexes on it, you can implement certain data consistency standards on it (e.g. a unique key on user_id, type, days_covered, and created_at that prevents you from accidentally inserting the same stat for the same user and date range more than once). There are also opportunities to make things better as you discover problems. With text in a column, there’s not a lot else you can do with it when you discover problems.

This schema is also more flexible and manageable than separate tables for each metric. If you need to add a new stat, just start calculating it and add a new tag for it. If you want to support a new period of time, just make a tag for it and start storing it. No need for schema changes.

The Takeaway

If you came into this as an expert who loves relational databases and strongly typed schemas, I hope you can understand a little more the kind of problems that seem too hard to fit into the strictures of such a schema, driving beginners to want to put JSON in string columns.

If you came into this as someone who’s occasionally—or more than occasionally—thought the answer to all your problems was shoving JSON in a text column, I hope you now understand why that’s not really the answer to your problems. Or you learned that you need to switch to another data store that supports shoving JSON into text columns better, and now every company you work at, you’ll lobby to switch to Postgres or Mongo. Someday that might even lead you to a job as a MongoDB ambassador who gets to travel around to conferences telling everyone why shoving JSON in text columns is terrible, and they should instead shove JSON in MongoDB to solve all their problems. Just think; I could be starting careers here today.

Foxes and Hedgehogs

When I first started programming, which wasn’t actually that long ago, but it feels like ages because of the constant churn, languages were like drunks in bars: they all had a philosophy. Some of them made a lot of sense, like Ruby’s “Principle of Least Surprise”. Some of them were incoherent, like Perl’s “TIMTOWTDI” and “GALFKGAFCVHAJRTOAIADGDLFJKC”. Many were encapsulated in nice little slogans that people could yell at each other in forums. Others were implicit, like Java’s philosophy, “Make Insanely Complicated Libraries So Companies Have To Hire Exorbitantly Priced Consultants To Teach Their Programmers To Accomplish Basic Tasks In Java”. But whether the philosophy made sense or didn’t, whether the philosophy was implicit or explicit, the languages all had one.

The philosophies served a lot of social purposes, but in most cases they did begin as a motto or design principle that the language creators kept in mind when designing the language. Recently, the languages I’ve seen becoming popular, like Kotlin and Swift, don’t seem to have philosophies. They have purposes: Swift’s purpose is freeing iOS and macOS programmers from having to manually decrement reference counts, and Kotlin’s purpose is to integrate with Jetbrains IDEs. But they don’t seem to have language design philosophies the way older languages do.

A lack of a language design philosophy is sort of a language design philosophy. It means the creators don’t have an overriding vision for the language. But without an overriding vision, how do they decide which features get in and which don’t? I don’t know a ton about either Kotlin or Swift, but one thing always stuck out to me in tutorials on them: the tutorials read a lot like lists of features from other languages. “Swift has for-each loops. Swift has lambdas. Swift has classes. Swift has limited type inference.” They would explain how the particular implementation worked, but it was usually substantially similar to other well-known languages. Kotlin and Swift are not random assemblages of features the way languages like C++ and Perl 5 and PHP are. They have an aesthetic, which I’d call roughly “functional-lite”. They’ve settled on a Python-esque profile, largely imperative with object-oriented and functional features worked in around the edges, and added some Java-style static typing. I divined that they choose features based on what they think working programmers will find useful.

Of course, no popular language aimed at working programmers purposely adds features that are highly likely to be useless. But some languages are built according to a strong, overriding vision of what set of abstractions will ultimately prove useful to working programmers. Others are more pragmatic; they add whatever features have been proven, through long use in other languages or through long experience without the feature, to be practical and useful for working programmers. Older languages made this into a philosophy itself: Perl’s “There Is More Than One Way To Do It” was expressing a belief that there should be multiple syntaxes for the same abstraction, presumably because each one would feel natural in different situations. (I never worked with Perl, but the concept does make sense—the same abstraction can manifest concretely in different ways, each of which feels appropriate in different situations.) This distinction is sort of a meta-philosophy on top of the various philosophies (explicit and implicit) of different languages. Swift and Kotlin are in the latter camp, but since they’re backed by large corporations and have special niches where their dominance is uncontested, they didn’t need to create marketing slogans that explicitly state it the way Perl did. They didn’t need to differentiate themselves from other languages with marketing; they differentiated themselves by being the languages of Apple and Jetbrains, and the languages of iOS development and extreme IDE integration.

Once I realized this divide existed, it reminded me of foxes vs. hedgehogs. The idea comes from a snippet of Ancient Greek poetry attributed to Archilochus: “The fox knows many things, but the hedgehog knows one big thing.” To wit, the fox stays alive with many tricks to keep ahead of its enemies, whereas the hedgehog has one big trick, stabbing its enemies on its spines. (We’re excluding the possibility of a hedgehog that’s the fastest thing alive.) I first heard about it in politics. Some politicians dabble in many issues; they might have a few flagship issues, but they’re conversant with almost all of them, and they’re quick to educate themselves about new issues. Other politicians have one big issue that they care about more than anything, and are less concerned with other issues. Isaiah Berlin applied it to writers in his essay The Hedgehog and the Fox: some writers, like Plato, Dante, and Nietzsche, have one big idea that colors all their work, while others, like Shakespeare, Joyce, and Goethe, write about a wide range of ideas and experiences. It’s the same with programming languages. Some languages are fox languages: they might have a few flagship features, but they’ll include various features from all over the place, and while they’re not necessarily garbage heaps of trendy features, they’re more open to borrowing. Some languages are hedgehog languages: they’re built around a singular vision. The vision might be towards a stance on a big philosophical issue. It might be a vision of a language built around a big abstraction. It might be a vision of a language perfectly suited for a single restricted domain. For a hedgehog language, everything that goes into the language serves its one singular vision.

Why is this distinction useful? Because new languages (and libraries, frameworks, platforms, etc., which can be judged much like programming languages) appear constantly, and they’ll all tell you that they can solve all your problems. Evaluating whether the tool’s meta-philosophy matches your goals is a good first step to finding out if that’s true. It’s not about one or the other being bad or wrong. Both have their uses.

If you think hedgehog languages are useless, think about this. Nobody likes writing multithreaded code. Well, okay, some people like writing it, but nobody likes debugging the insane synchronization issues that crop up later. Nobody likes spending a bunch of time crafting the perfect multithreaded architecture only to discover it has a fundamental flaw that makes all your code actually just really expensive sequential code. So a bunch of hedgehogs who had a vision of a better way to do this concurrency thing went off and created abstractions like software transactional memory, actors, communicating sequential processes, single-threaded event-driven programming, and others, and then a bunch more hedgehogs made languages like Clojure, Erlang, Scala, Go, and Node that baked in support for them as part of their visions. Another big victory for hedgehogs: OOP. Hedgehog languages like Smalltalk and Java went all in on OOP and convinced a generation of programmers that writing libraries in C was painful and bad. Whether you think OOP was a good idea or not (I’m starting to have my doubts nowadays), it was a huge propaganda victory for hedgehogs.

I hope it takes less convincing to see that fox languages have their uses. Their benefits are more obvious: programmers who’ve used the same features in other languages can jump right in and understand them without learning new abstractions. Knowledge about how to solve problems in other languages carries over without any trouble. The thing is, these seem to be cyclic. For a while, everyone just uses the fox languages and appreciates all their homey conformity and routine. Want to write a website? Here’s a fox language. Want to write a mobile app? Here’s another fox language that’s basically the same, except maybe you can leave parentheses off no-arg function calls. Want to program an Arduino? Here’s yet another fox language that’s basically the same as the first two, except maybe it has an easier dictionary syntax. Then some new problem appears. Reusable software. Distributed computation. Big data. Machine learning. Whatever. The old fox languages are missing something that makes this new problem easy to solve. So a round of hedgehog languages appears, built around a single grand vision of how to solve this new problem. Sometimes people adopt one of these hedgehog languages at large scale. Sometimes it survives multiple cycles by turning into a fox language. Other times, the new hedgehog languages are too inaccessible or don’t have enough marketing or whatever, but after a while a new round of fox languages appears that brings their features to the masses.

This is where I see Swift and Kotlin fitting in. Everyone was aware that Java 6 was making things harder. Everyone knew Java 6 didn’t have the right abstractions for modern, quick-churn web and mobile code. A round of hedgehog languages came out to address this—Scala, Clojure, Go. They’ve gotten some adoption. Go is foxy enough, and backed by a huge enough company, that it might live to become a fox language. But the barrier to entry was high enough, and the lack of marketing profound enough, that most of these languages haven’t caught on in a big way. Not like Java, or Python, or PHP have. So now we’re having a round of fox languages—Swift, Kotlin, JavaScript after ES2016—that consolidate features from all of them. We’re having languages like Java go foxy and add features from these languages to new versions.

So at this point, you might be saying “That’s great, but I’m a programmer who writes boring business apps. How do I figure out what I should be using?” I’m glad you asked! The distinction between fox languages and hedgehog languages is most important of all for you.

Boring business apps have a particular set of constraints that are different from most other kinds of program: the code will be changed in a million places in a million small ways over its lifetime, and a wide array of programmers should be able to understand it, so legibility, documentation, and easy modification are the most important characteristics. There is almost never a good reason to use hedgehog languages for boring business applications. Hedgehog languages are based on a grand vision of how computation should be expressed, and boring business applications are not about grand visions of how computation should be expressed. Boring business applications are best expressed in the abstraction of structured programming, which is now such a fundamental part of programming languages that it doesn’t even matter whether a language is for foxes or hedgehogs; either way, it supports structured programming.

So, if you write boring business applications and you’re evaluating a language (or framework, database, etc.), you should look for signs of foxiness. Are the features widely supported in other, similar, existing tools? Do the creators stress practicality, usability, reliability? Are the overarching abstractions familiar, or at least widely available in other foxy tools? If you’re seeing a bunch of unfamiliar features, or familiar features mixed in unfamiliar ways; if the creators are talking about vision, about changing the way things are and blowing up the status quo and revolutions and paradigm shifts; or if the tool is based around abstractions that you’ve never seen before or that only exist in very obscure tools, then you’ve got a hedgehog tool.

It’s much harder to evaluate whether a hedgehog language is suited to what you need. Hedgehogs have to be carefully matched to a use case and a set of problems, because hedgehogs are usually created in response to a specific use case and set of problems.

I’ll end by listing off a few languages and evaluating whether I consider them fox or hedgehog, as they currently stand.

C: At the time it was made, very much a fox language. Now it’s something of a hedgehog language, in that its one big use case is programming in low-level, resource constrained, or high performance contexts. You probably wouldn’t choose to write your startup’s killer new social network in C.

C++: Definitely a fox language. C++ keeps on adopting new features like lambdas that proved popular in other languages. It’s mostly seen nowadays in similar contexts as C, but unlike C, it keeps piling on new features and new abstractions in response to programmer demand.

Java: Java started as something of a hedgehog language. It’s become less of one over time, but still looks spiny enough to me. It was object oriented programming for the masses, C++ without the rough edges, Smalltalk without the flexibility, and its adherence to object orientation still seems a bit militant. It’s gotten new features like generics and lambdas, but it’s always added them by putting a thin layer of syntax on top of an existing part of the language: generics just tell the compiler to automatically insert casts (the types are erased at runtime), and lambdas are just syntactic sugar for anonymous classes implementing interfaces with one method.

Python: Python is a fox language with a few hedgehoggy principles, like “There should be one and only one obvious way to do it”, which leads to enforced whitespace conventions. Over time, it added features like classes, map and reduce, list comprehensions, iterators, generators, and async / await in response to other languages. It also made it relatively easy to call code written in C, which is the foxiest trick of all.

Python is also a good case study on communities vs. languages themselves. The Python language is a fox, but the community behaves like a bunch of hedgehogs. Python seems to usually resist adding any feature until people have been asking for it for three or four releases, and then someone will make a PEP and it’ll get added.

Ruby and Perl: These two are a lot like Python, but with even fewer overt hedgehog tendencies. They both borrowed things from bash, which is a very foxy language in that you only use it when you have no other way to accomplish something.

Scala and Clojure: These two are the opposite of Python, Ruby, and Perl, which were foxy with a few hedgehoggy traits. Scala and Clojure are hardcore hedgehogs with a few foxy traits, the biggest one being that they run on the JVM and can interoperate with Java code. Scala has a grand vision of mixing Java-style OOP with Haskell-style functional programming. Clojure has a majestic ambition to be a Lisp dialect for the modern era, mixing a solid concurrency story and good support for hash maps with the functional features of Haskell and the homoiconic power that Lisp is known for. Both of them have high learning curves due to lots of unfamiliar abstractions and unusual features, like Scala’s implicits and Clojure’s menagerie of types for managing state under shared-memory concurrency.

Lisp in general is a hedgehog language. It has a grand vision of all programming as lists of lists that directly represent an abstract syntax tree. Common Lisp is a Lisp that tried to be foxy with stuff like rplaca and the progn special form. Scheme is a Lisp that skewed further towards the hedgehog. Clojure is somewhere in the middle; it runs on the JVM and adds features like hash maps, sets, and regex literals that are foxy compared to Scheme, but also encourages pure functional programming and immutability to an uncomfortable hedgehoggy extent. Its concurrency stuff is pure hedgehog—yes, even core.async. We’ll get into this in the next section.

Go: I’ve only used Go a little bit, but to my eyes it’s a language that lures you in by appearing to be a fox, but then when you go to pet it, surprise! You’ve got hedgehog spines stuck in your hands! It’s a hedgehog that knows how to play tricks like a fox.

There are two aspects to Go’s inner hedgehog. One is its concurrency stuff, with goroutines. Yes, I know it’s based on communicating sequential processes, from the 70’s. The thing is that there are at least two whole generations of programmers who’ve grown up never hearing about communicating sequential processes and never using a language that supported it. There’s the 1990s era of big enterprise Java developers, and there’s the current era of hipster startup Node hackers, plus a few intervening generations, all united in their never using a language that supported CSP as a first class abstraction. Since Clojure’s core.async is a ripoff of goroutines (as admitted by Rich Hickey himself), it’s also a hedgehog trait.

The other aspect is Go’s slavish adherence to C’s precedent. It has pointers. It has printf and scanf. It doesn’t have generics, and has repeatedly refused requests to add them, and justified this decision by appeals to Go’s design philosophy of simplicity. Whenever a widely requested feature is refused because of a design philosophy, that’s a sure sign of a hedgehog. I’m not saying whether that’s right or wrong, but it is a fact. C itself was foxy, but Go’s insistence on copying it beyond reason is total hedgehog.

My Biases

To disclose my own biases, I actually like hedgehogs, except when they go wrong, which they can, terribly, much more easily than foxes. Go’s hedgehog nature was what induced me to try it over something more foxy like Node or Kotlin. Clojure’s hedgehog nature induced me to try it over something more foxy like Scala. (Yes, Scala is also a hedgehog, but it looked deceptively like a fox from far away.)

Hedgehog languages are usually the ones that win over the people, but don’t get any traction in industry, and forums are filled with programmers wondering why no one can see how great this language is and what a killer secret weapon it would be for some team.

Even though I like hedgehogs, when the chips are down and the title is on the line, I always go back to foxes. If I have to program something and I have no idea what I’m doing, it’s back to Python. I don’t reach for Clojure or Go, because with hedgehog languages, you never know when you’ll run into some problem that, due to its very nature, doesn’t translate well into that language’s idioms and abstractions. If it turns out that a bunch of hideous index manipulation is the only way to solve the problem, I’d rather take my chances with Python than write weird Clojure code full of atoms and swap! and reset!.

Java is a hedgehog language. And a lot of pain in modern software has happened because people used Java somewhere where its insistence on object-oriented programming, combined with the militant adherence to object-oriented design principles that its programmers have developed, didn’t do a good job representing the problem domain. I work on a Java codebase at my job, and every day, I run into things that Java makes harder by being such a freaking OOP hedgehog. Testing, for example. You can write testable code in Java, but features like private access and lack of standalone functions are hostile to it, not to mention all the ways Java developers’ culture has evolved to make it harder to write testable code. This has led to tons of cultural conventions and weird libraries to make writing testable code possible, but still not easy. If the chips were down and the title was on the line, I would never turn to Java. It’s fairly likely if I pick Clojure that I’ll have to do some sequence manipulation that’s hard to translate into point-free applications of map, reduce, and filter, but it’s just as likely if I pick Java that I’ll end up wanting arbitrarily nested maps, or pairs, and have to run off and write new classes to represent them. And then I might want a 3-tuple. Guess what? Another new class!

On the other hand, when you have to explore a gigantic monolithic codebase written by 50 programmers over five years, Java’s not so bad. It forced the other programmers to explicitly write out everything, explicitly represent everything with classes and statically typed variables, and even tag which methods you need to worry about seeing again outside of that class. It’s all about knowing when to pick the right tool. Evaluating foxiness vs. hedgehoggosity is one way to get a feel for a tool. Then you can decide whether that particular hedgehog’s spines will solve your problem, or if you need the cunning of a fox to get through.

You Won’t Be Remembered For Your Code

A lot of people who are first getting into the software industry seem to think that they’re going to make some kind of impact with their code. The dumber ones think it’s going to be some kind of huge splash, like people are going to print their code on giant wall scrolls and put it up in museums so future programmers can come through and admire its structure and that amazing little trick with the lambda. The smarter ones know it will be a small impact, but they still think it will be an impact of some kind.

Allow me to rain on your parade: you will never be remembered for your code. Code in industry software is endlessly erased and rewritten. Even that picture of a penis you carved into the wall of a bathroom stall in sixth grade and then initialed like it was some great work of art is more likely to still be there, because replacing the walls of bathroom stalls is expensive and laborious. But erasing your clever little trick with the lambda and replacing it with ten lines of plodding that even the biggest dunderhead couldn’t possibly fail to understand takes about five minutes, and people who erase code are on salary, so it didn’t even cost the company five minutes of work, whereas people who replace bathroom stall walls get paid hourly. Plus, now even the biggest dunderhead couldn’t possibly fail to understand that code, so they can fire the programmer who replaced your clever little lambda trick and hire the biggest dunderhead for a lower salary.

No one is remembered for code. Steve Jobs is remembered for product design and marketing. Bill Gates will be remembered for being insanely rich. Programming languages are products, so programming language creators are also remembered as product designers, not for their wonderful code, and they’re easily forgotten once no one is using the language anymore; give it twenty years and Larry Wall will be a random piece of programming trivia alongside John Ousterhout, Jean Ichbiah, and Niklaus Wirth. Same goes for frameworks, which are even more disposable than programming languages. If anyone remembers David Heinemeier Hansson or Rails at all, it won’t be for some ingenious metaprogramming trick he used in Active Record.

And these are the really famous people, the people whose code actually affected the way thousands of others do their jobs. Even they will not be remembered for how wonderful their code was. So trust me when I say that you won’t be remembered for your great code.

However, if your code sucks, it will keep your memory alive for a while. Every time someone goes into a sloppily formatted file with dozens of unnecessary global variables, lots of overlong methods, and comments that say x++; // Add one to x, they’ll head on over to Git and find out who wrote this crap. If you wrote enough of the code, you might even become something of a legend for a while. Of course, the other programmers will make a point of extinguishing your imprint from the code, so your code is still headed for death, and your legend will live on only in the tales the other programmers tell of the crappy code they suffered with until they could sneak in enough refactoring to eliminate it. You’ll be remembered, but you’ll be remembered the same way Tom Green is remembered, as an irritating bore that everyone raced to expurgate as quickly as humanly possible.

So don’t get into software if you think code is a good way to be remembered. It’s not. Product design is more likely to get you remembered by someone, but that’s not foolproof either; not many people, even in programming circles, have heard of Ward Cunningham, who invented the wiki, or Dan Bricklin, who invented the spreadsheet, even though those are both products at least as important as anything Steve Jobs ever made.

JSON in Java: A Case Study in Pain

Dealing with JSON in most dynamically typed languages is dead simple: you have a function that can inflate a string into the language’s native hash map type, and you have another function that can serialize the language’s native hash map type into a string of JSON. These languages all support nested maps, some sort of list, and strings, numbers, and booleans well enough to automatically map from JSON’s simple data types into the language’s native types.

Dealing with JSON in that emphatically statically typed holdover that’s still hanging around like a bad hangover (the one, the only, Java) will eventually make you wish you were dead.

I’ll be up front with you: this isn’t going to be a deep or insightful read. It’s mostly going to be a rant, with lots of hate for Java and static typing, but mostly static typing as seen in Java, because I’ve never tried to work with JSON in languages like Haskell with more sophisticated type systems, so you Haskell lovers are free to stay smug in the knowledge that you work with a better class of type system, nay, not a class in sight, but rather a category. If you love Java and static typing, you might want to tune out, because I am not at all interested in hearing for the hundredth time how dynamic languages force you to write unit tests for the stuff that static languages check at compile time. This is also not meant as a top-to-bottom indictment of static typing; it’s an examination of how static typing utterly fails to solve one specific problem in an elegant way.

To start with, I’m going to present the problem. Then I’ll examine what I find to be the root causes of this problem, and then rant a little about some features Java will never get that I think could help this problem.

The Problem

There are two major things you need to be able to do with JSON in your programming language: turn a string of it into your language’s native data types so you can work with the data, and turn your language’s native data types into a string of it so you can send it across a network.

JSON (Javascript Object Notation) is based on a restriction of Javascript’s native data types. Looking at the standard, we see that it supports objects, arrays, strings, numbers, booleans, and null. On the face of it, it seems like we should be able to represent these types with native Java types: maps for objects, lists for arrays, and Java’s string, number, boolean, and null for those types. But maps and lists in Java have to be restricted to a single type, so to represent JSON values like {"status": true, "message": "Status normal"} or [23, "hello", false], we have to use a Map<String, Object> and a List<Object>.
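To make that concrete, here’s what those two example values look like in stock Java collections (a small self-contained sketch; every read back out requires a cast):

```java
import java.util.*;

public class MixedJson {
    public static void main(String[] args) {
        // {"status": true, "message": "Status normal"} forces Map<String, Object>
        Map<String, Object> obj = new HashMap<>();
        obj.put("status", true);
        obj.put("message", "Status normal");

        // [23, "hello", false] forces List<Object>
        List<Object> arr = Arrays.asList(23, "hello", false);

        // Every read requires a cast back to the real type
        String message = (String) obj.get("message");
        Integer first = (Integer) arr.get(0);
        System.out.println(message + " " + first); // prints "Status normal 23"
    }
}
```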

Unfortunately, this almost immediately leads to problems. Java lists and maps don’t support being stringified as valid JSON, so you have to write custom code to do it. Actually writing this code isn’t too hard as long as you remember that JSON supports arbitrarily nested lists and objects, so you have to make sure your serialization is applied recursively until you bottom out. But it has all the usual disadvantages that come when you write custom code for fundamental operations instead of getting it from a library: you have to test it, support it, and maintain it yourself, and it’s harder to bring into new projects since you can’t just let your build tool grab it from Maven Central.
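For illustration, a hand-rolled serializer along those lines might look something like this minimal sketch: it recurses through maps and lists until it bottoms out at primitives, but skips real-world niceties like escaping quotes and backslashes inside strings.

```java
import java.util.*;
import java.util.stream.Collectors;

public class JsonWriter {
    // Recursively stringify nested Maps/Lists into JSON.
    // Sketch only: no escaping of quotes/backslashes inside strings.
    public static String toJson(Object value) {
        if (value == null) {
            return "null";
        } else if (value instanceof String) {
            return "\"" + value + "\"";
        } else if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        } else if (value instanceof Map) {
            return ((Map<?, ?>) value).entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + toJson(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
        } else if (value instanceof List) {
            return ((List<?>) value).stream()
                .map(JsonWriter::toJson)
                .collect(Collectors.joining(",", "[", "]"));
        }
        throw new IllegalArgumentException("Not a JSON-compatible type: " + value.getClass());
    }

    public static void main(String[] args) {
        Map<String, Object> obj = new LinkedHashMap<>();
        obj.put("status", true);
        obj.put("ids", Arrays.asList(23, 42));
        System.out.println(toJson(obj)); // prints {"status":true,"ids":[23,42]}
    }
}
```

Thirty-odd lines, and it already needs tests for escaping, unicode, and number formatting before you could trust it; that is the maintenance burden the paragraph above is talking about.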

Working with Objects in Java is rather painful; if you want to do anything with them, you have to cast them. This requires you, the programmer, to know what type every one of those objects actually is. Look at this rather ugly bit of code:

Map<String, Object> jsonMap = getJsonMap();
List<Object> jsonList = (List<Object>) jsonMap.get("theList");
if ((Boolean) jsonList.get(2)) {
    Integer idNumber = (Integer) jsonList.get(0);
    User user = getUserById(idNumber);
}
Not only is this code ugly, it also loses the much-vaunted type safety of Java. You could get runtime type errors if it turns out that the first item of jsonList is not actually an Integer or the third item is not actually a Boolean.

We can solve the ugliness problem a bit by throwing out the built-in Java map and list and using a special set of classes to represent JSON data types. This is what org.json does, and what the tree models in Jackson and GSON use. Instead of a Map, we’ll use a JsonObject; instead of a List, we’ll use a JsonArray; and instead of using Java strings, numbers, and booleans, we’ll use a JsonPrimitive. We’ll make all of these classes implement the JsonType interface.

The JsonObject class will behave like a Map<String, JsonType>. The JsonArray will behave like a List<JsonType>. The JsonPrimitive will store its value as a string, or possibly an Object. It will have a set of three methods: getAsNumber, getAsString, getAsBoolean, and possibly corresponding isNumber, isString, isBoolean. We glossed over the extra complication that the Javascript/JSON number type can be either integral or floating point; we also glossed over the extra complication that JSON numbers don’t have a specified size, so what we get could potentially overflow an Integer or Double. We could handle those problems here either by making getAsNumber return a BigDecimal or making separate getAsInteger, getAsDouble, getAsBigInteger, getAsBigDecimal methods. Or we could decide that overflow will be rare enough that we’ll tolerate errors when it happens, and make getAsNumber return a Double.
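A bare-bones version of this hierarchy might look like the following sketch. All the names are hypothetical; it takes the tolerate-overflow option and stores numbers as plain doubles, and it extends the stock collections so iteration, entrySet(), and streams come for free.

```java
import java.util.*;

// Hypothetical JsonType hierarchy; names invented for illustration.
interface JsonType {
    // Defaults throw, so calling a getter on the wrong kind of node fails
    // at runtime -- exactly the type-safety hole discussed in the text.
    default String getAsString()   { throw new UnsupportedOperationException("not a string"); }
    default double getAsNumber()   { throw new UnsupportedOperationException("not a number"); }
    default boolean getAsBoolean() { throw new UnsupportedOperationException("not a boolean"); }
}

// Extending the stock collections gets us Iterable (the smart for-loop),
// entrySet(), and streams for free.
class JsonObject extends HashMap<String, JsonType> implements JsonType {}
class JsonArray extends ArrayList<JsonType> implements JsonType {}

class JsonPrimitive implements JsonType {
    private final Object value; // a String, Double, or Boolean

    JsonPrimitive(Object value) { this.value = value; }

    boolean isString()  { return value instanceof String; }
    boolean isNumber()  { return value instanceof Double; }
    boolean isBoolean() { return value instanceof Boolean; }

    @Override public String getAsString()   { return (String) value; }
    // Taking the "tolerate overflow" option: numbers are just doubles.
    @Override public double getAsNumber()   { return (Double) value; }
    @Override public boolean getAsBoolean() { return (Boolean) value; }
}

public class JsonTypesDemo {
    public static void main(String[] args) {
        JsonArray arr = new JsonArray();
        arr.add(new JsonPrimitive(23.0));
        arr.add(new JsonPrimitive("hello"));
        arr.add(new JsonPrimitive(false));
        for (JsonType t : arr) {              // the smart for-loop works
            System.out.println(t instanceof JsonPrimitive);
        }
    }
}
```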

With this new set of classes, we can rewrite the above example:

JsonObject root = getJsonMap();
JsonArray array = (JsonArray) root.get("theList");
if (array.get(2).getAsBoolean()) {
    User user = getUserById(array.get(0).getAsInteger());
}
It’s at least less ugly without all those casts and temporary variables, and since the types make it obvious that we’re dealing with JSON, we don’t have to name everything jsonWhatever. But we’ve still lost type safety; if we do array.get(2).getAsBoolean() and that item isn’t a boolean, we’ll still get a runtime type error. And these custom JSON types are inevitably less convenient to work with than the standard library data types. What options do we have if we want to iterate over a JsonArray? Does it implement Iterable<T>, which will let us use the “smart for-loop”, or are we stuck with the C-style for-loop? Can we get the entry set of a JsonObject and iterate over all its keys and values if we want to? Do these classes implement Collection<T>, making them streamable?

This solution with separate JsonWhatever classes is the quick and dirty way of solving the JSON problem in Java. Imagine that your car breaks down in the middle of the desert. You have a tarp, a roll of duct tape, and a long pole in the back, so you tape the tarp to the pole, tape the pole to your car, and you’ve got a sail. The JsonWhatever classes are the homemade car sail of JSON handling. It does nothing whatsoever to fix the fundamental problem, but it does kind of get you moving again, in a way that’s slightly better than walking or pushing the car.

Most high-powered JSON libraries in Java go with a more sophisticated approach, borrowed from ORMs such as Hibernate. The Java language really really wants you to specify everything precisely, categorically, and at a low level: this class has these exact five fields with these five exact names, and this one’s an Integer, and this one’s a String, and this one’s a NosehairInspectorFactory, and so on. So why not create a class that specifies what your JSON will look like, precisely, categorically, and at a low level?

This is what Jackson and GSON do. They include a class, called ObjectMapper in Jackson, that can take a normal Java object and turn it into a string of JSON, or take a string of JSON and a normal Java class and create an object of that class from the string. Here’s an example:

class MyObject {
    private Integer myId;
    private String myPreferredGreeting;
    private Boolean isHasBeenGreeted; // Note, this is what you get when you insist on starting all boolean names with "is".
    private MyOtherObject myOtherObject;
    private List<Integer> allMyBuddiesIds;

    // All that getter and setter nonsense that Java insists on...
}

Jackson can map between this JSON and an object of that class:

    "myId": 23,
    "myPreferredGreeting": "Whassup dorks?",
    "isHasBeenGreeted": false,
    "myOtherObject": {
        "myFavoriteDrink": "Rolling Rock",
        "myFavoriteFood": "Cheese pizza with the crust cut off",
        "myLifePhilosophy": "YOLO bros"
    allMyBuddiesIds: [21, 33, 49]

You might be wondering exactly how it does that. The answer is magic. As previously mentioned, I hate magic. But Jackson is magic from top to bottom. Except when it’s not.

The basic idea, to give a little more detail, is that Jackson can parse a JSON string into some intermediate form (which may or may not be the tree model it also offers, its version of our JsonWhatever types above; I haven’t looked into Jackson’s source code far enough to know). From there it can use reflection, possibly assisted by annotations you put on your class, to look for a way to map from the JSON object to your class. If you’re familiar with dependency injection, it’s pretty much the same idea: Jackson looks for a constructor or some setter methods it can use to inject the values from the JSON into an object of the class you provide. Then we have a regular Java object, referred to in this context as a POJO (“Plain old Java object”) since it’s not a special JSON-whatever object.
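To take some mystery out of the magic, here’s a toy version of the setter-injection idea built on plain java.lang.reflect. It is nothing like Jackson’s actual implementation, and the class and method names are invented: given an already-parsed Map of JSON values, it finds a matching setter for each key and invokes it.

```java
import java.lang.reflect.Method;
import java.util.Map;

public class ToyMapper {
    // Bind a parsed JSON map to a POJO by matching keys to setters.
    // Toy version: no nested objects, no generics, flat primitives only.
    public static <T> T bind(Map<String, Object> json, Class<T> clazz) {
        try {
            T instance = clazz.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> entry : json.entrySet()) {
                // "myId" -> "setMyId"
                String setterName = "set"
                        + Character.toUpperCase(entry.getKey().charAt(0))
                        + entry.getKey().substring(1);
                for (Method m : clazz.getMethods()) {
                    if (m.getName().equals(setterName) && m.getParameterCount() == 1) {
                        m.invoke(instance, entry.getValue());
                        break;
                    }
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

// A tiny POJO to bind against (invented for the example).
class Greeting {
    private Integer myId;
    private String myPreferredGreeting;
    public void setMyId(Integer myId) { this.myId = myId; }
    public void setMyPreferredGreeting(String g) { this.myPreferredGreeting = g; }
    public Integer getMyId() { return myId; }
    public String getMyPreferredGreeting() { return myPreferredGreeting; }
}
```

Calling ToyMapper.bind(json, Greeting.class) with a map containing myId and myPreferredGreeting hands back a populated Greeting. Real mappers layer nested binding, generics handling, annotations, and constructor injection on top of this core trick, which is where all the configuration surface area comes from.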

This sounds amazing. And it is pretty amazing, for JSON with all primitive types and no nesting. Unfortunately, Jackson doesn’t play well with generic types, so you end up having to use raw types for maps and lists, bringing us back to our original problem where we have to know what types are stored in maps and lists and cast them when we extract them.

The other problem with this is that some APIs return JSON responses that look like this:

  "statuses": [
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [

        "hashtags": [
            "text": "freebandnames",
            "indices": [
        "user_mentions": [

      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Aggressive Ponytail #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 250075927172759552,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_sidebar_border_color": "C0DEED",
        "profile_background_tile": false,
        "name": "Sean Cummings",
        "profile_image_url": "",
        "created_at": "Mon Apr 26 06:01:55 +0000 2010",
        "location": "LA, CA",
        "follow_request_sent": null,
        "profile_link_color": "0084B4",
        "is_translator": false,
        "id_str": "137238150",
        "entities": {
          "url": {
            "urls": [
                "expanded_url": null,
                "url": "",
                "indices": [
          "description": {
            "urls": [

        "default_profile": true,
        "contributors_enabled": false,
        "favourites_count": 0,
        "url": null,
        "profile_image_url_https": "",
        "utc_offset": -28800,
        "id": 137238150,
        "profile_use_background_image": true,
        "listed_count": 2,
        "profile_text_color": "333333",
        "lang": "en",
        "followers_count": 70,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "",
        "profile_background_color": "C0DEED",
        "verified": false,
        "geo_enabled": true,
        "time_zone": "Pacific Time (US & Canada)",
        "description": "Born 330 Live 310",
        "default_profile_image": false,
        "profile_background_image_url": "",
        "statuses_count": 579,
        "friends_count": 110,
        "following": null,
        "show_all_inline_media": false,
        "screen_name": "sean_cummings"
      "in_reply_to_screen_name": null,
      "source": "Twitter for Mac",
      "in_reply_to_status_id": null
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 23:40:54 +0000 2012",
      "id_str": "249292149810667520",
      "entities": {
        "urls": [

        "hashtags": [
            "text": "FreeBandNames",
            "indices": [
        "user_mentions": [

      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Thee Namaste Nerdz. #FreeBandNames",
      "metadata": {
        "iso_language_code": "pl",
        "result_type": "recent"
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249292149810667520,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "DDFFCC",
        "profile_sidebar_border_color": "BDDCAD",
        "profile_background_tile": true,
        "name": "Chaz Martenstein",
        "profile_image_url": "",
        "created_at": "Tue Apr 07 19:05:07 +0000 2009",
        "location": "Durham, NC",
        "follow_request_sent": null,
        "profile_link_color": "0084B4",
        "is_translator": false,
        "id_str": "29516238",
        "entities": {
          "url": {
            "urls": [
                "expanded_url": null,
                "url": "",
                "indices": [
          "description": {
            "urls": [

        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 8,
        "url": "",
        "profile_image_url_https": "",
        "utc_offset": -18000,
        "id": 29516238,
        "profile_use_background_image": true,
        "listed_count": 118,
        "profile_text_color": "333333",
        "lang": "en",
        "followers_count": 2052,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "",
        "profile_background_color": "9AE4E8",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Eastern Time (US & Canada)",
        "description": "You will come to Durham, North Carolina. I will sell you some records then, here in Durham, North Carolina. Fun will happen.",
        "default_profile_image": false,
        "profile_background_image_url": "",
        "statuses_count": 7579,
        "friends_count": 348,
        "following": null,
        "show_all_inline_media": true,
        "screen_name": "bullcityrecords"
      "in_reply_to_screen_name": null,
      "source": "web",
      "in_reply_to_status_id": null
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 23:30:20 +0000 2012",
      "id_str": "249289491129438208",
      "entities": {
        "urls": [

        "hashtags": [
            "text": "freebandnames",
            "indices": [
        "user_mentions": [

      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Mexican Heaven, Mexican Hell #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249289491129438208,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "99CC33",
        "profile_sidebar_border_color": "829D5E",
        "profile_background_tile": false,
        "name": "Thomas John Wakeman",
        "profile_image_url": "",
        "created_at": "Tue Sep 01 21:21:35 +0000 2009",
        "location": "Kingston New York",
        "follow_request_sent": null,
        "profile_link_color": "D02B55",
        "is_translator": false,
        "id_str": "70789458",
        "entities": {
          "url": {
            "urls": [
                "expanded_url": null,
                "url": "",
                "indices": [
          "description": {
            "urls": [

        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 19,
        "url": null,
        "profile_image_url_https": "",
        "utc_offset": -18000,
        "id": 70789458,
        "profile_use_background_image": true,
        "listed_count": 1,
        "profile_text_color": "3E4415",
        "lang": "en",
        "followers_count": 63,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "",
        "profile_background_color": "352726",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Eastern Time (US & Canada)",
        "description": "Science Fiction Writer, sort of. Likes Superheroes, Mole People, Alt. Timelines.",
        "default_profile_image": false,
        "profile_background_image_url": "",
        "statuses_count": 1048,
        "friends_count": 63,
        "following": null,
        "show_all_inline_media": false,
        "screen_name": "MonkiesFist"
      "in_reply_to_screen_name": null,
      "source": "web",
      "in_reply_to_status_id": null
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 22:51:18 +0000 2012",
      "id_str": "249279667666817024",
      "entities": {
        "urls": [

        "hashtags": [
            "text": "freebandnames",
            "indices": [
        "user_mentions": [

      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "The Foolish Mortals #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249279667666817024,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "BFAC83",
        "profile_sidebar_border_color": "615A44",
        "profile_background_tile": true,
        "name": "Marty Elmer",
        "profile_image_url": "",
        "created_at": "Mon May 04 00:05:00 +0000 2009",
        "location": "Wisconsin, USA",
        "follow_request_sent": null,
        "profile_link_color": "3B2A26",
        "is_translator": false,
        "id_str": "37539828",
        "entities": {
          "url": {
            "urls": [
                "expanded_url": null,
                "url": "",
                "indices": [
          "description": {
            "urls": [

        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 647,
        "url": "",
        "profile_image_url_https": "",
        "utc_offset": -21600,
        "id": 37539828,
        "profile_use_background_image": true,
        "listed_count": 52,
        "profile_text_color": "000000",
        "lang": "en",
        "followers_count": 608,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "",
        "profile_background_color": "EEE3C4",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Central Time (US & Canada)",
        "description": "Cartoonist, Illustrator, and T-Shirt connoisseur",
        "default_profile_image": false,
        "profile_background_image_url": "",
        "statuses_count": 3575,
        "friends_count": 249,
        "following": null,
        "show_all_inline_media": true,
        "screen_name": "Omnitarian"
      "in_reply_to_screen_name": null,
      "source": "Twitter for iPhone",
      "in_reply_to_status_id": null
  "search_metadata": {
    "max_id": 250126199840518145,
    "since_id": 24012619984051000,
    "refresh_url": "?since_id=250126199840518145&q=%23freebandnames&result_type=mixed&include_entities=1",
    "next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
    "count": 4,
    "completed_in": 0.035,
    "since_id_str": "24012619984051000",
    "query": "%23freebandnames",
    "max_id_str": "250126199840518145"

That’s from Twitter’s API reference. Now imagine writing a Java class to represent that response. Now imagine doing it again for whatever other insane responses Twitter wants to fling at you.

Jackson has what feels like hundreds of bizarre annotations and configuration options that let you control the translation process between a JSON string and your class, or get around problems like JSON responses suddenly getting new fields and mutually referential objects causing stack overflow errors due to infinite recursion. It’s a huge, complex library, and while there’s plenty of information out there, it can be hard to word a search query in the right way to find a solution to your specific problem. At my company, we briefly looked at migrating from org.json to Jackson, but it was going to be a massive amount of work to figure out the right magical incantations to make Jackson not break and then either go mark up all our database objects with the right annotations or write new view-model classes to use for JSON translation. And since none of us were Jackson experts, we had a steep learning curve ahead of us.

Imagine your car breaks down in the middle of the desert, but you happen to be right next to an abandoned auto shop with a set of books and tools. You don’t know the first thing about cars, other than where the gas goes, but you figure you’ve got a good brain, so you sit down with a book and start reading. After six months of alternately reading and tinkering on some of the forgotten wrecks left on the hydraulic lifts of this abandoned auto shop, you’re ready to start on your car. You eventually get it kind of working, though it can’t go above 37 mph, makes a weird stuttering noise as you drive, and you have to stop every 15 miles and go tighten some bolts, and also you just spent six months camped out in an abandoned auto shop in the middle of the desert. Here you’ve addressed the fundamental problem that your car is broken, and what you came up with is definitely better than a homemade car sail, but was it really worth that six months you spent in that abandoned auto shop?

The root cause, and how to fix it

The root cause of all these problems is Java’s static typing. Well, to be fair, it’s also Javascript’s dynamic typing. JSON is based on Javascript, so it’s perfectly fine with lists of mixed types, maps of string to whatever you please, and arbitrarily nested self-referential objects. Languages like Python, Ruby, PHP, and Clojure, which are dynamically typed and have a similar set of primitives to Javascript, are also fine with that, so you can trivially map any JSON object to a native data type in those languages.

Java is not fine with any of that. It demands lists of a single type and maps of strings to a single type, and it asks that you please tell it beforehand just how deep you’re going to nest those self-referential objects. You can cheat your way around this by declaring the type as Object, but working with Objects is tedious, ugly, and loses you the only advantage of having all these restrictions.

How can we fix this? Here are a few features that would definitely help. They’re listed in order of how likely they are to be added to the Java language, starting at “Maybe in Java 26” and ending at “You’ve got a better chance of convincing your company to rewrite everything in Erlang”.

Optional dynamic typing

C# supports a dynamic type keyword. Declaring a variable dynamic makes it dynamically typed: the compiler basically pretends it can be any type and supports any operation, and lets a runtime exception be thrown if you do the wrong thing with it.

This would, at the very least, make the JsonWhatever strategy a lot easier: instead of needing a JsonPrimitive with all those conversion methods, just make the JsonObject and JsonArray store values of type dynamic.

This would also make Jackson’s problem with generic types easier to deal with: instead of binding to lists of Object or maps of string to Object, it can bind to a list of dynamic or map of string to dynamic.

This approach still gives up type safety, but it makes giving up type safety a lot easier to work with and less painful.
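A sketch of what that could look like, if Java ever grew a C#-style dynamic (hypothetical syntax: dynamic is not a Java keyword, and JsonObject here stands in for the JsonWhatever types discussed earlier):

```java
// Hypothetical: 'dynamic' does not exist in Java; semantics borrowed from C#.
class JsonObject {
    private final Map<String, dynamic> values = new HashMap<>();
    dynamic get(String key) { return values.get(key); }
}

// No JsonPrimitive, no conversion methods, no casts:
dynamic name = json.get("people").get(0).get("name");
int len = name.length();  // works if it's a String; runtime exception if not
```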

Union types

Languages like Haskell and ML support something called union types (there in the tagged form of algebraic data types). Essentially, you can define a new type which can hold a value of any one of a given set of types.

Imagine that Java had a syntax like type JsonVal = Map<String, JsonVal> | List<JsonVal> | Integer | Double | String | Boolean | null to define a new type called JsonVal. This declares that a JsonVal can be any of those listed types, but no other. If our code ever tries to assign a FizzingWhizbeeFactory to a JsonVal, the compiler will complain.

This can buy us both freedom from casting and arbitrary nesting. The compiler has a list of possible types, so it doesn’t need us to tell it exactly which one a given value is; if we call a method on a JsonVal, it knows where it can look for that method, and if it can’t find the method anywhere, it can flag a compile-time error. It doesn’t need us to cast for it. And the JsonVal type is defined recursively: a JsonVal can be a Map<String, JsonVal> or a List<JsonVal>. This gets us the arbitrary nesting we need to accurately represent JSON.
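Consuming a JsonVal might then look something like this (entirely hypothetical syntax, sketched after the switch expressions found in other languages):

```java
// Hypothetical: neither 'type' aliases nor union types exist in Java.
type JsonVal = Map<String, JsonVal> | List<JsonVal>
             | Integer | Double | String | Boolean | null;

String describe(JsonVal v) {
    // The compiler knows the complete set of cases, so no casting is
    // needed, and it can check that the match is exhaustive.
    return switch (v) {
        case Map<String, JsonVal> m -> "object with " + m.size() + " keys";
        case List<JsonVal> l        -> "array of length " + l.size();
        case String s               -> "string \"" + s + "\"";
        case null                   -> "null";
        default                     -> "primitive " + v;
    };
}
```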

There are other challenges to getting union types right in Java. For instance, if I have a union type type MyType = A | B and C is a subtype of B, can a MyType hold an instance of C? Can I cast an instance of C to B and assign it to a MyType? Can I then downcast a MyType holding a B to C if that’s what it really is? Or do we just do what generics did, throw the whole mess out, and say that any type which isn’t explicitly listed in the declaration of the union type is invalid? But if we could get union types in Java, they would go a long way towards solving our JSON problem in a type-safe way.

A Dedicated JSON type

This is the easiest option for programmers, and the most deranged from a language design standpoint.

Scala, Visual Basic, and Kawa (a dialect of Scheme on the JVM) have XML literals. In Scala, for example, you can write this:

val people = 
        <person firstName="Ted" 
                gender="M" />
        <person firstName="Jeffery" 
                gender="M" />
        <person firstName="Charles" 
                gender="M" />
        <person firstName="Rasmus" 
                gender="M" />

So in Java, we could just modify the language to let you write this:

Json someJson = {
    "people": [
        { "firstName": "Adolf", "lastName": "Hitler" },
        { "firstName": "Idi", "lastName": "Amin" },
        { "firstName": "Pol", "lastName": "Pot" },
        { "firstName": "Hiro", "lastName": "Mashima" }
    ]
};

The Json type can have a static method, Json.fromString, that inflates a string into a Json, and its toString method can be overridden to serialize a Json back into a string.
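Round-tripping with that API might look like this (hypothetical: neither the Json type nor the literal syntax exists in Java):

```java
// Hypothetical built-in Json type with literal syntax.
Json someJson = { "killer": "Colonel Mustard" };

String wire = someJson.toString();    // serialize: {"killer":"Colonel Mustard"}
Json parsed = Json.fromString(wire);  // inflate it back
assert parsed.equals(someJson);
```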

Scala can sort of do XML literals as part of the normal language, because Scala’s design is already deranged. In Java, the language would have to be modified. But there is precedent for this: arrays in Java are already just random tacked-on garbage with special syntax and special behavior that applies to no other part of the language. And Java’s principal use today is as a server-side web programming language; given how important JSON is to web programming, it might be worth modifying the core language to add this.

Still, this is the least likely of the options. Adding lambdas and method references was a massive deal, and those had already proven their worth in dozens of other languages over a period of almost fifty years. Java is not a language that likes revolutionary features; it prefers to wait until every language in the universe, even C++, has a feature, and then panic and squeeze something on top of the existing language features that’s just good enough to disunite the opposition.


Dealing with JSON in Java is quite painful, and it’s because JSON is inherently dynamically typed due to its heritage in Javascript’s native object type, and Java’s static type system is too rigid to accommodate it in an elegant way.

Since JSON is used for communicating between systems and components of a system, the difficulty of working with JSON in Java can affect how you choose to design your API responses. You might choose to favor flat objects over nested, as we used to do at my company. Instead of something like this:

    "room": {
        "name": "library",
        "carpetType": "Persian rug"
    },
    "killer": {
        "name": "Colonel Mustard",
        "weapon": "lead pipe"
    }

You might choose this:

    "roomName": "library",
    "roomCarpetType": "Persian rug",
    "killerName": "Colonel Mustard",
    "killerWeapon": "lead pipe"

In Java, with Jackson, the second form requires a single class with four strings. The first form requires three classes: a top-level class with a room field and a killer field, plus a two-string class for each of those, and probably a few annotations to make sure everything happens correctly when you bind from a string. So up front, the first form is more complicated to work with in Java, and this extra complexity just gets more pronounced as your JSON responses get more complex.
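A sketch of the two bindings as plain POJOs (Jackson would also want getters or annotations, which are omitted here; the class names are my own):

```java
public class BindingShapes {
    // Nested form: three classes to mirror the JSON structure.
    static class Room { String name; String carpetType; }
    static class Killer { String name; String weapon; }
    static class NestedResponse { Room room; Killer killer; }

    // Flat form: one class, four strings.
    static class FlatResponse {
        String roomName, roomCarpetType, killerName, killerWeapon;
    }

    public static void main(String[] args) {
        NestedResponse n = new NestedResponse();
        n.room = new Room();
        n.room.name = "library";

        FlatResponse f = new FlatResponse();
        f.roomName = "library";

        // Same data, but the nested form costs an extra level of
        // object construction for every level of nesting.
        System.out.println(n.room.name.equals(f.roomName));
    }
}
```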

But using the second form causes other problems. Since JSON keys have no guaranteed order, you could send the response to a client like a browser and have it print with the keys all randomized, which makes the response much harder to read if you need to debug. If a client of your API only cares about the room, they have to loop over all the keys and find the ones that start with room instead of being able to just grab the room object. You can write that annoying code every time, or you can write some stupid little helper function that you have to document and test to loop over a JSON object and find all the keys with a certain prefix. You’ll probably also have to duplicate this function across various code bases in your stack. This tradeoff might be worth it if you just have a bunch of Java code communicating (e.g. an Android app and a Spring API), but if you have other languages in your stack, like Javascript or Python, you’re forcing them to conform to the lowest common denominator.
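That stupid little helper might look something like this (a sketch; the function name and the camel-casing rule for the stripped prefix are my assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixKeys {
    // Collect the entries whose keys start with the given prefix,
    // stripping the prefix and lower-casing the first remaining letter.
    static Map<String, Object> withPrefix(Map<String, Object> flat, String prefix) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : flat.entrySet()) {
            String k = e.getKey();
            if (k.startsWith(prefix) && k.length() > prefix.length()) {
                String rest = k.substring(prefix.length());
                String key = Character.toLowerCase(rest.charAt(0)) + rest.substring(1);
                out.put(key, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("roomName", "library");
        flat.put("roomCarpetType", "Persian rug");
        flat.put("killerName", "Colonel Mustard");
        flat.put("killerWeapon", "lead pipe");

        // Reassemble the "room" object from the flat keys.
        System.out.println(withPrefix(flat, "room"));
    }
}
```

And now you get to document, test, and re-implement that in every language in your stack.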

If you’re writing web code nowadays, JSON is a fact of life, and it’s a great transmission format: simple, easy to parse, and able to represent almost any kind of data in a human-readable way. It’s elegant and easy to use, a big improvement in a lot of ways over RPC formats like COM and CORBA, unreadable binary formats, and XML. It’s a travesty that Java has such terrible support for something so elegant, widespread, and fundamental.