# A Comparison of Four Algorithms Textbooks

At some point, you can’t get any further with linked lists, selection sort, and voodoo Big O, and you have to go get a real algorithms textbook and learn all that horrible math, at least a little. But which book? There are tons of them.

I haven’t read every algorithms book out there, but I have read four of them. Maybe my experience with these four can help guide your decision. The four books are Algorithms, by Dasgupta, Papadimitriou, and Vazirani (hereafter called Dasgupta); Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein (hereafter called CLRS); The Algorithm Design Manual, by Steve Skiena (hereafter called Skiena); and The Art of Computer Programming, Volumes 1-3, by Donald Knuth. I’ll do a five-point comparison, going over the prose style, code use, mathematical heaviness, breadth and depth of topics, and position on the continuum between theoretical and practical of each book.

There’s one thing you should know before I start: Dasgupta is available for free online, while Knuth and CLRS aren’t (well, they probably are, but not legally). So that might make your decision for you. I haven’t been able to determine if Skiena is legally available online. The book is posted in either PDF or HTML on a few legit-ish looking sites, but Skiena’s own page for it doesn’t mention anything about it being freely available, so proceed with caution.

## Does it have to be one of these four?

Not at all. There are lots of good algorithms books that I’ve never read. You can even learn a lot by just reading CS.SE and Wikipedia. But these four are the ones I’ve personally used. Many people like one of these four, but they do reflect my taste and biases. But even if you decide to go with a different book, this overview might at least tell you what features to notice.

## Prose style

A good algorithms book will usually explain its topics in three ways: with a natural language prose explanation, with some kind of implementation code, and with mathematical notation. When you’re starting out, the prose is usually the most important part, because it takes you from “What the hell is a binary search?” to “Got it, how do I write actual code for a binary search?”

For me, the winner on prose style has to be Knuth. He’s just a masterful writer. You might have to read some of his sentences twice, but that’s only because he gets across twice as much information as a lesser writer. Knuth’s prose explanations got me to understanding on several topics I’d despaired of ever getting with other books, like B-trees and merge sort.

Skiena is also an excellent prose stylist. Where Knuth is elegant and flowing, like the John Milton of algorithms, Skiena is direct and sharp, like the Ernest Hemingway of algorithms. Knuth and Skiena are good at different things: Skiena is good at finding a simple, direct way to explain things which are traditionally made more complicated than they need to be. His overview of Big O is amazingly clear and simple, as long as you have at least some memory of calculus. On the other hand, some topics are inherently really complicated, and this is where Knuth shines: he’ll give you several different ways to view a complicated topic, and chances are at least one of them will work for you.

Knuth is much quicker to jump to math, while Skiena mostly eschews math and tries to keep everything intuitive. We’ll discuss that more in the section on mathematical heaviness.

Dasgupta has a pretty good prose style too, more patient and beginner-friendly than either Skiena or Knuth. Sometimes, I found Dasgupta too verbose and felt the authors belabored the obvious too much, but other times that was an asset rather than a weakness (e.g. in the material on Fast Fourier Transforms).

CLRS probably has the weakest prose style of these four books. It’s closer to the standard math text style, written with a formal and distant air, sometimes favoring precision over clarity. It’s strictly functional, but tolerable if you already have some understanding of the topic.

## Code use

Knuth’s big weakness is code. He uses two different notations, pseudocode and MIX assembly language, his imaginary assembly for his imaginary computer. I find both of them extremely difficult to follow.

The problems with MIX should be pretty obvious: it’s assembly language, so it’s just hard to read. Not only that, it’s pretty different from the MIPS, ARM, or x86 assembly that modern readers might have seen. It’s designed to be run on either a decimal or binary architecture and assumes a six-bit word. Knuth put a ton of effort into creating this imaginary machine and its assembly language. Since it’s made up, MIX assembly is still technically pseudocode; but MIX is to pseudocode as Quenya is to pseudo-languages. Newer editions of TAoCP use MMIX, which was redesigned to reflect modern assembly languages. I haven’t seen it yet, but I imagine it’s still hard to read since it’s assembly language.

Knuth also uses high-level pseudocode, but I find that hard to read too because it’s organized like an unstructured language with goto statements. If I were planning to implement the algorithms in an unstructured language, it would probably be fine, but I’ve always found that there’s a nontrivial translation process between Knuth’s unstructured pseudocode and structured pseudocode suitable for implementation in a modern language.

CLRS and Dasgupta both use high-level pseudocode that resembles Pascal, although not too slavishly. CLRS expresses some things as if they were methods or fields of an object, in a semi-object oriented way. Skiena also does some of this, but in addition he uses real C code.

A lot of modern books use C or Java throughout, which might be more to some people’s tastes. These books reflect my taste, and I like my algorithms language-independent. I don’t mind how Skiena uses C—he uses it mainly to show how to implement something when there’s a significant gap between theory and practice. But I’m glad he stuck to pseudocode for the most part.

## Mathematical Heaviness

In this category, going from heaviest to lightest, we have roughly Knuth > CLRS > Dasgupta > Skiena. Dasgupta and Skiena are pretty close, while there’s a significant gap between them and CLRS. There’s a minor gap between CLRS and Knuth, and it really depends on the topic, but CLRS just didn’t take on as many inherently mathematical topics as Knuth did (especially in Knuth’s Volume 2, Seminumerical Algorithms, where he deals with random number generation and efficient arithmetic).

In general, Dasgupta is geared towards people (like students) who just recently learned a little math and are going to learn a little more, but haven’t internalized it all yet. Skiena is geared towards people (like working programmers, or graduate students) who’ve learned some math and forgotten a lot of it. Skiena is also student friendly, but he stresses practicality more than anything, and sometimes, in practice, you gotta do some math.

CLRS is also for students, but it’s more at the graduate student level, especially the later “Selected Topics” chapters that can sometimes get pretty heavy. Knuth seems more geared towards graduate students near the end of their Ph.Ds and practicing researchers.

All four books require you to know some combinatorics and a little number theory, and to have at least a vague memory of limits and function growth from calculus. That’s about it for Skiena and Dasgupta, though Skiena seems to expect you to have internalized the math a little more than Dasgupta does. The basic chapters of CLRS are just a little beyond Skiena in math background, but some topics get quite heavily mathematical. Knuth gets pretty heavily mathematical in almost every topic.

I’ve been conflating two things in the discussion so far, since they’re sort of linked: actual knowledge of theorems and axioms in mathematics, and the ability to read mathematical writing and proofs and follow an author’s reasoning (what’s often called “mathematical maturity”). Knuth requires the most technical knowledge of mathematics of the four here, and he’ll also stretch your reading ability pretty far; but CLRS will also stretch your reading ability, and Dasgupta is no cakewalk either, although I do think Dasgupta qualifies as a picnic. Skiena doesn’t have any proofs, but he kind of sneaks in proof-like things with his “war stories”, which usually read quite a bit like proofs, extended with discussion of some of the approaches he tried that didn’t work out.

If you’re getting your first algorithms book and you’re a CS undergrad, Dasgupta or Skiena is easiest to follow. CLRS will stretch you a little, but it’s manageable. If you’re a math or physics undergrad, CLRS shouldn’t be too hard and Knuth might be doable.

## Breadth and Depth of Topics

Dasgupta is the loser here; not only does the book cover fewer topics than the others, the topics it chooses to cover are poorly organized and somewhat eccentric. I’m not sure why the authors threw in the awful chapter on quantum computing at the end; it’s totally incomprehensible unless you already understand quantum mechanics, which is no mean feat.

CLRS has the best balance of breadth and depth: it covers basic data structures; some more advanced ones like red-black trees, B-trees, and the union-find data structure; algorithmic paradigms like greediness and dynamic programming; graphs, shortest paths, and network flows; sorting; amortized analysis; and an assortment of other topics such as number theoretic algorithms, cryptography, linear programming, string matching, and NP completeness.

Skiena covers about the first third of CLRS, but he does a lot more with NP complete problems and how to design approximation schemes for them than CLRS does.

Knuth, of course, is the master of depth. Volume 1 covers math background and fundamental data structures; Volume 2 covers random number generation and arithmetic; Volume 3 covers searching and sorting, going through various sort routines and some more advanced data structures, such as B-trees, as well as developing the whole theory behind hash tables and how to choose hashing functions and table sizes; and Volume 4 covers combinatorial algorithms. Volume 4 is split into three subvolumes; right now, only Volume 4A has actually come out.

If you want to get just one book, I would get Skiena or CLRS. They include all the most important topics both in practice and for undergraduate classes. Skiena, as a bonus, even includes some computational geometry material, in case you’re doing video games or computer graphics.

## Theoretical vs Practical

Skiena is the most relentlessly practical of this bunch. Knuth was actually pretty practical at the time Volume 1 came out, but he became less and less practical as time went on because people stopped writing things in assembly or using tape drives. Both give you implementation tips to make your code perform better, although Knuth’s can be hard to capitalize on since you’re not writing in assembly.

CLRS and Dasgupta are both theoretical in the sense that they mostly ignore the computer. Everything in CLRS will work, but sometimes their approach is too “straight from the theory”, as I discovered when I implemented Karp-Rabin according to their discussion and put it on Code Review.SE after struggling with numerous overflow-related bugs, only to have someone suggest a different approach that rectified all the performance issues and handled overflow elegantly.

Dasgupta and CLRS are both still good books, and if you’re just starting out learning algorithms, don’t get too caught up on this issue. Write your code in Python or Ruby, some quick and easy language that also ignores the metal. You’ll get the idea of the implementation, and you can deal with the issues around implementing it in some other language later. (Python even looks like the pseudocode in CLRS.)

## Conclusion

I probably use CLRS the most, because of its breadth of coverage and because its pseudocode is so much easier to follow than Knuth’s. If I don’t understand the concept or the mathematical underpinnings, I’ll go to Knuth for the deep discussion, and the excellent prose style.

I just got Skiena recently, but in the future, I expect him to usurp CLRS somewhat. He covers most of the same material, but his prose style is more direct and his approach is more practical. Skiena is excellently organized and has a catalogue of problems and their algorithmic solutions in the second half of his book, good for browsing when you’re stumped.

I don’t use Dasgupta that much. Dasgupta was the text for my undergrad algorithms class, and while it was good in that capacity, it’s not really a book that continues to be useful once you’re past the first-semester course, mainly because of the lack of breadth in coverage and the eccentric organization and choice of topics.

# What the Heck is Dynamic Programming, and Why Should I Care?: Being the first in a series on dynamic programming

Dynamic programming. It sounds like one of those Agile methodologies, like Extreme Programming. If you look into it expecting to find project management philosophies, though, you’ll quickly become horrified by the amount of weird-looking math that’s involved. Still, quantitative measures of productivity are good, right?

Dynamic programming is actually not dynamic, and it’s not programming, and it’s definitely not an Agile methodology. The name ‘dynamic programming’ is one of the worst, most misleading names in computer science, right alongside ‘computer science’, which makes non-computer scientists think that we fix people’s computers, but we’re really pretentious about it and think fixing people’s computers puts us on the same level as Albert Einstein and Jonas Salk. But the only other name considered for dynamic programming was ‘bottom-up recursion with memoization’, and that’s about as imposing as it gets, right alongside ‘first order homogeneous linear partial differential equation’, or the ‘infinite-dimensional Banach spaces’ that Guy Steele said in Coders at Work that he couldn’t ever get the hang of.

Still, the name ‘bottom-up recursion with memoization’ at least has the advantage of being accurate, so let’s start with that and try to make it less imposing. After we’ve explored the many parts of that sadly unused name, I’ll sketch out a problem of the type that dynamic programming might be good for. With that example in hand, you’ll hopefully have a better understanding of all the parts. Then I’ll talk a little about why you should care.

## Bottom-up recursion with memoization

Dynamic programming is recursion. Whether that helps you or not depends on how much you like recursion (most people don’t much like it).

Recursion tends to work on problems with recursive structure, i.e. a structure where big problems can be cut into smaller problems with the same form, which can be cut into even smaller problems, down and down and down, until you reach problems so small that, hey, you know the answer! Recall the factorial: $n! = n(n-1)(n-2)\cdots3\cdot2\cdot1$. This function also has a recursive definition: $1! = 1; n! = n(n-1)!$. This says that the factorial of 1 is 1, and the factorial of any other natural number $n$ is $n$ times the factorial of $n-1$. So $4! = 4(3!) = 4\cdot3(2!) = 4\cdot3\cdot2(1!) = 4\cdot3\cdot2\cdot1$.

Dynamic programming is a technique for solving optimization problems. An optimization problem is one with lots of correct answers, but with some answers better than others according to some measure other than right vs. wrong. Finding the shortest path in a graph, finding the longest increasing subsequence of a sequence of numbers, and pricing some items to get maximum profit are all optimization problems. They involve words like “shortest”, “longest”, “maximum” that imply some kind of measure of quality on the answers you find. And all of these problems can be solved with dynamic programming.

## With memoization

Memoization is actually the easiest part of dynamic programming to understand. If you understand hash tables, you understand memoization. You don’t even have to understand how hash tables work; you just have to understand the idea that they can look up stuff really fast.

In normal recursion, your intermediate results are all stored on the call stack. Let’s look at our good friend the Fibonacci sequence in Python.

def fib(n):
"Calculates the nth Fibonacci number."
return fib_helper(0, 1, n-2) # 0 and 1 are the first two.

def fib_helper(a, b, n):
"Does the actual recursion for fib(n)."
if n == 0:
return a+b
else:
return fib_helper(b, a+b, n-1)

That works, but what if we need to calculate all the Fibonacci numbers between 1000 and 2000? Then we might end up writing something like this:

for i in range(2000):
f = fib(i)
if 1000 < f < 2000:
print(f)

Every time we call fib(i) inside that loop, it calculates the Fibonacci numbers from 0 to i, even though the first i-1 Fibonacci numbers haven’t changed at all since the last time we calculated them. We can’t start at the first Fibonacci number above 1000 because we have no idea what value to give fib to get it. Even if we looked it up somewhere, we’d still end up calculating all the Fibonacci numbers below it a bunch of times and then throwing them away.

This is where memoization comes in. Because looking things up in hash tables (or lists, or arrays) is so fast, we can save a lot of time if we save our calculations in a data structure and look them up when needed. We only calculate a value if it’s not stored in the data structure already.

Here’s a memoizing version of the Fibonacci numbers.

memo = {0: 0, 1: 1}

def fib_memo(n):
"Calculates Fibonacci sequence with memoization."
if n in memo:
return memo[n]
else:
memo[n] = fib_helper(0, 1, n - 2)
return memo[n]

def fib_helper(a, b, n):
if n <= 0:
return a + b
else:
return fib_helper(b, a+b, n-1)

We first declare a memo, a dictionary whose keys are values of $n$ and whose values are the values of $F_n$, the $n$th Fibonacci number. Whenever fib_memo is called, it checks the memo to see if we’ve ever calculated $F_n$ before and stored it there. If so, it just returns that stored value. If not, it calls fib_helper to calculate $F_n$ and stores it in the memo for future reference.

Now when we run the code

for i in range(2000):
f = fib_memo(i)
if 1000 < f < 2000:
print(f)

then instead of recalculating a bunch of Fibonacci numbers each time, we calculate them once, and every time we need them after that, we get them out of a dictionary. Dictionary lookups are super-fast; Python implements its dictionaries as open-address hash tables in C, which means that looking something up is basically a single array access.

That’s pretty much all there is to memoization; it just means storing stuff in a data structure with fast lookup in case you need it again, so you can look it up instead of recalculating.

## Bottom-up recursion with memoization

Most common cases of recursion that you see in programming classes are cases of top-down recursion. This means that you start with a big instance of the problem and cut it down until you get instances that are easy to solve.

For instance, with the factorial, you start with a big instance, $n!$. You cut it into a smaller instance of the problem by rewriting it as $n \cdot (n-1)!$—you replace an instance with a big factorial by an instance with a smaller factorial. Then you cut down $(n-1)!$ by replacing it with an instance that has an even smaller factorial, $(n-2)!$.

Mergesort is another example of top-down recursion; you start with a big array and cut it in half. Then you cut the halves in half. Then you cut the fourths in half, and so on, until you have a bunch of arrays with one element, which are already sorted, because there’s nothing to compare them to.

But think about how the Fibonacci function above worked. We didn’t start with a big instance and cut it down into smaller instances. Instead, we started with two small instances—calculating $F_0$ and $F_1$, which we already knew were 0 and 1—and built them up into a bigger instance. We could do this because of the formula $F_n = F_{n-1} + F_{n-2}$, which tells us how to get a bigger Fibonacci number from two smaller ones. In the fib_helper function, in order to know when to stop, I introduced the parameter n, which counts down. When it reaches 0, the recursion stops.

This type of recursion, where you start with small instances and build up bigger ones, is called bottom-up recursion. Dynamic programming algorithms are pretty much all bottom-up recursive, and the reason is optimal substructure. If a problem has optimal substructure, it means that the best solution to a big instance can be computed using the best solutions to smaller instances. We need to know the results from smaller instances to calculate the results of bigger instances.

In the Fibonacci number problem, the formula for $F_n$ depends on the previous two numbers, so to calculate $F_n$, we have to have $F_{n-1}$ and $F_{n-2}$. In some problems with optimal substructure, the solution to an $n$-sized instance depends on the solutions to all the smaller instances, from $n-1$ all the way down to 1.

Many problems can be solved with either top-down or bottom-up recursion; for instance, we could calculate $n!$ using bottom-up recursion with a function like this:

def bottom_up_factorial(n):
return factorial_helper(1, 1, n)

def factorial_helper(a, x, n):
if a == n:
return x
else:
return factorial_helper(a+1, x*(a+1), n)

And here’s a function that calculates Fibonacci numbers with top-down recursion:

def fib(n):
if n == 0 :
return 0
elif n == 1:
return 1
else:
return fib(n-1) + fib(n-2)

You can even memoize top-down recursion, although it doesn’t always buy you anything. I don’t know of any problems that really benefit from using top-down recursion with memoization over using dynamic programming.

## Bottom-uprecursion with memoization

The hook that ties together these three things—bottom-up, recursion, and memoization—is optimal substructure. This will probably become more clear next time when we look at a problem, but for now just try to understand as much as you can.

Having optimal substructure implies that instances of a problem are somehow composed of smaller instances of that problem. This leads us to recursion.

In most of the problems solved by dynamic programming, the same smaller instances of the problem get examined over and over again when we find the answers for larger instances. We use memoization to make this more efficient—instead of recomputing the answers every time we need them, we memoize them the first time and just look it up after that.

We benefit from using bottom-up recursion in this situation for two reasons: it’s more efficient, and it’s easier to understand. If we start with the small instances and calculate their solutions, then we have all those solutions memoized. When we start working on a bigger instance, we know all the smaller answers we need are already stored in the memo. If we did top-down recursion, it might be the case that some of the smaller solutions we needed hadn’t been memoized yet. Then we would have to stop and calculate them. We’d need extra checks to make sure we had the solution, and extra recursive calls to calculate ones we didn’t have, and the program flow is a lot messier. With bottom-up recursion, every value we need is already memoized, so it’s fast and the program flow is cleaner.

That’s vaguely why problems with optimal substructure are so nicely solved by using bottom-up recursion with memoization. Next time we’ll go over some example problems, and hopefully this will all be more clear.

## Postscript

I want to talk a little about why you should care about dynamic programming.

Dynamic programming is an advanced algorithm design technique. If you don’t think you need it, you probably don’t. But there are some very nasty problems that we really want to solve, and dynamic programming turns out to be perfect for them.

Weirdly, dynamic programming turns out to be a really good technique for optimization problems on strings. If you’ve ever used TeX, you might have noticed that it makes really nice paragraphs for you when you output to PDF. In Digital Typography, Donald Knuth talks a little about his algorithm for breaking up text into lines that form nice paragraphs. He frames it as an optimization problem: given a big mass of text, there are tons of ways you can choose to break the text into lines that form a paragraph. The problem is to find the “best” way. Knuth invented a way of measuring the “badness” of putting a break in a particular spot. The badness depends on things like how much extra space would need to be added to make the line stretch to the edge of the margin, and whether you have to split a word in half with a hyphen to put a break there. The algorithm uses dynamic programming to build lines, bottom-up, and finds the paragraph with the smallest badness over the line breaks.

Another place where dynamic programming is useful is bioinformatics. This field involves the analysis of genetic sequences gathered in biological research. DNA is made of molecules called nucleotides, strung together into chains. The structure comes from the nucleotide’s nitrogenous base; nucleotides can have one of the four bases adenine, guanine, cytosine, and thiamine. RNA is similar, but it has uracil instead of thiamine.

We can represent a strand of DNA as a string of the letters A, G, C, and T. These strings can be megabytes or gigabytes of text. (You can think of these strings like biological machine code; instead of each bit having the two states, 0 and 1, they have four states.) There are various tasks we might want to do with these strings, like searching for matches between them—maybe we’ve found a particular gene in gorillas, and we want to know if humans also have it.

Dynamic programming is an excellent way to analyze these genetic strings. If you’re interested in learning more about algorithms on genetic data, the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, is very good. Gusfield taught my undergraduate algorithms class at UC Davis and is responsible for most of what I know about dynamic programming; he’s an excellent teacher, good at delivering clear explanations, and somewhat well known in the bioinformatics field, and the book is just as clear as his lectures, and beautifully typeset.

(I’d like to say I was hanging out with Gusfield and soaking up his knowledge while I was in his class, but the truth is we never spoke. I had a two-hour drive to get home, and his office hours were at six at night.)

Anyway, those are a few reasons why you should care, even if you probably won’t ever have to invent a dynamic programming algorithm.

# Second Ode on Induction: The Natural Numbers

Last time we went over an example of induction. Yes, even after I promised that we’d dispense with the examples and really dig in to the meaning behind induction. I did that because I wanted to show you something a little more complicated and a little more computer sciencey than the usual examples you see in discrete math for programmers, like proving that $\sum_{r=0}^{n}r = \frac{n(n+1)}{2}$. Binary trees are fairly simple and very computer sciencey, so I thought they’d be nice. I know I didn’t see a decent induction proof involving a data structure until my upper-division algorithms class, and that’s just too late, in my opinion.

Anyway, I said last time that the three parts of induction could be explained in terms of an argument you make to your friend when he says he was abducted by aliens who stripped him naked and looked at him. Like this:

1. Base case: “Here’s an alien from the species you say abducted you, let’s call them the Borg. This Borg says he personally would never travel across the entire galaxy from the Delta Quadrant just to see you naked.”
2. Inductive hypothesis: “Let’s assume that if you polled $n$ Borg drones, they would agree with this drone—they wouldn’t fly across the galaxy to see you naked.”
3. Inductive conclusion: “Given that $n$ drones concur, $n+1$ drones concur. Therefore the entire collective concurs. The Borg would never fly from the Delta Quadrant so they could abduct you and look at you naked.”

Why does the inductive conclusion work? Why can we assume that $n$ concurring drones means $n+1$ concurring drones, and that this property means a unanimous collective? If you’ve seen Star Trek, you know that it’s because the Borg have a hive mind. If one drone holds an opinion, that opinion is shared by the entire collective. But why does this work for induction?

## The Natural Numbers

The natural numbers are just the normal counting numbers that you learned in kindergarten (1, 2, 3…). Zero technically isn’t a natural number, but sometimes it’s convenient to include it (our binary tree example from last time was one instance where it was useful).

Imagine that we lined up all the Borg drones in the collective in a big line that stretches off to infinity, and gave each of them a number, starting at 1.

The Borg can assimilate other species and make them into Borg drones, so the collective is always growing. Suppose a poor dumb Bolian gets assimilated and has to join the line, but we don’t want to make him walk all the way to the end, so we let him cut in right here in the middle, at position 498. Since we want our Borg line to be in order by number, we just give our new Bolian drone the number 498, and the other drones’ numbers all go up by one. No matter how many drones were in our line before we let the Bolian cut in at position 498, there are always enough natural numbers that we can bump everyone up by one without running out. And no matter how many new drones we allow to cut into the line at the middle, we can always repeat this move and bump everyone’s number up by one. Everyone always gets an index.

This is what mathematicians mean when they say the natural numbers are infinite: no matter how many numbers we assign to drones, there are always more of them. The natural numbers never run out.

## The Peano axioms

But how do we know that bumping everyone’s number up by one doesn’t somehow change someone’s number into something that’s not a natural number? How do we know that we won’t bump up number 5354 and somehow end up with number 5354.12?

Your first reaction to that was probably “Of course we won’t. How could that happen? That’s stupid!” But mathematicians like to be extra careful about things like that. They like to prove that it can’t happen. But there’s really no obvious way to actually prove that in the mathematical sense. All we know is that if we have five mangos and we get another mango, we have six mangos; we don’t have 5.5 unless we cut the sixth mango in half and eat one half.

When mathematicians can’t prove something, but it seems obvious or self-evident that it should be true, they often make it into an axiom. Axioms are rules that we can’t prove, but that seem like they should be true, so we just assume they are and use them to prove things. Even the most skeptical mind can’t constantly be in doubt about everything, or knowledge would never get anywhere. For example, when scientists run experiments, they usually assume that they aren’t hallucinating the entire experiment from the inside of a padded cell. You can’t really prove this; we could all be in the Matrix right now. But it seems unproductive to worry about that, so we just assume that isn’t the case and go about our lives. In the same way, mathematicians are allowed to assume certain things are true, if it seems like it just couldn’t work any other way. Non-Euclidean geometry is an interesting example of what can happen if you change your axioms a little, but we won’t go into that here (maybe in a future post).

So there are certain things we just assume to be true about the natural numbers, certain axioms that define them. These are called the Peano axioms, after Italian mathematician Giuseppe Peano, who worked on mathematical induction. See Wikipedia for more information.

For induction, we mostly care about just three of the Peano axioms. Those are:

1. 1 is a natural number.
2. If $n$ is a natural number, then $n+1$ is a natural number.
3. If $K$ is a set which contains 1 and which contains $n+1$ whenever it contains $n$, then $K$ contains every natural number.

The first two give us a recursive definition (!) of the natural numbers. 1 is a natural number. Since 1 is a natural number, 1+1 = 2 is also a natural number. Since 2 is a natural number, 2+1 = 3 is also a natural number. So on. We can go as high as we want, but hopefully you see what’s happening; if we start with one Borg drone, labeled 1, and keep adding drones to the front of the line, the original drone’s label will keep going higher and higher. According to the second axiom above, no matter how high the first drone’s number gets, it’s always a natural number, because it was a natural number when we started and we just bumped it up by one.

Mathematicians often say that $n+1$ is the successor of $n$. Rather than saying that you add 1 to a natural number to get the next natural number, they think of it as applying a successor function, $S(n)$. This way of thinking about it is useful later, when we start applying the ideas behind the Peano axioms to things other than numbers, like binary trees.

The third axiom above is called the inductive axiom. It’s what allows us to write proofs by induction. But we need one more idea to see exactly why induction is valid: the isomorphism. We’ll go over that next time. Isomorphism will show us how we can create a set of natural numbers that represent some other object, like sums or binary trees, prove things about that set, and translate those proofs into proofs about our original objects.

## Summary

• The natural numbers are infinite; no matter how many of them you have, there’s always another one.
• The natural numbers can be defined recursively by starting at 1 and taking the successor.
• One of the rules of the natural numbers, the inductive axiom, combines with the idea of isomorphism to give us proofs by induction.

# First Ode on Induction

Induction is one of the weirdest proof techniques around. It took me forever to fully understand it. It’s complicated; it has all these different parts; and it’s difficult to understand how it even accomplishes its stated goal of proving that something is true for all natural numbers. Most people probably have the following three questions the first time they see induction:

1. WTF did I just see?!
2. How did that even work?!
3. How did they ever come up with this?

I know I did. It’s unfortunate that induction is so hard to understand, because it’s also extremely important for computer science; in fact, probably 90% of all the proofs I’ve ever seen in theoretical computer science were proofs by induction. That’s because induction captures the idea of doing something repeatedly, just like the loops and recursion that are used all over the place in programs. But when my school covered induction (in a lower-division discrete math grab bag course), the teacher just slapped it on the board and said “There it is.” Most of the class just memorized how to do simple induction proofs, and by the time I was in upper-division theory of computation and algorithms classes, the professors were expecting proofs out of us that most of the students had no idea how to write, because they never learned how to apply induction to anything more complicated than $\sum_{i=0}^{n} i = \frac{n(n+1)}{2}$.

That’s why I’m going to try and give a little insight into induction. Once I knew what I’d just seen, how it even worked, and how anyone ever came up with it, I saw that induction is actually a fascinating and elegant proof technique. I also, eventually, saw why it’s so important for computer science, but most CS classes treat math as a tool, like vi or Git, that you learn on your own time; throw up some formulas and some examples to copy, and everyone goes home and memorizes it. I learned what I learned about induction from math classes, and I’m going to try and put it out there so all CS majors can learn to appreciate induction.

## An Example, just to set the stage

First, though, I am going to make a total liar of myself and walk you through an example. That’s because induction has so many parts to it that I want to make sure we’re all on the same page about that before I start going into the background. I’m going to assume that you’ve seen all the simple examples involving sums, and that you’re ready for something a bit more complicated. So let’s talk about binary trees. Actually, let’s talk about full or perfect binary trees. And here one is:

(Source: http://www.csee.umbc.edu/~chang/cs202.f98/readings/trees/trees.html. Go there if you’re unclear at all about trees.)

In textbooks, the statements to be proved are always provided for you, free of charge, but in general you have to actually do some digging to figure out just what you’re going to prove. To make this more real, let’s suppose that our binary tree represents something—for example, we can say it’s a decision tree for some kind of simple AI program that decides whether it likes a car or not based on what features the car has. Each node is a feature; taking the left branch means yes, the car has the feature, while taking the right branch means no, the car doesn’t have the feature. Something like this:

(I know, no one uses CD players anymore, but fitting “stylish adaptive media center with voice-activated controls and iCharging Station” onto the picture just wasn’t worth it.)

You can see that when we add a feature, we don’t just add one new node for that feature. We had just one node for leather seats, but we needed two for air conditioning—one for the reality where we had leather seats, and one for the reality where we didn’t. Then when we added CD player, we needed four nodes—one for the reality where we had leather seats and air conditioning, one for the reality where we had leather seats and no air conditioning, one for the reality where we didn’t have leather seats but had air conditioning, and one for the reality where we had neither leather seats nor air conditioning. You can probably see that if we add another feature—say, power windows—we’ll need eight new nodes: one for leather seats, air conditioning,CD player, power windows; one for leather seats, air conditioning, CD player, no power windows; one for leather seats, air conditioning, no CD player, power windows; one for…

Let’s say that the children of the bottom feature nodes are decision nodes–each one represents a decision of yes, buy the car, or no, don’t buy it. Then we could have something like this:

In this case the decision of whether to buy or not is based only on whether the car has air conditioning, but we could have made a separate decision for every possible combination of features, and then we’d have eight decision nodes down there at the bottom.

We can see just from this small example that the number of nodes we have to store gets big pretty quickly as we add more and more choices, but just how quickly? What we’d like is some kind of formula where we plug in the number of features we want to consider and get back how many nodes that will require. That way, if we know how big a node is and how much memory our computer has, we can figure out how many features we can have before it crashes the computer.

(I hope it’s obvious that having a formula is better than just trying some random number of features, seeing if the program crashes, and trying some random smaller number if it does. Even an approximation formula that gets you in the right ballpark is helpful.)

If you try out some examples—adding features, drawing in the new nodes, and counting them—you’ll probably notice a pattern: each feature requires exactly twice as many nodes as the feature before it. Leather seats only required one node, but air conditioning required $2 \times 1 = 2$ nodes, and CD player required $2 \times 2 = 4$ nodes.  This makes sense, because each node leads to two children—a yes child and a no child—and all the child nodes of one feature represent the next feature. So how do we turn that into a formula? Well, if we want to know how many nodes there are in the whole tree, we can just add up the number of nodes for all the features. And we know that each feature has twice as many nodes as the feature before it. So if we have $f$ features, we can use something like $N = 1 + 2\times 1 + 2\times 2 \times 1 +\ldots + 2\times 2 \times \ldots (f$ times) $\times 1$, for $N$ the number of nodes in the tree, right? Or in somewhat cleaned up, more mathy notation:

$N = \sum_{i = 0}^{f} 2^i$.

## Let’s not be hasty

Now that you’re a big time computer scientist, you’re not allowed to assume that something always works just because it seems to work for a few small examples. This is where induction comes in.

While it might seem obvious that every level of the tree (which corresponds to a feature in our example; again, see a tutorial on trees if you’re drawing a blank on this term) has twice as many nodes as the level before it, we can’t actually be sure of this. There are some truly bizarre patterns and arbitrary cutoff points in math, where things that seem the same are totally different, or things that seem different are really the same thing in disguise. We’ll see an example of this later when we talk about the PCI and the well-ordering principle. For other examples, check out finding the roots of polynomials (something magical happens between a polynomial of degree 4 and one of degree 5) or graph coloring (between 3 and 4 here).

We have two assumptions in the formula above: that every level of a binary tree has twice as many nodes as the level above it, and that the sum of nodes in every level of the tree is the total number of nodes in the tree. Both of these need to be proved before we can say for sure that the formula above is always true, no matter how many levels our tree has.

So first of all, let’s prove that every level of a tree has twice as many nodes as the level before it, or equivalently, the $i$th level of the tree has $2 \times 2^{i-1} = 2^i$ nodes, using induction. If you’re familiar with the sum examples I assumed you’d seen, this should be fairly straightforward. First we need to decide what to induct on. This decision is pretty much made for us here—it’s the levels—but in more complicated examples, it’s often not obvious what to induct on. Next we need a base case. The obvious candidate is the root. You might be tempted to make the root level 1, but the root level has one node, and $2^1 = 2$. If we make the root level 0, then $2^0 = 1$, and we’re good.

This is a fine point to keep in mind: the base case can start anywhere. It does not have to start at 1 or 0. It could start at 22, or 69 (if you’re that sort of person), but remember that if you put the base case at 69, you have only proved the theorem for when the relevant quantity is 69 or higher. If we chose to start this example at 69, then our proof would only show that if the tree has 69 or more levels, every level past level 69 has twice as many nodes as the previous level. So while you can bring dorm room humor into theoretical CS if you want to, you’re really just hurting yourself in the end, because you haven’t proved that the theorem is true for a tree with 68 or fewer levels. Here we’re starting at 0, so every tree that has a root (i.e. pretty much all of them) has the property that level $i$ has twice the number of nodes of level $i-1$.

So we have our base case. Next is our inductive hypothesis. This is the weirdest step for beginners to understand. It blew my brain for several months after I first learned about induction. That’s because it doesn’t actually involve anything factual. Instead, the inductive hypothesis and inductive conclusion together do pretty much what you do when your friend says he was abducted by aliens and you reply “Assuming aliens existed” (your hypothesis), “they wouldn’t fly across the galaxy just to see you naked” (your conclusion). Induction, though, does more than this, because we have the base case, so we know that the theorem is true in at least one case; it’s as if we said “Here’s an alien. He says he wouldn’t fly across the galaxy to see you naked (base case). Assuming more aliens exist (hypothesis), they wouldn’t fly across the galaxy to see you naked either (conclusion).” In future parts, we’ll see the mathematical formalism behind this, which will show why an induction is a valid proof and not just an invalid generalization based on a single example.

So for our inductive hypothesis for the tree level example, we assume that $N_k$, the number of nodes in level $k$, is equal to $2 \times N_{k-1}$, twice the number of nodes in level $k-1$. And we assume that this was the case for every level back down to level 0, so that $N_{k-1} = 2 \times 2 \times \ldots = 2^{k-1}$ and $N_k = 2 \times 2^{k-1} = 2^k$. Now we want to show that, given this assumption, $N_{k+1} = 2 \times N_k$; i.e. there are twice as many nodes in level $k+1$ as in level $k$.

From here it’s really just algebra. We’re trying to prove that $N_{k+1} = 2^{k+1}$. We can pick one side of this equation and try to transform it into a new form that involves the inductive hypothesis; then we can use the assumption of the inductive hypothesis to show that the conclusion is true. We know that $2^{k+1} = 2 \times 2^k$; given that we assumed $N_k$ was equal to $2^k$, we can now see that $N_{k+1} = 2^{k+1} = 2 \times 2^{k} = 2 \times N_{k}$, which is exactly what we wanted to show: that level $k+1$ has twice as many nodes as level $k$. We have proven the first thing we needed to prove.

Now we need to show the second thing, that in a tree with $n$ levels, there are $\sum_{i = 0}^{n}2^i$ nodes. First the base case: a tree with zero levels (i.e. only a root) has $\sum_{i = 0}^{0} 2^i = 2^0 = 1$ node. Now, for the inductive hypothesis, we assume that in a tree with $k$ levels, there are $\sum_{i=0}^{k} 2^i$ nodes. We then try to show that a tree with $k+1$ levels has $\sum_{i = 0}^{k+1}2^i$ nodes. If $N_{k+1}$ is the number of nodes in a tree with $k+1$ levels, we can show this as follows:

$N_{k+1} = \sum_{i = 0}^{k+1} 2^i$ is what we’re trying to prove.

$\sum_{i = 0}^{k+1} 2^i = \sum_{i=0}^{k} 2^i + 2^{k+1}$

We know that $\sum_{i=0}^{k} 2^i = N_k$ because of our inductive hypothesis, and we know that level $k$ has $2^{k}$ nodes because of what we proved above, so adding the ${k+1}$st new level must have added $2^{k+1}$ nodes. This proves the formula. We can now take it and use it to find out how many features our decision tree can have before it’s too big to fit in memory.

## What was the point?

I walked through that whole example, from conception to culmination, because I wanted to give you some kind of idea about why induction is used in computer science. We saw that the binary tree was something we could keep building by applying the same rules repeatedly (building a new level by giving each leaf in the previous level two children). That sounds a lot like a loop! Indeed, if we were building that tree in a program, we’d probably have a loop that would read a list of car features and build the tree up by adding children, making levels for each feature. All we had to do was number each level in our tree; then we could use induction, and prove that no matter how many levels the tree has, the same properties still apply.

As we’ll see in the next part, you can use induction any time your objects can be numbered in some way using the natural numbers so that there’s an operation to transform the $k$th object into the $k+1$st object, for any value of $k$. This is what makes induction so useful in computer science: it’s basically a way of showing that algorithms with loops or recursion in them are mathematically valid—that if we write a for-loop with 1001 iterations, it does exactly the same thing as a loop with 1000 iterations, just on a slightly larger scale.

## Summary

• Induction is a very important proof technique in computer science, because it captures the idea of doing something repeatedly. Loops and recursion can both be proven correct using induction.
• Proofs by induction have three parts: the base case (“This alien says he wouldn’t travel across the galaxy to see you naked.”), the inductive hypothesis (“Assuming that more than one alien exists…”), and the inductive conclusion (“…they wouldn’t travel across the galaxy to see you naked either.”).