A Comparison of Four Algorithms Textbooks

At some point, you can’t get any further with linked lists, selection sort, and voodoo Big O, and you have to go get a real algorithms textbook and learn all that horrible math, at least a little. But which book? There are tons of them.

I haven’t read every algorithms book out there, but I have read four of them. Maybe my experience with these four can help guide your decision. The four books are Algorithms, by Dasgupta, Papadimitriou, and Vazirani (hereafter called Dasgupta); Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein (hereafter called CLRS); The Algorithm Design Manual, by Steve Skiena (hereafter called Skiena); and The Art of Computer Programming, Volumes 1-3, by Donald Knuth. I’ll do a five-point comparison, going over the prose style, code use, mathematical heaviness, breadth and depth of topics, and position on the continuum between theoretical and practical of each book.

There’s one thing you should know before I start: Dasgupta is available for free online, while Knuth and CLRS aren’t (well, they probably are, but not legally). So that might make your decision for you. I haven’t been able to determine if Skiena is legally available online. The book is posted in either PDF or HTML on a few legit-ish looking sites, but Skiena’s own page for it doesn’t mention anything about it being freely available, so proceed with caution.

Does it have to be one of these four?

Not at all. There are lots of good algorithms books that I’ve never read. You can even learn a lot by just reading CS.SE and Wikipedia. But these four are the ones I’ve personally used. Many people like one of these four, but they do reflect my taste and biases. But even if you decide to go with a different book, this overview might at least tell you what features to notice.

Prose style

A good algorithms book will usually explain its topics in three ways: with a natural language prose explanation, with some kind of implementation code, and with mathematical notation. When you’re starting out, the prose is usually the most important part, because it takes you from “What the hell is a binary search?” to “Got it, how do I write actual code for a binary search?”

For me, the winner on prose style has to be Knuth. He’s just a masterful writer. You might have to read some of his sentences twice, but that’s only because he gets across twice as much information as a lesser writer. Knuth’s prose explanations got me to understanding on several topics I’d despaired of ever getting with other books, like B-trees and merge sort.

Skiena is also an excellent prose stylist. Where Knuth is elegant and flowing, like the John Milton of algorithms, Skiena is direct and sharp, like the Ernest Hemingway of algorithms. Knuth and Skiena are good at different things: Skiena is good at finding a simple, direct way to explain things which are traditionally made more complicated than they need to be. His overview of Big O is amazingly clear and simple, as long as you have at least some memory of calculus. On the other hand, some topics are inherently really complicated, and this is where Knuth shines: he’ll give you several different ways to view a complicated topic, and chances are at least one of them will work for you.

Knuth is much quicker to jump to math, while Skiena mostly eschews math and tries to keep everything intuitive. We’ll discuss that more in the section on mathematical heaviness.

Dasgupta has a pretty good prose style too, more patient and beginner-friendly than either Skiena or Knuth. Sometimes, I found Dasgupta too verbose and felt the authors belabored the obvious too much, but other times that was an asset rather than a weakness (e.g. in the material on Fast Fourier Transforms).

CLRS probably has the weakest prose style of these four books. It’s closer to the standard math text style, written with a formal and distant air, sometimes favoring precision over clarity. It’s strictly functional, but tolerable if you already have some understanding of the topic.

Code use

Knuth’s big weakness is code. He uses two different notations, pseudocode and MIX assembly language, his imaginary assembly for his imaginary computer. I find both of them extremely difficult to follow.

The problems with MIX should be pretty obvious: it’s assembly language, so it’s just hard to read. Not only that, it’s pretty different from the MIPS, ARM, or x86 assembly that modern readers might have seen. It’s designed to be run on either a decimal or binary architecture and assumes a six-bit word. Knuth put a ton of effort into creating this imaginary machine and its assembly language. Since it’s made up, MIX assembly is still technically pseudocode; but MIX is to pseudocode as Quenya is to pseudo-languages. Newer editions of TAoCP use MMIX, which was redesigned to reflect modern assembly languages. I haven’t seen it yet, but I imagine it’s still hard to read since it’s assembly language.

Knuth also uses high-level pseudocode, but I find that hard to read too because it’s organized like an unstructured language with goto statements. If I were planning to implement the algorithms in an unstructured language, it would probably be fine, but I’ve always found that there’s a nontrivial translation process between Knuth’s unstructured pseudocode and structured pseudocode suitable for implementation in a modern language.

CLRS and Dasgupta both use high-level pseudocode that resembles Pascal, although not too slavishly. CLRS expresses some things as if they were methods or fields of an object, in a semi-object oriented way. Skiena also does some of this, but in addition he uses real C code.

A lot of modern books use C or Java throughout, which might be more to some people’s tastes. These books reflect my taste, and I like my algorithms language-independent. I don’t mind how Skiena uses C—he uses it mainly to show how to implement something when there’s a significant gap between theory and practice. But I’m glad he stuck to pseudocode for the most part.

Mathematical Heaviness

In this category, going from heaviest to lightest, we have roughly Knuth > CLRS > Dasgupta > Skiena. Dasgupta and Skiena are pretty close, while there’s a significant gap between them and CLRS. There’s a minor gap between CLRS and Knuth, and it really depends on the topic, but CLRS just didn’t take on as many inherently mathematical topics as Knuth did (especially in Knuth’s Volume 2, Seminumerical Algorithms, where he deals with random number generation and efficient arithmetic).

In general, Dasgupta is geared towards people (like students) who just recently learned a little math and are going to learn a little more, but haven’t internalized it all yet. Skiena is geared towards people (like working programmers, or graduate students) who’ve learned some math and forgotten a lot of it. Skiena is also student friendly, but he stresses practicality more than anything, and sometimes, in practice, you gotta do some math.

CLRS is also for students, but it’s more at the graduate student level, especially the later “Selected Topics” chapters that can sometimes get pretty heavy. Knuth seems more geared towards graduate students near the end of their Ph.Ds and practicing researchers.

All four books require you to know some combinatorics and a little number theory, and to have at least a vague memory of limits and function growth from calculus. That’s about it for Skiena and Dasgupta, though Skiena seems to expect you to have internalized the math a little more than Dasgupta does. The basic chapters of CLRS are just a little beyond Skiena in math background, but some topics get quite heavily mathematical. Knuth gets pretty heavily mathematical in almost every topic.

I’ve been conflating two things in the discussion so far, since they’re sort of linked: actual knowledge of theorems and axioms in mathematics, and the ability to read mathematical writing and proofs and follow an author’s reasoning (what’s often called “mathematical maturity”). Knuth requires the most technical knowledge of mathematics of the four here, and he’ll also stretch your reading ability pretty far; but CLRS will also stretch your reading ability, and Dasgupta is no cakewalk either, although I do think Dasgupta qualifies as a picnic. Skiena doesn’t have any proofs, but he kind of sneaks in proof-like things with his “war stories”, which usually read quite a bit like proofs, extended with discussion of some of the approaches he tried that didn’t work out.

If you’re getting your first algorithms book and you’re a CS undergrad, Dasgupta or Skiena is easiest to follow. CLRS will stretch you a little, but it’s manageable. If you’re a math or physics undergrad, CLRS shouldn’t be too hard and Knuth might be doable.

Breadth and Depth of Topics

Dasgupta is the loser here; not only does the book cover fewer topics than the others, the topics it chooses to cover are poorly organized and somewhat eccentric. I’m not sure why the authors threw in the awful chapter on quantum computing at the end; it’s totally incomprehensible unless you already understand quantum mechanics, which is no mean feat.

CLRS has the best balance of breadth and depth: it covers basic data structures; some more advanced ones like red-black trees, B-trees, and the union-find data structure; algorithmic paradigms like greediness and dynamic programming; graphs, shortest paths, and network flows; sorting; amortized analysis; and an assortment of other topics such as number theoretic algorithms, cryptography, linear programming, string matching, and NP completeness.

Skiena covers about the first third of CLRS, but he does a lot more with NP complete problems and how to design approximation schemes for them than CLRS does.

Knuth, of course, is the master of depth. Volume 1 covers math background and fundamental data structures; Volume 2 covers random number generation and arithmetic; Volume 3 covers searching and sorting, going through various sort routines and some more advanced data structures, such as B-trees, as well as developing the whole theory behind hash tables and how to choose hashing functions and table sizes; and Volume 4 covers combinatorial algorithms. Volume 4 is split into three subvolumes; right now, only Volume 4A has actually come out.

If you want to get just one book, I would get Skiena or CLRS. They include all the most important topics both in practice and for undergraduate classes. Skiena, as a bonus, even includes some computational geometry material, in case you’re doing video games or computer graphics.

Theoretical vs Practical

Skiena is the most relentlessly practical of this bunch. Knuth was actually pretty practical at the time Volume 1 came out, but he became less and less practical as time went on because people stopped writing things in assembly or using tape drives. Both give you implementation tips to make your code perform better, although Knuth’s can be hard to capitalize on since you’re not writing in assembly.

CLRS and Dasgupta are both theoretical in the sense that they mostly ignore the computer. Everything in CLRS will work, but sometimes their approach is too “straight from the theory”, as I discovered when I implemented Karp-Rabin according to their discussion and put it on Code Review.SE after struggling with numerous overflow-related bugs, only to have someone suggest a different approach that rectified all the performance issues and handled overflow elegantly.

Dasgupta and CLRS are both still good books, and if you’re just starting out learning algorithms, don’t get too caught up on this issue. Write your code in Python or Ruby, some quick and easy language that also ignores the metal. You’ll get the idea of the implementation, and you can deal with the issues around implementing it in some other language later. (Python even looks like the pseudocode in CLRS.)

Conclusion

I probably use CLRS the most, because of its breadth of coverage and because its pseudocode is so much easier to follow than Knuth’s. If I don’t understand the concept or the mathematical underpinnings, I’ll go to Knuth for the deep discussion, and the excellent prose style.

I just got Skiena recently, but in the future, I expect him to usurp CLRS somewhat. He covers most of the same material, but his prose style is more direct and his approach is more practical. Skiena is excellently organized and has a catalogue of problems and their algorithmic solutions in the second half of his book, good for browsing when you’re stumped.

I don’t use Dasgupta that much. Dasgupta was the text for my undergrad algorithms class, and while it was good in that capacity, it’s not really a book that continues to be useful once you’re past the first-semester course, mainly because of the lack of breadth in coverage and the eccentric organization and choice of topics.

Advertisements

First Ode on Induction

Induction is one of the weirdest proof techniques around. It took me forever to fully understand it. It’s complicated; it has all these different parts; and it’s difficult to understand how it even accomplishes its stated goal of proving that something is true for all natural numbers. Most people probably have the following three questions the first time they see induction:

  1. WTF did I just see?!
  2. How did that even work?!
  3. How did they ever come up with this?

I know I did. It’s unfortunate that induction is so hard to understand, because it’s also extremely important for computer science; in fact, probably 90% of all the proofs I’ve ever seen in theoretical computer science were proofs by induction. That’s because induction captures the idea of doing something repeatedly, just like the loops and recursion that are used all over the place in programs. But when my school covered induction (in a lower-division discrete math grab bag course), the teacher just slapped it on the board and said “There it is.” Most of the class just memorized how to do simple induction proofs, and by the time I was in upper-division theory of computation and algorithms classes, the professors were expecting proofs out of us that most of the students had no idea how to write, because they never learned how to apply induction to anything more complicated than \sum_{i=0}^{n} i = \frac{n(n+1)}{2}.

That’s why I’m going to try and give a little insight into induction. Once I knew what I’d just seen, how it even worked, and how anyone ever came up with it, I saw that induction is actually a fascinating and elegant proof technique. I also, eventually, saw why it’s so important for computer science, but most CS classes treat math as a tool, like vi or Git, that you learn on your own time; throw up some formulas and some examples to copy, and everyone goes home and memorizes it. I learned what I learned about induction from math classes, and I’m going to try and put it out there so all CS majors can learn to appreciate induction.

An Example, just to set the stage

First, though, I am going to make a total liar of myself and walk you through an example. That’s because induction has so many parts to it that I want to make sure we’re all on the same page about that before I start going into the background. I’m going to assume that you’ve seen all the simple examples involving sums, and that you’re ready for something a bit more complicated. So let’s talk about binary trees. Actually, let’s talk about full or perfect binary trees. And here one is:

Image

(Source: http://www.csee.umbc.edu/~chang/cs202.f98/readings/trees/trees.html. Go there if you’re unclear at all about trees.)

In textbooks, the statements to be proved are always provided for you, free of charge, but in general you have to actually do some digging to figure out just what you’re going to prove. To make this more real, let’s suppose that our binary tree represents something—for example, we can say it’s a decision tree for some kind of simple AI program that decides whether it likes a car or not based on what features the car has. Each node is a feature; taking the left branch means yes, the car has the feature, while taking the right branch means no, the car doesn’t have the feature. Something like this:

Image

(I know, no one uses CD players anymore, but fitting “stylish adaptive media center with voice-activated controls and iCharging Station” onto the picture just wasn’t worth it.)

You can see that when we add a feature, we don’t just add one new node for that feature. We had just one node for leather seats, but we needed two for air conditioning—one for the reality where we had leather seats, and one for the reality where we didn’t. Then when we added CD player, we needed four nodes—one for the reality where we had leather seats and air conditioning, one for the reality where we had leather seats and no air conditioning, one for the reality where we didn’t have leather seats but had air conditioning, and one for the reality where we had neither leather seats nor air conditioning. You can probably see that if we add another feature—say, power windows—we’ll need eight new nodes: one for leather seats, air conditioning,CD player, power windows; one for leather seats, air conditioning, CD player, no power windows; one for leather seats, air conditioning, no CD player, power windows; one for…

Let’s say that the children of the bottom feature nodes are decision nodes–each one represents a decision of yes, buy the car, or no, don’t buy it. Then we could have something like this:

Image

In this case the decision of whether to buy or not is based only on whether the car has air conditioning, but we could have made a separate decision for every possible combination of features, and then we’d have eight decision nodes down there at the bottom.

We can see just from this small example that the number of nodes we have to store gets big pretty quickly as we add more and more choices, but just how quickly? What we’d like is some kind of formula where we plug in the number of features we want to consider and get back how many nodes that will require. That way, if we know how big a node is and how much memory our computer has, we can figure out how many features we can have before it crashes the computer.

(I hope it’s obvious that having a formula is better than just trying some random number of features, seeing if the program crashes, and trying some random smaller number if it does. Even an approximation formula that gets you in the right ballpark is helpful.)

If you try out some examples—adding features, drawing in the new nodes, and counting them—you’ll probably notice a pattern: each feature requires exactly twice as many nodes as the feature before it. Leather seats only required one node, but air conditioning required 2 \times 1 = 2 nodes, and CD player required 2 \times 2 = 4 nodes.  This makes sense, because each node leads to two children—a yes child and a no child—and all the child nodes of one feature represent the next feature. So how do we turn that into a formula? Well, if we want to know how many nodes there are in the whole tree, we can just add up the number of nodes for all the features. And we know that each feature has twice as many nodes as the feature before it. So if we have f features, we can use something like N = 1 + 2\times 1 + 2\times 2 \times 1 +\ldots + 2\times 2 \times \ldots (f times) \times 1, for N the number of nodes in the tree, right? Or in somewhat cleaned up, more mathy notation:

N = \sum_{i = 0}^{f} 2^i.

Let’s not be hasty

Now that you’re a big time computer scientist, you’re not allowed to assume that something always works just because it seems to work for a few small examples. This is where induction comes in.

While it might seem obvious that every level of the tree (which corresponds to a feature in our example; again, see a tutorial on trees if you’re drawing a blank on this term) has twice as many nodes as the level before it, we can’t actually be sure of this. There are some truly bizarre patterns and arbitrary cutoff points in math, where things that seem the same are totally different, or things that seem different are really the same thing in disguise. We’ll see an example of this later when we talk about the PCI and the well-ordering principle. For other examples, check out finding the roots of polynomials (something magical happens between a polynomial of degree 4 and one of degree 5) or graph coloring (between 3 and 4 here).

We have two assumptions in the formula above: that every level of a binary tree has twice as many nodes as the level above it, and that the sum of nodes in every level of the tree is the total number of nodes in the tree. Both of these need to be proved before we can say for sure that the formula above is always true, no matter how many levels our tree has.

So first of all, let’s prove that every level of a tree has twice as many nodes as the level before it, or equivalently, the ith level of the tree has 2 \times 2^{i-1} = 2^i nodes, using induction. If you’re familiar with the sum examples I assumed you’d seen, this should be fairly straightforward. First we need to decide what to induct on. This decision is pretty much made for us here—it’s the levels—but in more complicated examples, it’s often not obvious what to induct on. Next we need a base case. The obvious candidate is the root. You might be tempted to make the root level 1, but the root level has one node, and 2^1 = 2. If we make the root level 0, then 2^0 = 1, and we’re good.

This is a fine point to keep in mind: the base case can start anywhere. It does not have to start at 1 or 0. It could start at 22, or 69 (if you’re that sort of person), but remember that if you put the base case at 69, you have only proved the theorem for when the relevant quantity is 69 or higher. If we chose to start this example at 69, then our proof would only show that if the tree has 69 or more levels, every level past level 69 has twice as many nodes as the previous level. So while you can bring dorm room humor into theoretical CS if you want to, you’re really just hurting yourself in the end, because you haven’t proved that the theorem is true for a tree with 68 or fewer levels. Here we’re starting at 0, so every tree that has a root (i.e. pretty much all of them) has the property that level i has twice the number of nodes of level i-1.

So we have our base case. Next is our inductive hypothesis. This is the weirdest step for beginners to understand. It blew my brain for several months after I first learned about induction. That’s because it doesn’t actually involve anything factual. Instead, the inductive hypothesis and inductive conclusion together do pretty much what you do when your friend says he was abducted by aliens and you reply “Assuming aliens existed” (your hypothesis), “they wouldn’t fly across the galaxy just to see you naked” (your conclusion). Induction, though, does more than this, because we have the base case, so we know that the theorem is true in at least one case; it’s as if we said “Here’s an alien. He says he wouldn’t fly across the galaxy to see you naked (base case). Assuming more aliens exist (hypothesis), they wouldn’t fly across the galaxy to see you naked either (conclusion).” In future parts, we’ll see the mathematical formalism behind this, which will show why an induction is a valid proof and not just an invalid generalization based on a single example.

So for our inductive hypothesis for the tree level example, we assume that N_k, the number of nodes in level k, is equal to 2 \times N_{k-1}, twice the number of nodes in level k-1. And we assume that this was the case for every level back down to level 0, so that N_{k-1} = 2 \times 2 \times \ldots = 2^{k-1} and N_k = 2 \times 2^{k-1} = 2^k. Now we want to show that, given this assumption, N_{k+1} = 2 \times N_k; i.e. there are twice as many nodes in level k+1 as in level k.

From here it’s really just algebra. We’re trying to prove that N_{k+1} = 2^{k+1}. We can pick one side of this equation and try to transform it into a new form that involves the inductive hypothesis; then we can use the assumption of the inductive hypothesis to show that the conclusion is true. We know that 2^{k+1} = 2 \times 2^k; given that we assumed N_k was equal to 2^k, we can now see that N_{k+1} = 2^{k+1} = 2 \times 2^{k} = 2 \times N_{k}, which is exactly what we wanted to show: that level k+1 has twice as many nodes as level k. We have proven the first thing we needed to prove.

Now we need to show the second thing, that in a tree with n levels, there are \sum_{i = 0}^{n}2^i nodes. First the base case: a tree with zero levels (i.e. only a root) has \sum_{i = 0}^{0} 2^i = 2^0 = 1 node. Now, for the inductive hypothesis, we assume that in a tree with k levels, there are \sum_{i=0}^{k} 2^i nodes. We then try to show that a tree with k+1 levels has \sum_{i = 0}^{k+1}2^i nodes. If N_{k+1} is the number of nodes in a tree with k+1 levels, we can show this as follows:

N_{k+1} = \sum_{i = 0}^{k+1} 2^i is what we’re trying to prove.

\sum_{i = 0}^{k+1} 2^i = \sum_{i=0}^{k} 2^i + 2^{k+1}

We know that \sum_{i=0}^{k} 2^i = N_k because of our inductive hypothesis, and we know that level k has 2^{k} nodes because of what we proved above, so adding the {k+1}st new level must have added 2^{k+1} nodes. This proves the formula. We can now take it and use it to find out how many features our decision tree can have before it’s too big to fit in memory.

What was the point?

I walked through that whole example, from conception to culmination, because I wanted to give you some kind of idea about why induction is used in computer science. We saw that the binary tree was something we could keep building by applying the same rules repeatedly (building a new level by giving each leaf in the previous level two children). That sounds a lot like a loop! Indeed, if we were building that tree in a program, we’d probably have a loop that would read a list of car features and build the tree up by adding children, making levels for each feature. All we had to do was number each level in our tree; then we could use induction, and prove that no matter how many levels the tree has, the same properties still apply.

As we’ll see in the next part, you can use induction any time your objects can be numbered in some way using the natural numbers so that there’s an operation to transform the kth object into the k+1st object, for any value of k. This is what makes induction so useful in computer science: it’s basically a way of showing that algorithms with loops or recursion in them are mathematically valid—that if we write a for-loop with 1001 iterations, it does exactly the same thing as a loop with 1000 iterations, just on a slightly larger scale.

Summary

  • Induction is a very important proof technique in computer science, because it captures the idea of doing something repeatedly. Loops and recursion can both be proven correct using induction.
  • Proofs by induction have three parts: the base case (“This alien says he wouldn’t travel across the galaxy to see you naked.”), the inductive hypothesis (“Assuming that more than one alien exists…”), and the inductive conclusion (“…they wouldn’t travel across the galaxy to see you naked either.”).