A Comparison of Four Algorithms Textbooks

At some point, you can’t get any further with linked lists, selection sort, and voodoo Big O, and you have to go get a real algorithms textbook and learn all that horrible math, at least a little. But which book? There are tons of them.

I haven’t read every algorithms book out there, but I have read four of them. Maybe my experience with these four can help guide your decision. The four books are Algorithms, by Dasgupta, Papadimitriou, and Vazirani (hereafter called Dasgupta); Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein (hereafter called CLRS); The Algorithm Design Manual, by Steven Skiena (hereafter called Skiena); and The Art of Computer Programming, Volumes 1-3, by Donald Knuth. I’ll do a five-point comparison, going over each book’s prose style, code use, mathematical heaviness, breadth and depth of topics, and position on the continuum between theoretical and practical.

There’s one thing you should know before I start: Dasgupta is available for free online, while Knuth and CLRS aren’t (well, they probably are, but not legally). So that might make your decision for you. I haven’t been able to determine if Skiena is legally available online. The book is posted in either PDF or HTML on a few legit-ish looking sites, but Skiena’s own page for it doesn’t mention anything about it being freely available, so proceed with caution.

Does it have to be one of these four?

Not at all. There are lots of good algorithms books that I’ve never read. You can even learn a lot by just reading CS.SE and Wikipedia. But these four are the ones I’ve personally used. Plenty of people like them, but my picks do reflect my own taste and biases. Even if you decide to go with a different book, though, this overview might at least tell you what features to notice.

Prose style

A good algorithms book will usually explain its topics in three ways: with a natural language prose explanation, with some kind of implementation code, and with mathematical notation. When you’re starting out, the prose is usually the most important part, because it takes you from “What the hell is a binary search?” to “Got it, how do I write actual code for a binary search?”

For me, the winner on prose style has to be Knuth. He’s just a masterful writer. You might have to read some of his sentences twice, but that’s only because he gets across twice as much information as a lesser writer. Knuth’s prose explanations got me to an understanding of several topics I’d despaired of ever getting from other books, like B-trees and merge sort.

Skiena is also an excellent prose stylist. Where Knuth is elegant and flowing, like the John Milton of algorithms, Skiena is direct and sharp, like the Ernest Hemingway of algorithms. Knuth and Skiena are good at different things: Skiena is good at finding a simple, direct way to explain things which are traditionally made more complicated than they need to be. His overview of Big O is amazingly clear and simple, as long as you have at least some memory of calculus. On the other hand, some topics are inherently really complicated, and this is where Knuth shines: he’ll give you several different ways to view a complicated topic, and chances are at least one of them will work for you.

Knuth is much quicker to jump to math, while Skiena mostly eschews math and tries to keep everything intuitive. We’ll discuss that more in the section on mathematical heaviness.

Dasgupta has a pretty good prose style too, more patient and beginner-friendly than either Skiena or Knuth. Sometimes, I found Dasgupta too verbose and felt the authors belabored the obvious too much, but other times that was an asset rather than a weakness (e.g. in the material on Fast Fourier Transforms).

CLRS probably has the weakest prose style of these four books. It’s closer to the standard math text style, written with a formal and distant air, sometimes favoring precision over clarity. It’s strictly functional, but tolerable if you already have some understanding of the topic.

Code use

Knuth’s big weakness is code. He uses two different notations, pseudocode and MIX assembly language, his imaginary assembly for his imaginary computer. I find both of them extremely difficult to follow.

The problems with MIX should be pretty obvious: it’s assembly language, so it’s just hard to read. Not only that, it’s pretty different from the MIPS, ARM, or x86 assembly that modern readers might have seen: it’s designed so it can be implemented on either a binary or a decimal machine, and its bytes hold six bits (or two decimal digits) rather than the eight bits modern programmers expect. Knuth put a ton of effort into creating this imaginary machine and its assembly language. Since it’s made up, MIX assembly is still technically pseudocode; but MIX is to pseudocode as Quenya is to made-up languages. Knuth has since designed MMIX, a replacement modeled on modern RISC assembly languages, which is slated to take over in future editions of TAoCP. I haven’t worked with it, but I imagine it’s still hard to read, since it’s assembly language.

Knuth also uses high-level pseudocode, but I find that hard to read too because it’s organized like an unstructured language with goto statements. If I were planning to implement the algorithms in an unstructured language, it would probably be fine, but I’ve always found that there’s a nontrivial translation process between Knuth’s unstructured pseudocode and structured pseudocode suitable for implementation in a modern language.

CLRS and Dasgupta both use high-level pseudocode that resembles Pascal, although not too slavishly. CLRS expresses some things as if they were methods or fields of an object, in a semi-object oriented way. Skiena also does some of this, but in addition he uses real C code.

A lot of modern books use C or Java throughout, which might be more to some people’s tastes. The four books here reflect mine: I like my algorithms language-independent. I don’t mind how Skiena uses C—he uses it mainly to show how to implement something when there’s a significant gap between theory and practice—but I’m glad he stuck to pseudocode for the most part.

Mathematical Heaviness

In this category, going from heaviest to lightest, we have roughly Knuth > CLRS > Dasgupta > Skiena. Dasgupta and Skiena are pretty close, while there’s a significant gap between them and CLRS. There’s a minor gap between CLRS and Knuth, and it really depends on the topic, but CLRS just didn’t take on as many inherently mathematical topics as Knuth did (especially in Knuth’s Volume 2, Seminumerical Algorithms, where he deals with random number generation and efficient arithmetic).

In general, Dasgupta is geared towards people (like students) who just recently learned a little math and are going to learn a little more, but haven’t internalized it all yet. Skiena is geared towards people (like working programmers, or graduate students) who’ve learned some math and forgotten a lot of it. Skiena is also student-friendly, but he stresses practicality more than anything, and sometimes, in practice, you gotta do some math.

CLRS is also for students, but it’s more at the graduate level, especially the later “Selected Topics” chapters, which can sometimes get pretty heavy. Knuth seems more geared towards graduate students near the end of their Ph.D.s and practicing researchers.

All four books require you to know some combinatorics and a little number theory, and to have at least a vague memory of limits and function growth from calculus. That’s about it for Skiena and Dasgupta, though Skiena seems to expect you to have internalized the math a little more than Dasgupta does. The basic chapters of CLRS are just a little beyond Skiena in math background, but some topics get quite heavily mathematical. Knuth gets pretty heavily mathematical in almost every topic.

I’ve been conflating two things in the discussion so far, since they’re sort of linked: actual knowledge of theorems and axioms in mathematics, and the ability to read mathematical writing and proofs and follow an author’s reasoning (what’s often called “mathematical maturity”). Knuth requires the most technical knowledge of mathematics of the four, and he’ll also stretch your reading ability pretty far; but CLRS will stretch your reading ability too, and Dasgupta is no cakewalk either, although I do think Dasgupta qualifies as a picnic. Skiena doesn’t have any proofs, but he kind of sneaks them in with his “war stories”, which usually read a lot like proofs, extended with discussion of some of the approaches he tried that didn’t work out.

If you’re getting your first algorithms book and you’re a CS undergrad, Dasgupta or Skiena is easiest to follow. CLRS will stretch you a little, but it’s manageable. If you’re a math or physics undergrad, CLRS shouldn’t be too hard and Knuth might be doable.

Breadth and Depth of Topics

Dasgupta is the loser here; not only does the book cover fewer topics than the others, the topics it chooses to cover are poorly organized and somewhat eccentric. I’m not sure why the authors threw in the awful chapter on quantum computing at the end; it’s totally incomprehensible unless you already understand quantum mechanics, which is no mean feat.

CLRS has the best balance of breadth and depth: it covers basic data structures; some more advanced ones like red-black trees, B-trees, and the union-find data structure; algorithmic paradigms like greediness and dynamic programming; graphs, shortest paths, and network flows; sorting; amortized analysis; and an assortment of other topics such as number theoretic algorithms, cryptography, linear programming, string matching, and NP-completeness.

Skiena covers about the first third of CLRS, but he does a lot more with NP-complete problems and how to design approximation schemes for them than CLRS does.

Knuth, of course, is the master of depth. Volume 1 covers math background and fundamental data structures; Volume 2 covers random number generation and arithmetic; Volume 3 covers searching and sorting, going through various sort routines and some more advanced data structures, such as B-trees, as well as developing the whole theory behind hash tables and how to choose hashing functions and table sizes; and Volume 4 covers combinatorial algorithms. Volume 4 is split into three subvolumes; right now, only Volume 4A has actually come out.

If you want to get just one book, I would get Skiena or CLRS. They include all the most important topics both in practice and for undergraduate classes. Skiena, as a bonus, even includes some computational geometry material, in case you’re doing video games or computer graphics.

Theoretical vs Practical

Skiena is the most relentlessly practical of this bunch. Knuth was actually pretty practical at the time Volume 1 came out, but he became less and less practical as time went on because people stopped writing things in assembly or using tape drives. Both give you implementation tips to make your code perform better, although Knuth’s can be hard to capitalize on since you’re not writing in assembly.

CLRS and Dasgupta are both theoretical in the sense that they mostly ignore the computer. Everything in CLRS will work, but sometimes the approach is too “straight from the theory”. I discovered this when I implemented Rabin-Karp following their discussion: after struggling with numerous overflow-related bugs, I put my code on Code Review.SE, and someone suggested a different approach that fixed the performance issues and handled overflow elegantly.

Dasgupta and CLRS are both still good books, and if you’re just starting out learning algorithms, don’t get too caught up on this issue. Write your code in Python or Ruby, some quick and easy language that also ignores the metal. You’ll get the idea of the implementation, and you can deal with the issues around implementing it in some other language later. (Python even looks like the pseudocode in CLRS.)

Conclusion

I probably use CLRS the most, because of its breadth of coverage and because its pseudocode is so much easier to follow than Knuth’s. If I don’t understand the concept or the mathematical underpinnings, I’ll go to Knuth for the deep discussion, and the excellent prose style.

I just got Skiena recently, but in the future, I expect him to usurp CLRS somewhat. He covers most of the same material, but his prose style is more direct and his approach is more practical. Skiena is excellently organized and has a catalogue of problems and their algorithmic solutions in the second half of his book, good for browsing when you’re stumped.

I don’t use Dasgupta that much. Dasgupta was the text for my undergrad algorithms class, and while it was good in that capacity, it’s not really a book that continues to be useful once you’re past the first-semester course, mainly because of the lack of breadth in coverage and the eccentric organization and choice of topics.


What the Heck is Dynamic Programming, and Why Should I Care?: Being the first in a series on dynamic programming

Dynamic programming. It sounds like one of those Agile methodologies, like Extreme Programming. If you look into it expecting to find project management philosophies, though, you’ll quickly become horrified by the amount of weird-looking math that’s involved. Still, quantitative measures of productivity are good, right?

Dynamic programming is actually not dynamic, and it’s not programming, and it’s definitely not an Agile methodology. The name ‘dynamic programming’ is one of the worst, most misleading names in computer science, right alongside ‘computer science’, which makes non-computer scientists think that we fix people’s computers, but we’re really pretentious about it and think fixing people’s computers puts us on the same level as Albert Einstein and Jonas Salk. But the most accurate alternative name, ‘bottom-up recursion with memoization’, is about as imposing as it gets, right alongside ‘first-order homogeneous linear partial differential equation’, or the ‘infinite-dimensional Banach spaces’ that Guy Steele said in Coders at Work that he couldn’t ever get the hang of.

Still, the name ‘bottom-up recursion with memoization’ at least has the advantage of being accurate, so let’s start with that and try to make it less imposing. After we’ve explored the many parts of that sadly unused name, I’ll sketch out a problem of the type that dynamic programming might be good for. With that example in hand, you’ll hopefully have a better understanding of all the parts. Then I’ll talk a little about why you should care.

Bottom-up recursion with memoization

Dynamic programming is recursion. Whether that helps you or not depends on how much you like recursion (most people don’t much like it).

Recursion tends to work on problems with recursive structure, i.e. a structure where big problems can be cut into smaller problems with the same form, which can be cut into even smaller problems, down and down and down, until you reach problems so small that, hey, you know the answer! Recall the factorial: n! = n(n-1)(n-2)\cdots3\cdot2\cdot1. This function also has a recursive definition: 1! = 1; n! = n(n-1)!. This says that the factorial of 1 is 1, and the factorial of any other natural number n is n times the factorial of n-1. So 4! = 4(3!) = 4\cdot3(2!) = 4\cdot3\cdot2(1!) = 4\cdot3\cdot2\cdot1.
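
Here’s that recursive definition transcribed almost word for word into Python (the language the rest of this post uses for examples); the function name factorial is just for illustration, and it assumes n is at least 1, like the definition above:

def factorial(n):
    "Calculates n! from the recursive definition: 1! = 1, n! = n * (n-1)!."
    if n == 1:
        return 1                       # the smallest problem: we just know the answer
    else:
        return n * factorial(n - 1)    # cut the problem down to a smaller one

print(factorial(4))  # prints 24, i.e. 4 * 3 * 2 * 1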

Since dynamic programming is recursion, it also tends to work on problems with recursive structure. It’s often applied to problems with optimal substructure. This means the problem has a recursive structure, but you can cut a problem into smaller problems in lots of different ways. Each of these smaller problems will have its own solution; however, if you can find the best answer among the answers to the smaller problems, you can use that knowledge to get the best answer to the bigger problem. “Best” here means “best” in whatever sense you need it to; you can think of it like giving a solution a rank on a scale of one to five stars, depending on how well it answers your original question. So in problems with optimal substructure, there are usually lots of correct answers, but they won’t always be equally good answers according to your criteria of what makes up a good answer.

Dynamic programming is a technique for solving optimization problems. An optimization problem is one with lots of correct answers, but with some answers better than others according to some measure other than right vs. wrong. Finding the shortest path in a graph, finding the longest increasing subsequence of a sequence of numbers, and pricing some items to get maximum profit are all optimization problems. They involve words like “shortest”, “longest”, “maximum” that imply some kind of measure of quality on the answers you find. And all of these problems can be solved with dynamic programming.

With memoization

Memoization is actually the easiest part of dynamic programming to understand. If you understand hash tables, you understand memoization. You don’t even have to understand how hash tables work; you just have to understand the idea that they can look up stuff really fast.

In normal recursion, your intermediate results are all stored on the call stack. Let’s look at our good friend the Fibonacci sequence in Python.

def fib(n):
    "Calculates the nth Fibonacci number."
    if n < 2:
        return n                       # F_0 = 0 and F_1 = 1, no recursion needed
    return fib_helper(0, 1, n - 2)     # 0 and 1 are the first two.

def fib_helper(a, b, n):
    "Does the actual recursion for fib(n)."
    if n == 0:
        return a + b
    else:
        return fib_helper(b, a + b, n - 1)

That works, but what if we need to calculate all the Fibonacci numbers between 1000 and 2000? Then we might end up writing something like this:

for i in range(2000):
    f = fib(i)
    if 1000 < f < 2000:
        print(f)
    if f >= 2000:    # past the range we care about, so stop
        break

Every time we call fib(i) inside that loop, it calculates the Fibonacci numbers from 0 to i, even though the first i-1 Fibonacci numbers haven’t changed at all since the last time we calculated them. We can’t start at the first Fibonacci number above 1000 because we have no idea what value to give fib to get it. Even if we looked it up somewhere, we’d still end up calculating all the Fibonacci numbers below it a bunch of times and then throwing them away.

This is where memoization comes in. Because looking things up in hash tables (or lists, or arrays) is so fast, we can save a lot of time if we save our calculations in a data structure and look them up when needed. We only calculate a value if it’s not stored in the data structure already.

Here’s a memoizing version of the Fibonacci numbers.

memo = {0: 0, 1: 1}

def fib_memo(n):
    "Calculates the nth Fibonacci number with memoization."
    if n in memo:
        return memo[n]
    else:
        # Pick up the bottom-up recursion from the largest number already
        # in the memo, instead of starting over at F_0 and F_1.
        start = max(memo)
        return fib_memo_helper(memo[start - 1], memo[start], start, n)

def fib_memo_helper(a, b, i, n):
    "Builds up from F_{i-1} = a and F_i = b, memoizing each new number."
    if i == n:
        return b
    else:
        memo[i + 1] = a + b
        return fib_memo_helper(b, a + b, i + 1, n)

We first declare a memo, a dictionary whose keys are values of n and whose values are the values of F_n, the nth Fibonacci number. Whenever fib_memo is called, it checks the memo to see if we’ve ever calculated F_n before and stored it there. If so, it just returns that stored value. If not, it calls fib_memo_helper to build up to F_n, starting from the largest Fibonacci numbers already in the memo, and the helper stores each new number it calculates in the memo for future reference.

Now when we run the code

for i in range(2000):
    f = fib_memo(i)
    if 1000 < f < 2000:
        print(f)
    if f >= 2000:    # past the range we care about, so stop
        break

then instead of recalculating a bunch of Fibonacci numbers each time, we calculate them once, and every time we need them after that, we get them out of a dictionary. Dictionary lookups are super-fast; Python implements its dictionaries as open-address hash tables in C, which means that looking something up is basically a single array access.

That’s pretty much all there is to memoization; it just means storing stuff in a data structure with fast lookup in case you need it again, so you can look it up instead of recalculating.
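
If you want to see the payoff for yourself, a rough timing sketch along these lines (assuming the fib and fib_memo definitions above live in the same file, and using the standard library’s timeit module) should show the memoized loop pulling well ahead, since after the first pass every number is just a dictionary lookup:

import timeit

# Rough, unscientific benchmark; exact numbers will vary by machine.
naive_time = timeit.timeit("for i in range(2, 30): fib(i)",
                           globals=globals(), number=1000)
memo_time = timeit.timeit("for i in range(2, 30): fib_memo(i)",
                          globals=globals(), number=1000)
print(f"naive:    {naive_time:.3f} s")
print(f"memoized: {memo_time:.3f} s")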

Bottom-up recursion with memoization

Most common cases of recursion that you see in programming classes are cases of top-down recursion. This means that you start with a big instance of the problem and cut it down until you get instances that are easy to solve.

For instance, with the factorial, you start with a big instance, n!. You cut it into a smaller instance of the problem by rewriting it as n \cdot (n-1)!—you replace an instance with a big factorial by an instance with a smaller factorial. Then you cut down (n-1)! by replacing it with an instance that has an even smaller factorial, (n-2)!.

Mergesort is another example of top-down recursion; you start with a big array and cut it in half. Then you cut the halves in half. Then you cut the fourths in half, and so on, until you have a bunch of arrays with one element, which are already sorted, because there’s nothing to compare them to.
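
To show that top-down shape in code, here’s a rough Python sketch of mergesort (for illustration only; in practice you’d just call Python’s built-in sorted):

def mergesort(items):
    "Sorts a list top-down: split it in half, sort each half, merge the results."
    if len(items) <= 1:
        return items                  # a list of 0 or 1 elements is already sorted
    mid = len(items) // 2
    left = mergesort(items[:mid])     # cut the problem in half...
    right = mergesort(items[mid:])    # ...and solve each half the same way
    return merge(left, right)

def merge(left, right):
    "Merges two sorted lists into one sorted list."
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])           # one of these tails is empty; append the other
    result.extend(right[j:])
    return result

print(mergesort([5, 2, 4, 7, 1, 3]))  # prints [1, 2, 3, 4, 5, 7]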

But think about how the Fibonacci function above worked. We didn’t start with a big instance and cut it down into smaller instances. Instead, we started with two small instances—calculating F_0 and F_1, which we already knew were 0 and 1—and built them up into a bigger instance. We could do this because of the formula F_n = F_{n-1} + F_{n-2}, which tells us how to get a bigger Fibonacci number from two smaller ones. In the fib_helper function, in order to know when to stop, I introduced the parameter n, which counts down; when it reaches 0, the recursion stops. (The memoized version’s helper instead carries an index i that counts up until it reaches n.)

This type of recursion, where you start with small instances and build up bigger ones, is called bottom-up recursion. Dynamic programming algorithms are pretty much all bottom-up recursive, and the reason is optimal substructure. If a problem has optimal substructure, it means that the best solution to a big instance can be computed using the best solutions to smaller instances. We need to know the results from smaller instances to calculate the results of bigger instances.

In the Fibonacci number problem, the formula for F_n depends on the previous two numbers, so to calculate F_n, we have to have F_{n-1} and F_{n-2}. In some problems with optimal substructure, the solution to an n-sized instance depends on the solutions to all the smaller instances, from n-1 all the way down to 1.

Many problems can be solved with either top-down or bottom-up recursion; for instance, we could calculate n! using bottom-up recursion with a function like this:

def bottom_up_factorial(n):
    "Calculates n! by building up from 1!."
    return factorial_helper(1, 1, n)

def factorial_helper(a, x, n):
    "Builds up: a is the number we've reached so far, and x is a!."
    if a >= n:
        return x
    else:
        return factorial_helper(a + 1, x * (a + 1), n)

And here’s a function that calculates Fibonacci numbers with top-down recursion:

def fib(n):
    "Calculates the nth Fibonacci number with top-down recursion."
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)

You can even memoize top-down recursion, although it doesn’t always buy you anything. I don’t know of any problems that really benefit from using top-down recursion with memoization over using dynamic programming.
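
For the curious, here’s what memoized top-down recursion looks like for Fibonacci; the standard library’s functools.lru_cache does the memo bookkeeping for you, so the body stays the same as the top-down version above (the name fib_top_down is just to avoid clashing with the earlier definitions):

from functools import lru_cache

@lru_cache(maxsize=None)  # remembers every (n, result) pair it has computed
def fib_top_down(n):
    "Calculates the nth Fibonacci number with memoized top-down recursion."
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib_top_down(n - 1) + fib_top_down(n - 2)

print(fib_top_down(30))  # prints 832040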

Bottom-up recursion with memoization

The hook that ties together these three things—bottom-up, recursion, and memoization—is optimal substructure. This will probably become more clear next time when we look at a problem, but for now just try to understand as much as you can.

Having optimal substructure implies that instances of a problem are somehow composed of smaller instances of that problem. This leads us to recursion.

In most of the problems solved by dynamic programming, the same smaller instances of the problem get examined over and over again when we find the answers for larger instances. We use memoization to make this more efficient—instead of recomputing the answers every time we need them, we memoize them the first time and just look them up after that.

We benefit from using bottom-up recursion in this situation for two reasons: it’s more efficient, and it’s easier to understand. If we start with the small instances and calculate their solutions, then we have all those solutions memoized. When we start working on a bigger instance, we know all the smaller answers we need are already stored in the memo. If we did top-down recursion, some of the smaller solutions we needed might not have been memoized yet, and we would have to stop and calculate them. We’d need extra checks to make sure we had a solution, extra recursive calls to calculate the ones we didn’t, and the program flow would get a lot messier. With bottom-up recursion, every value we need is already memoized, so it’s fast and the program flow is cleaner.
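
To make that a little more concrete, here’s a small sketch of one of the optimization problems mentioned earlier, the longest increasing subsequence. The function name lis_length is my own, and a plain loop stands in for the bottom-up recursion; the point is that by the time we work on position i, every smaller position’s answer is already sitting in the memo:

def lis_length(seq):
    "Length of the longest increasing subsequence of seq (a rough sketch)."
    best = {}  # best[i] = length of the longest increasing subsequence ending at i
    for i in range(len(seq)):
        # Bottom-up: every j < i is already memoized in best, so we just look
        # those answers up instead of recomputing them.
        best[i] = 1 + max((best[j] for j in range(i) if seq[j] < seq[i]),
                          default=0)
    return max(best.values(), default=0)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # prints 4 (e.g. 1, 4, 5, 9)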

That’s vaguely why problems with optimal substructure are so nicely solved by using bottom-up recursion with memoization. Next time we’ll go over some example problems, and hopefully this will all be more clear.

Postscript

I want to talk a little about why you should care about dynamic programming.

Dynamic programming is an advanced algorithm design technique. If you don’t think you need it, you probably don’t. But there are some very nasty problems that we really want to solve, and dynamic programming turns out to be perfect for them.

Weirdly, dynamic programming turns out to be a really good technique for optimization problems on strings. If you’ve ever used TeX, you might have noticed what nice paragraphs it produces. In Digital Typography, Donald Knuth talks a little about his algorithm for breaking text into lines that form nice paragraphs. He frames it as an optimization problem: given a big mass of text, there are tons of ways you can choose to break it into lines that form a paragraph, and the problem is to find the “best” way. Knuth invented a way of measuring the “badness” of putting a break in a particular spot. The badness depends on things like how much extra space has to be added to stretch the line out to the margin, and whether a word has to be split in half with a hyphen to put a break there. The algorithm uses dynamic programming to build lines bottom-up and find the set of line breaks with the smallest total badness for the paragraph.

Another place where dynamic programming is useful is bioinformatics, the analysis of genetic sequences gathered in biological research. DNA is made of molecules called nucleotides, strung together into chains. What distinguishes one nucleotide from another is its nitrogenous base; a nucleotide can have one of four bases: adenine, guanine, cytosine, or thymine. RNA is similar, but it has uracil instead of thymine.

We can represent a strand of DNA as a string of the letters A, G, C, and T. These strings can be megabytes or gigabytes of text. (You can think of them as a kind of biological machine code, except that instead of bits with two states, 0 and 1, each symbol has four possible states.) There are various tasks we might want to do with these strings, like searching for matches between them—maybe we’ve found a particular gene in gorillas, and we want to know whether humans have it too.

Dynamic programming is an excellent way to analyze these genetic strings. If you’re interested in learning more about algorithms on genetic data, the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, is very good. Gusfield taught my undergraduate algorithms class at UC Davis and is responsible for most of what I know about dynamic programming. He’s an excellent teacher, good at delivering clear explanations, and well known in the bioinformatics field, and the book is just as clear as his lectures, and beautifully typeset.

(I’d like to say I was hanging out with Gusfield and soaking up his knowledge while I was in his class, but the truth is we never spoke. I had a two-hour drive to get home, and his office hours were at six at night.)

Anyway, those are a few reasons why you should care, even if you probably won’t ever have to invent a dynamic programming algorithm.