Understanding Dynamic Arrays and Linked Lists

Almost all sequential data structures are implemented using either dynamic arrays or linked lists. Both of them are linear structures that grow as you add new items, but sometimes you can really torpedo your program if you use the wrong one. Here I’ll talk about the differences between them, and when one or the other makes sense.

Arrays: Like the Cereal Aisle

Arrays are like the cereal aisle at the grocery store.

[Image: the cereal aisle at a grocery store]

When you’re shopping and you see the cereal you want, it’s easy to just reach out and grab a box, no matter where it is on the shelf. If we want a box of Honey Nut Cheerios, we just find Honey Nut Cheerios and grab it; we don’t even have to glance at other cereals, and we don’t have to move anything or reorganize anything. As we’ll see later, linked lists do not have this property.

If we worked in the store and wanted to switch the positions of two cereals—say we wanted to swap Frosted Cheerios and Multi-Grain Cheerios—that’s also pretty easy. We just take out every box of Frosted Cheerios, put them on a cart, then take out every box of Multi-Grain Cheerios and put them in Frosted Cheerios’s former spot, then take the boxes of Frosted Cheerios off the cart and put them where Multi-Grain was.

On the other hand, it’s hard to add new things to the cereal aisle. Say that General Mills comes out with a new Crispy Baklava Cheerios, and we want to put it next to Honey Nut since they both have a honey theme to them. We’d have to move down whatever is next to Honey Nut in the picture (Chocolate Cheerios?). But then to move down Chocolate, we’d have to move down what’s next to it, in the purple box. And to move down the purple boxes, we’d have to move down the blue boxes next to them, and so on. If the shelf has space left at the end, and we’re not too fussy about where the new cereal goes, it’s pretty fast to stick it in at the end. It won’t be with the other Cheerios, but maybe we can do a swap so that it’s at least in the Cheerios section, even if it’s not right next to Honey Nut. If the shelf is full, we just can’t do it; we have to replace something, or move everything to a new shelf with more space.

This is also like arrays. Arrays are fixed-size. If you run out of space, you have to allocate a new array with more space and move everything over to it. If you want to put something in the middle, you have to move down everything that comes afterward. Both of these things can end up being huge wastes of time if you do a lot of them.
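
Here’s roughly what that shifting looks like in code. This is just a sketch of mine, with a Python list standing in for a pre-allocated block of memory; the names are made up for illustration.

def insert_at(backing, size, index, item):
    """Insert item at index in a fixed-size array by shifting everything
    after it one slot to the right. backing is the pre-allocated block
    (simulated here with a Python list); size is how many slots are in use."""
    if size == len(backing):
        raise MemoryError("the shelf is full; we need a bigger array")
    for i in range(size, index, -1):  # shift from the back so nothing gets overwritten
        backing[i] = backing[i - 1]
    backing[index] = item
    return size + 1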

There’s another way you could deal with arrays having a fixed size: make an array that’s way bigger than you think you’ll ever need. Then you shouldn’t ever run out of space. But this wastes memory; when you make an array of 200 elements to hold a string that contains someone’s first and last names, which usually comes out to less than 40 characters, you’re holding on to memory that could be used for other things. Sure, modern computers have lots and lots of memory, but there’s no need to squander it when there are better options, just for the off chance that you’ll have to handle Finnish names.

Dynamic arrays are arrays that deal with these problems automatically. They grow bigger as you add elements to them, so you don’t have to worry about running out of space. They also balance speed and memory usage; a typical dynamic array implementation sometimes has to allocate new chunks of memory, but it also leaves itself room to grow, while not squandering memory by holding on to tons of space that it won’t need.

C++’s STL vector, Java’s ArrayList, and Python’s list all use dynamic arrays. In Python, you can do this

ls = []
for k in range(1000000):
    ls.append(k)

and it isn’t horrendously slow or wasteful of memory, because Python uses a couple of clever tricks behind the scenes. (On my machine, that code runs in about half a second, and my machine is pretty second-rate.) For one thing, every time a Python list needs to expand, it doesn’t just allocate enough space for one more item; it allocates enough for a few extra.[1] Because it only over-allocates a little, we’re not wasting much memory on space that might never be used; and because it over-allocates at all, we don’t have to allocate new memory and move everything over on every single append. Since allocating and moving are time-consuming, this makes a big difference. If we allocated three extra spaces on each reallocation instead of one, for example, we’d only have to reallocate and move things a third as often.
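
If you’re curious, you can actually watch the over-allocation happen by checking how much memory a list has claimed as it grows. This is just a quick experiment of mine; the exact numbers depend on your Python version and platform.

import sys

ls = []
last_size = 0
for k in range(64):
    ls.append(k)
    size = sys.getsizeof(ls)
    if size != last_size:
        # The allocated block just grew; note the length where it happened.
        print(f"len {len(ls):2d}: {size} bytes allocated")
        last_size = size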

Python also saves on memory allocation by reusing blocks of memory. Sometimes, when a data structure isn’t used anymore in the program, Python doesn’t actually get rid of the memory it uses. Instead, Python puts the memory in a “recycling bin”. The next time it needs memory, it looks in the recycling bin to see if any of the memory blocks there are the size it needs. If so, Python just erases the data in that memory block and stores the new data in it.
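
You can sometimes catch this recycling in the act by looking at object addresses. None of this is guaranteed behavior; it’s just what CPython often happens to do.

a = [1, 2, 3]
print(id(a))   # the address of the list object
del a          # the list is no longer used, so its memory goes back in the bin
b = [4, 5, 6]
print(id(b))   # often the same address: the old block got reused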

C++ and Java use similar schemes to save space and time. If you look at the methods for the vector and ArrayList classes, you’ll find that some of them refer to both a size and a capacity. The size is the number of things currently stored in the array that backs the structure. The capacity is the number of slots allocated in that array. When size == capacity, it’s time to allocate a new block of memory and move everything over.
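
Here’s a stripped-down sketch of that size/capacity bookkeeping. It’s my own toy version in Python (a Python list plays the role of the raw backing array), not what any real library does, but the shape is the same: the backing array doubles whenever it fills up.

class DynamicArray:
    def __init__(self):
        self.capacity = 4                      # slots allocated
        self.size = 0                          # slots actually in use
        self.backing = [None] * self.capacity

    def append(self, item):
        if self.size == self.capacity:
            # Out of room: allocate a bigger block and move everything over.
            self.capacity *= 2
            new_backing = [None] * self.capacity
            for i in range(self.size):
                new_backing[i] = self.backing[i]
            self.backing = new_backing
        self.backing[self.size] = item
        self.size += 1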

Linked Lists: Like Being a Train Conductor

Linked lists are like being on a train.

[Image: a four-car train, the cars coupled together in a chain]

Say you’re the conductor on a train and you need to find a specific passenger because it turns out he’s a stowaway with fake papers and you need to throw him off. You don’t know exactly which car he’s in. You’re at the front of the train. In order to find this passenger, you have to walk back through the cars until you see him. He might be in the second car. He might be in the last car. In the worst case, he might have jumped the train a few stops back because he knew you were on to him, and you have to search the entire train for a passenger who’s not even there.

This is what searching a linked list is like. Since the nodes are hooked together by pointers and are stored at arbitrary places in memory, you have no choice but to look at the data in every node, following the pointers along the list, until you find what you’re looking for or come to the end. It’s not like the cereal aisle, where you can just scan for what you want and grab it right off the shelf.
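
In code, walking the train looks something like this. The Node class and the function name are mine, just for illustration.

class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def find(head, target):
    """Walk the list one node at a time until we see target or run out of cars."""
    node = head
    while node is not None:
        if node.data == target:
            return node
        node = node.next
    return None  # searched the whole train; the passenger isn't aboard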

But being on a train has some advantages, too. Say our engine blows and we need a new locomotive. Just pull into the train yard, unhook the locomotive, drag it into the depot, and then drag on a new one. The caboose is also easy to replace. And if we want to add a new first-class sleeper car in the middle, that’s also not so bad; we can uncouple the cars at the point we want to insert the new car, drag them apart, and use the shunting rails to get the new car moved in between. Then we just couple the new car to both halves of the old train, and we’re done.

Inserting new items in the middle or at the front of a linked list is quick and easy, because all you have to do is reassign some pointers. The pointers are like the train couplings in the example; you just have to reconnect them, and you have a new node in the middle of the list.
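
Reusing the Node class from the sketch above, inserting after a given node is just two pointer assignments, no matter how long the list is:

def insert_after(node, data):
    """Splice a new node in right after node by reassigning two pointers."""
    new_node = Node(data, next=node.next)  # the new car couples to the rear half
    node.next = new_node                   # the front half couples to the new car
    return new_node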

Unlike a dynamic array, which can grow, but makes you pay the huge cost of reallocating and moving everything, a linked list can keep on growing, one node at a time, until it’s filled all of memory, without incurring too much of a penalty. You never have to move anything, because it’s easy to attach a new node at the end or wherever you want it.

Where you use them

The dynamic array types (Python’s list, Java’s ArrayList, and C++’s vector) are usually the workhorse sequential collection of their language. Linked lists are usually used for more specialized collections. Since it’s quick and easy to add things at the beginning or end, and you can keep growing them without moving anything, they’re good for stacks and queues.

Linked lists are also good because their structure is flexible. You can do all sorts of weird things with them. For instance, there is the circular linked list. This is a linked list whose final node has a pointer to the first node, so when you traverse the list, you end up at the beginning again when you’re done.

One advantage of doing things this way is that it reduces the risk of dereferencing a null pointer in languages like C and C++. In a circular list, no node’s next pointer is null, so instead of checking whether the next node is null, you check whether it’s the head. If you get that check wrong, you just wind up back at the head of the list, which is no big deal; get a null check wrong in an ordinary list and your program crashes.

Another advantage of circular linked lists is how clean and quick it is to reuse space. Memory allocation is fairly expensive, so there might be times when you want to keep the same nodes around and just reassign their data. You can make a queue that reuses nodes with a circular linked list. Keep around two pointers: one to the head, and one to the node where the next item will be written. When you remove an item from the queue, just point the head pointer at the next node; the old node stays in the ring as an “empty” node, ready to be reused. When you insert, check whether the insertion point has caught up with the head; if it has, every node in the ring is in use and you have to allocate a fresh one. This sort of deletion is called lazy deletion, because you haven’t really deleted anything; you’ve just marked the node unused (by moving the head pointer past it).
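
Here’s a sketch of that kind of queue in Python, again with a Node class of my own; a real implementation would add error handling and probably a size counter.

class Node:
    def __init__(self, data=None):
        self.data = data
        self.next = None

class CircularQueue:
    """A queue on a circular linked list that reuses nodes (lazy deletion)."""

    def __init__(self):
        node = Node()
        node.next = node   # start with a one-node ring
        self.head = node   # next item to remove
        self.tail = node   # node where the next item will be written

    def enqueue(self, data):
        self.tail.data = data
        if self.tail.next is self.head:
            # No unused node left in the ring: splice in a fresh one.
            new_node = Node()
            new_node.next = self.head
            self.tail.next = new_node
        self.tail = self.tail.next

    def dequeue(self):
        if self.head is self.tail:
            raise IndexError("dequeue from an empty queue")
        data = self.head.data
        self.head = self.head.next  # lazy deletion: the old node stays in the ring
        return data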

When it comes to weird things you can do with linked lists, though, nothing beats the skip list. But you’ll have to read about that on your own, because I’m out of time. I hope this post has shed some light on the differences between the two common linear data structures.

  1. See this blog for more on Python’s list implementation.

What the Heck is Dynamic Programming, and Why Should I Care?: Being the first in a series on dynamic programming

Dynamic programming. It sounds like one of those Agile methodologies, like Extreme Programming. If you look into it expecting to find project management philosophies, though, you’ll quickly become horrified by the amount of weird-looking math that’s involved. Still, quantitative measures of productivity are good, right?

Dynamic programming is actually not dynamic, and it’s not programming, and it’s definitely not an Agile methodology. The name ‘dynamic programming’ is one of the worst, most misleading names in computer science, right alongside ‘computer science’, which makes non-computer scientists think that we fix people’s computers, but we’re really pretentious about it and think fixing people’s computers puts us on the same level as Albert Einstein and Jonas Salk. But the only other name considered for dynamic programming was ‘bottom-up recursion with memoization’, and that’s about as imposing as it gets, right alongside ‘first order homogeneous linear partial differential equation’, or the ‘infinite-dimensional Banach spaces’ that Guy Steele said in Coders at Work that he couldn’t ever get the hang of.

Still, the name ‘bottom-up recursion with memoization’ at least has the advantage of being accurate, so let’s start with that and try to make it less imposing. After we’ve explored the many parts of that sadly unused name, I’ll sketch out a problem of the type that dynamic programming might be good for. With that example in hand, you’ll hopefully have a better understanding of all the parts. Then I’ll talk a little about why you should care.

Bottom-up recursion with memoization

Dynamic programming is recursion. Whether that helps you or not depends on how much you like recursion (most people don’t much like it).

Recursion tends to work on problems with recursive structure, i.e. a structure where big problems can be cut into smaller problems with the same form, which can be cut into even smaller problems, down and down and down, until you reach problems so small that, hey, you know the answer! Recall the factorial: n! = n(n-1)(n-2)\cdots3\cdot2\cdot1. This function also has a recursive definition: 1! = 1; n! = n(n-1)!. This says that the factorial of 1 is 1, and the factorial of any other natural number n is n times the factorial of n-1. So 4! = 4(3!) = 4\cdot3(2!) = 4\cdot3\cdot2(1!) = 4\cdot3\cdot2\cdot1.
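
In Python, that definition translates almost word for word:

def factorial(n):
    "n! is n times (n-1)!, until we bottom out at the base case 1! = 1."
    if n <= 1:
        return 1
    return n * factorial(n - 1)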

Since dynamic programming is recursion, it also tends to work on problems with recursive structure. It’s often applied to problems with optimal substructure. This means the problem has a recursive structure, but you can cut a problem into smaller problems in lots of different ways. Each of these smaller problems will have its own solution; however, if you can find the best answer among the answers to the smaller problems, you can use that knowledge to get the best answer to the bigger problem. “Best” here means “best” in whatever sense you need it to; you can think of it like giving a solution a rank on a scale of one to five stars, depending on how well it answers your original question. So in problems with optimal substructure, there are usually lots of correct answers, but they won’t always be equally good answers according to your criteria of what makes up a good answer.

Dynamic programming is a technique for solving optimization problems. An optimization problem is one with lots of correct answers, but with some answers better than others according to some measure other than right vs. wrong. Finding the shortest path in a graph, finding the longest increasing subsequence of a sequence of numbers, and pricing some items to get maximum profit are all optimization problems. They involve words like “shortest”, “longest”, “maximum” that imply some kind of measure of quality on the answers you find. And all of these problems can be solved with dynamic programming.

With memoization

Memoization is actually the easiest part of dynamic programming to understand. If you understand hash tables, you understand memoization. You don’t even have to understand how hash tables work; you just have to understand the idea that they can look up stuff really fast.

In normal recursion, your intermediate results are all stored on the call stack. Let’s look at our good friend the Fibonacci sequence in Python.

def fib(n):
    "Calculates the nth Fibonacci number."
    if n < 2:
        return n  # F_0 = 0 and F_1 = 1; the helper handles the rest
    return fib_helper(0, 1, n-2) # 0 and 1 are the first two.

def fib_helper(a, b, n):
    "Does the actual recursion for fib(n)."
    if n == 0:
        return a+b
    else:
        return fib_helper(b, a+b, n-1)

That works, but what if we need to calculate all the Fibonacci numbers between 1000 and 2000? Then we might end up writing something like this:

for i in range(30):  # the Fibonacci numbers pass 2000 long before i reaches 30
    f = fib(i)
    if 1000 < f < 2000:
        print(f)

Every time we call fib(i) inside that loop, it calculates the Fibonacci numbers from 0 to i, even though the first i-1 Fibonacci numbers haven’t changed at all since the last time we calculated them. We can’t start at the first Fibonacci number above 1000 because we have no idea what value to give fib to get it. Even if we looked it up somewhere, we’d still end up calculating all the Fibonacci numbers below it a bunch of times and then throwing them away.

This is where memoization comes in. Because looking things up in hash tables (or lists, or arrays) is so fast, we can save a lot of time if we save our calculations in a data structure and look them up when needed. We only calculate a value if it’s not stored in the data structure already.

Here’s a memoizing version of the Fibonacci numbers.

memo = {0: 0, 1: 1}

def fib_memo(n):
    "Calculates Fibonacci sequence with memoization."
    if n in memo:
        return memo[n]
    else:
        memo[n] = fib_helper(0, 1, n - 2)
        return memo[n]

def fib_helper(a, b, n):
    if n <= 0:
        return a + b
    else:
        return fib_helper(b, a+b, n-1)

We first declare a memo, a dictionary whose keys are values of n and whose values are the values of F_n, the nth Fibonacci number. Whenever fib_memo is called, it checks the memo to see if we’ve ever calculated F_n before and stored it there. If so, it just returns that stored value. If not, it calls fib_helper to calculate F_n and stores it in the memo for future reference.

Now when we run the code

for i in range(30):
    f = fib_memo(i)
    if 1000 < f < 2000:
        print(f)

then any Fibonacci number we’ve already calculated comes straight out of the dictionary instead of being recomputed. (As written, fib_helper still builds each brand-new value from the bottom the first time we ask for it; a slightly smarter version would also record the intermediate numbers it passes on the way up.) Dictionary lookups are super-fast; Python implements its dictionaries as open-address hash tables in C, which means that looking something up is basically a single array access.

That’s pretty much all there is to memoization; it just means storing stuff in a data structure with fast lookup in case you need it again, so you can look it up instead of recalculating.

Bottom-up recursion with memoization

Most common cases of recursion that you see in programming classes are cases of top-down recursion. This means that you start with a big instance of the problem and cut it down until you get instances that are easy to solve.

For instance, with the factorial, you start with a big instance, n!. You cut it into a smaller instance of the problem by rewriting it as n \cdot (n-1)!—you replace an instance with a big factorial by an instance with a smaller factorial. Then you cut down (n-1)! by replacing it with an instance that has an even smaller factorial, (n-2)!.

Mergesort is another example of top-down recursion; you start with a big array and cut it in half. Then you cut the halves in half. Then you cut the fourths in half, and so on, until you have a bunch of arrays with one element, which are already sorted, because there’s nothing to compare them to. Then you merge the sorted pieces back together on the way up, and the merged whole comes out sorted.
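
For reference, here’s a plain (and not at all optimized) mergesort in Python; the splitting is the top-down recursion, and the merging at the end is where the sorting actually happens.

def mergesort(xs):
    """Top-down recursion: split in half, sort each half, merge them back."""
    if len(xs) <= 1:
        return xs  # zero or one element: already sorted
    mid = len(xs) // 2
    left = mergesort(xs[:mid])
    right = mergesort(xs[mid:])
    # Merge the two sorted halves back into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]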

But think about how the Fibonacci function above worked. We didn’t start with a big instance and cut it down into smaller instances. Instead, we started with two small instances—calculating F_0 and F_1, which we already knew were 0 and 1—and built them up into a bigger instance. We could do this because of the formula F_n = F_{n-1} + F_{n-2}, which tells us how to get a bigger Fibonacci number from two smaller ones. In the fib_helper function, in order to know when to stop, I introduced the parameter n, which counts down. When it reaches 0, the recursion stops.

This type of recursion, where you start with small instances and build up bigger ones, is called bottom-up recursion. Dynamic programming algorithms are pretty much all bottom-up recursive, and the reason is optimal substructure. If a problem has optimal substructure, it means that the best solution to a big instance can be computed using the best solutions to smaller instances. We need to know the results from smaller instances to calculate the results of bigger instances.

In the Fibonacci number problem, the formula for F_n depends on the previous two numbers, so to calculate F_n, we have to have F_{n-1} and F_{n-2}. In some problems with optimal substructure, the solution to an n-sized instance depends on the solutions to all the smaller instances, from n-1 all the way down to 1.

Many problems can be solved with either top-down or bottom-up recursion; for instance, we could calculate n! using bottom-up recursion with a function like this:

def bottom_up_factorial(n):
    return factorial_helper(1, 1, n)

def factorial_helper(a, x, n):
    if a >= n:  # >= rather than == so that 0! comes out right too
        return x
    else:
        return factorial_helper(a+1, x*(a+1), n)

And here’s a function that calculates Fibonacci numbers with top-down recursion:

def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

You can even memoize top-down recursion, although it doesn’t always buy you anything. I don’t know of any problems that really benefit from using top-down recursion with memoization over using dynamic programming.
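
Just so you can see what that looks like, here’s a memoized top-down version of the Fibonacci numbers. This is a sketch of mine; Python’s functools.lru_cache will do the same bookkeeping for you.

memo = {0: 0, 1: 1}

def fib_top_down(n):
    "Top-down Fibonacci: memoize each number the first time we compute it."
    if n not in memo:
        memo[n] = fib_top_down(n - 1) + fib_top_down(n - 2)
    return memo[n]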

Bottom-up recursion with memoization

The hook that ties together these three things—bottom-up, recursion, and memoization—is optimal substructure. This will probably become more clear next time when we look at a problem, but for now just try to understand as much as you can.

Having optimal substructure implies that instances of a problem are somehow composed of smaller instances of that problem. This leads us to recursion.

In most of the problems solved by dynamic programming, the same smaller instances of the problem get examined over and over again when we find the answers for larger instances. We use memoization to make this more efficient—instead of recomputing the answers every time we need them, we memoize them the first time and just look them up after that.

We benefit from using bottom-up recursion in this situation for two reasons: it’s more efficient, and it’s easier to understand. If we start with the small instances and calculate their solutions, then we have all those solutions memoized. When we start working on a bigger instance, we know all the smaller answers we need are already stored in the memo. If we did top-down recursion, it might be the case that some of the smaller solutions we needed hadn’t been memoized yet. Then we would have to stop and calculate them. We’d need extra checks to make sure we had the solution, and extra recursive calls to calculate the ones we didn’t, and the program flow would be a lot messier. With bottom-up recursion, every value we need is already memoized, so it’s fast and the program flow is cleaner.

That’s vaguely why problems with optimal substructure are so nicely solved by using bottom-up recursion with memoization. Next time we’ll go over some example problems, and hopefully this will all be more clear.

Postscript

I want to talk a little about why you should care about dynamic programming.

Dynamic programming is an advanced algorithm design technique. If you don’t think you need it, you probably don’t. But there are some very nasty problems that we really want to solve, and dynamic programming turns out to be perfect for them.

Weirdly, dynamic programming turns out to be a really good technique for optimization problems on strings. If you’ve ever used TeX, you might have noticed that it makes really nice paragraphs for you when you output to PDF. In Digital Typography, Donald Knuth talks a little about his algorithm for breaking up text into lines that form nice paragraphs. He frames it as an optimization problem: given a big mass of text, there are tons of ways you can choose to break the text into lines that form a paragraph. The problem is to find the “best” way. Knuth invented a way of measuring the “badness” of putting a break in a particular spot. The badness depends on things like how much extra space would need to be added to make the line stretch to the edge of the margin, and whether you have to split a word in half with a hyphen to put a break there. The algorithm uses dynamic programming to build up candidate paragraphs line by line, bottom-up, and finds the one with the smallest total badness over all its line breaks.

Another place where dynamic programming is useful is bioinformatics. This field involves the analysis of genetic sequences gathered in biological research. DNA is made of molecules called nucleotides, strung together into chains. What distinguishes one nucleotide from another is its nitrogenous base; a nucleotide can have one of the four bases adenine, guanine, cytosine, and thymine. RNA is similar, but it has uracil instead of thymine.

We can represent a strand of DNA as a string of the letters A, G, C, and T. These strings can be megabytes or gigabytes of text. (You can think of these strings like biological machine code; instead of each bit having the two states, 0 and 1, they have four states.) There are various tasks we might want to do with these strings, like searching for matches between them—maybe we’ve found a particular gene in gorillas, and we want to know if humans also have it.
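
As a tiny taste of that kind of algorithm, here’s the classic dynamic program for the length of the longest common subsequence of two strings, which is the skeleton behind a lot of sequence-comparison methods. (This is just the textbook version, not anything from Gusfield’s book specifically.)

def lcs_length(s, t):
    """Length of the longest common subsequence of s and t, computed bottom-up.
    table[i][j] holds the answer for the prefixes s[:i] and t[:j]; the table
    itself is the memo, filled in from small prefixes to large ones."""
    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(s)][len(t)]

print(lcs_length("GATTACA", "GCATGCT"))  # prints 4 ("GATC" is one such subsequence)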

Dynamic programming is an excellent way to analyze these genetic strings. If you’re interested in learning more about algorithms on genetic data, the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, is very good. Gusfield taught my undergraduate algorithms class at UC Davis and is responsible for most of what I know about dynamic programming. He’s an excellent teacher, good at delivering clear explanations, and somewhat well known in the bioinformatics field; the book is just as clear as his lectures, and beautifully typeset.

(I’d like to say I was hanging out with Gusfield and soaking up his knowledge while I was in his class, but the truth is we never spoke. I had a two-hour drive to get home, and his office hours were at six at night.)

Anyway, those are a few reasons why you should care, even if you probably won’t ever have to invent a dynamic programming algorithm.

Rhapsody in 4Clojure: map-indexed, mapcat, lazy-seq, and lazy-cat

No, I haven’t lost my house. I’ve just become addicted to the site 4clojure.com. If you’ve ever used any of the “koan” sites for learning programming languages, it’s similar to those. I’d never used one before I came to it, and I never will again. It’s too dangerous.

4clojure offers a series of small problems to solve in Clojure. It starts with the most basic of the basic, things like learning which values are truthy and which are falsy; goes through easy stuff like reversing a sequence or finding the penultimate value of a vector; leads into uses for advanced functions like trampoline and juxt; has you implement many of the most important library functions on your own; and finally starts you on harder problems like finding out whether a graph is connected, generating the language of a DFA, and implementing Conway’s Game of Life. I’ve become obsessive about this site, spending hours working on the problems when I should be doing something else. It’s ruining my life.

Ever since I first learned Clojure, I’ve been discouraged from using loop and recur, so I tried really hard to avoid them when I was doing the problems. They’re a lot uglier than direct recursion in Scheme or Haskell, and they can make the code grow in both size and complexity, which are both good reasons to stay away from them when you can. Sometimes you just can’t, but the problems on 4clojure are a great way to learn what your alternatives are. And that’s the theme of this post: four functions I’d never heard of that can get you out of scrapes where it seems like you just have to use loop and recur or Scheme-style direct recursion.

map-indexed

map is a great function. It’s simple, quick, convenient. for is also a great function, although you have to work pretty hard to get to know it—it’s more of a dark, complex loner that you’re always finding out new things about.

Sometimes, despite map‘s greatness, you really do need to know what the index of your item was in the collection you’re mapping over. I used to cover those cases with for, like this:

; Get every fourth item from the collection s
(for [i (range (count s))
      :when (== 0 (mod i 4))]
  (nth s i))

That’s kind of hard to read, though, and it seems wasteful somehow—we’re generating a whole other collection just to index our first collection. And we’re looking up every item we want by index with nth. If s is a vector, that’s probably okay, but Clojure lets you treat all sequence types almost identically, so what if it’s something with horrible linear-time lookups?

This is where map-indexed comes in. map-indexed works just like map, except that it takes a function of two arguments. The first argument is the second argument’s index in the collection. To see clearly how it works, run this code:

(map-indexed #(vector %1 %2) '(a b c d e))
; Returns ([0 a] [1 b] [2 c] [3 d] [4 e])

To get every fourth item like we did above using for, you can do this:

(->> s
     (map-indexed #(when (== 0 (mod %1 4)) %2))
     (remove nil?))

To me, it’s more obvious what that code is doing than the version with for. We’re avoiding the creation of a separate indexing collection, and we even got to use our snazzy ->> macro. There’s probably still a better way to do this, but this shows off the usefulness of map-indexed.

mapcat

As evidence for what a lame-brain I am, I didn’t even know about the concat function before I started doing the 4clojure problems. This function is like Common Lisp’s append: it takes a bunch of collections as arguments, and combines them all into one collection.

Now, suppose you wanted to use map for something, and you need the function you’re mapping to return a collection, but you want to have a single collection containing all the items, rather than a collection of collections. For instance, say you have a list containing several vectors, each containing some sets, like this:

(def td '([#{:a, :b}, #{:c, :delete-me}], 
          [#{:and :all :our :yesterdays :have :lighted :fools},    
           #{:the :way :to :dusty :death}], 
          [#{:delete-me, :goodbye-cruel-world}, 
           #{:out, :brief :candle :!}]))

and you want to map over the vectors and get rid of all the sets containing :delete-me, returning a sequence of all the sets that don’t contain it. Here’s some code that does that.

(map (fn [v]
       (remove #(contains? % :delete-me) v))
     td)

That’s all right, but look at the output:

((#{:a :b}) 
 (#{:lighted :yesterdays :and :have :our :all :fools} 
  #{:to :way :dusty :death :the}) 
 (#{:brief :candle :! :out}))

Aside from the fact that it jumbled the words and ruined Shakespeare’s poetry, every item is still inside a list. So we can do this to fix it:

(apply concat
       (map (fn [v]
              (remove #(contains? % :delete-me) v))
            td))
=> (#{:a :b} 
    #{:lighted :yesterdays :and :have :our :all :fools} 
    #{:to :way :dusty :death :the} #{:brief :candle :! :out})

mapcat does exactly what the code above does: it first maps a function (which is expected to return a collection) over something, and then applies concat to the resulting sequence. So we could write the code as

(mapcat (fn [v]
          (remove #(contains? % :delete-me) v))
        td)
=> (#{:a :b} 
    #{:lighted :yesterdays :and :have :our :all :fools} 
    #{:to :way :dusty :death :the} #{:brief :candle :! :out})

and get the same effect.

mapcat is also useful when you need the effect of a function that returns multiple values. Just have your function return a collection, and mapcat that function over a sequence. Then the items from all the individual return values will be strung together into a single collection, as if your function did a regular return, but gave back multiple values. An example, which we’ll see again later with lazy-cat:

(mapcat #(repeat % %) (range 5))
=> (1 2 2 3 3 3 4 4 4 4)
;; For each n in (range 5), generates a list containing 
;; n copies of n. E.g. (2 2)---two 2s, (3 3 3)---three 3s, etc. 
;; Concats them all into a single sequence.

Lazy sequences and how to make them

One of Clojure’s selling points is lazy evaluation. Haskell was the first language I ever knew to incorporate this somewhat bizarre feature.

Most popular languages use eager evaluation—everything gets evaluated as soon as it appears. For example, here’s a Python function that crashes the interpreter:

def fib_seq(a, b):
    return [fib_seq(b, a+b)].insert(0, a)

It crashes because when you call it, it says “Hold on, before I can get you the fib_seq of a and b, I have to get you a list containing the fib_seq of b and a + b, and then put a on it. But before I can do that, I have to get a list containing the fib_seq of a + b and b + a + b and put b on it. But before I can do that…” This leads to infinite recursion, because the interpreter never finishes getting what it needs; it eventually dies with a RecursionError when it hits the recursion limit.

It’s probably the case that you didn’t really want a list of all the Fibonacci numbers anyway. You probably just wanted a few of them. In eager languages, you’d usually solve this by passing in some kind of limit so the function can check whether it’s done. The limit has to be passed as an argument to every recursive call of the function, so it takes up space on the call stack and clutters up calls to the function. Still, it’s better than crashing the interpreter.

In Python itself, generators are a much nicer way to solve this problem. In fact, they let you achieve behavior very much like Clojure’s laziness. But I just used Python as an example for eager languages in general; in Java or C you don’t have laziness or generators, so you’d have to pass in limits somehow. Still, let’s imagine that Python doesn’t have generators. Wouldn’t it be nice if you could do something like this?

def fib_seq(a, b):
    return [fib_seq(b, a+b)].laze().insert(0, a)

The effect here would be that calling laze() on the list causes any recursive calls inside the list to go unevaluated until they’re actually wanted. It’s as if you’re telling the interpreter “I’d like a Fibonacci sequence, don’t know how big just yet, so could you hold off until I do know?”

Clojure’s version of the fictional laze() method is the lazy-seq function. You define a function that would ordinarily have infinite recursion, like this:

(defn fib-seq [a b]
    (cons a (fib-seq b (+ a b))))

But you wrap the recursive call in lazy-seq, like this:

(defn fib-seq [a b]
    (cons a (lazy-seq (fib-seq b (+ a b)))))

This example comes from the lazy-seq page on ClojureDocs. You wrap the infinite recursion in lazy-seq, telling the compiler “Yes, I’d like infinite recursion, but no, I wouldn’t like it all at once. Hold off on that until later, wouldn’t you?” Later, when you know how much recursion you actually do want, you can use various functions to get it. take grabs the number of items you give it:

(take 17 (fib-seq 0 1))
=> (0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987)

take-while grabs items until the predicate you give it becomes untrue:

(take-while (partial > 1000) (fib-seq 0 1))
=> (0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987)

Here we generate Fibonacci numbers until we find one that’s larger than 1000.

The big deal with lazy evaluation is that absolutely nothing is evaluated until it’s explicitly asked for. Lazy sequences in Clojure have to be realized—this means that some other code has to ask for them to be evaluated. If you just called (fib-seq 0 1), nothing would happen just yet; you’d create a lazy sequence, which you could save somewhere and evaluate later using other functions. This is somewhat similar to the way arguments to macros work—they aren’t evaluated unless you specifically ask for it. Things are different in the REPL, though. The REPL always asks for things to be evaluated right away so it can print them. So if you called (fib-seq 0 1) in the REPL, you’d keep getting Fibonacci numbers until your JVM crashed, just like in the fictional Python example above.

You can also use lazy-seq to create recursive functions that produce sequences, without using loop or stack-consuming direct recursion. One of the 4clojure problems is to implement map. Here’s my solution, which you should not look at if you want to solve the problem on your own.

(fn my-map [f s]
  (if (empty? s)
    nil
    (cons (f (first s)) (lazy-seq (my-map f (rest s))))))

Haskell, to my knowledge, uses lazy evaluation for everything by default, but Clojure uses eager evaluation most of the time, so you have to tell it when you want laziness. Using lazy evaluation for everything can let you do some amazing compiler optimizations, but it can also make reasoning about your code very hard, and it can mess with timing-based things like IO. One reason Haskell uses monads for IO is to force that code into a definite order of evaluation, so that your IO gets done exactly when you wanted it. Otherwise the compiler might decide that it doesn’t need that result right now, and delay it or never evaluate it at all.

lazy-cat: Not orange, doesn’t eat lasagna

lazy-cat is to lazy-seq as mapcat is to map. If you have a recursive function whose return value is a collection, but you want the items to be part of the outer collection, you can use lazy-cat to produce the sequence.

Here’s an example from ClojureDocs’s lazy-cat page:

(defn n-times-n [n]
    (lazy-cat (repeat n n) (n-times-n (inc n))))
(take 12 (n-times-n 1))
=> (1 2 2 3 3 3 4 4 4 4 5 5)

That function starts by taking 1 and calling (repeat 1 1). This returns (1) (i.e. a list with one 1 in it). The function then recurses, with 2, and calls (repeat 2 2), which produces (2 2) (i.e. a list with two 2s in it). If we’d used (cons (repeat n n) (lazy-seq... like before, we’d end up with ((1) (2 2)) as a result. But lazy-cat calls concat, so we instead get (1 2 2) out of this.

Conclusion

There are lots of other useful things I learned on 4clojure, but I’ve already written almost two thousand words, so I’ll leave you to discover them on your own. A lot of the early problems are easier if you know about iterate and intimately understand how for and reduce work. Early on I found myself falling back on loop in cases where one of those three would have solved everything.

Since 4clojure doesn’t let you use defn, it also helps if you know these two things about anonymous functions:

  1. You can give names to anonymous functions created with fn. You can use the name as a recursion point. See my above map example.
  2. You can’t nest two anonymous functions created with the #(...) syntax, but you can do whatever you want with the (fn [arg1, arg2] ...) syntax. To wit:
    ;; Illegal
    #(map #(when (< 20 %) %) %)
    ;; Peachy
    (fn [s] (map #(when (< 20 %) %) s))
    ;; Hunky-dory
    (fn [ks, vs] (map #(when (and (< 20 %1) (> 20 %2)) %2) ks vs))
    ;; Totally kosher
    (fn [ks, vs] (filter #(and (pos? %) (< 100000 %))
                         (map #(when 
                                 (and (< 20 %1) (> 20 %2)) %2) 
                                ks vs)))
    

    And of course, any of the #(...)-using functions could also have used the (fn [args] ...) syntax, say to allow destructuring inside the binding form.

That’s it for now. Just remember to stay out of 4clojure; it’ll ruin your life.