IDSC II: Lists

Lists are the simplest data-structures that can easily grow. In this post, as part of the immutable data-structure canon, we’re going to look at how immutable lists are implemented in Clojure.

Let us review the definition of list in general. A list is either empty or has a head and a tail. The head contains a data element and the tail is another list. A simple recursive definition. The list (1 2 3) for example has 1 as its head and the list (2 3) as tail. The head of that list is 2 and the tail is the list (3). Finally, that list has 3 as head and the empty list as tail.

List Illustration

This structure is implemented as clojure.lang.PersistentList in the Java part of Clojure.

We can create a list using the list function

(def l1 (list 1 2 3))

We also saved the list in a binding named l1. The general convention (at least in Lisps) is that adding elements happens at the front of the list. To add an element, we use cons.

(def l2 (cons 42 l1))
l2
=> (42 1 2 3)
l1
=> (1 2 3)

List Illustration

cons returns a list with the added element, but l still contains only 1 2 3. This property is called persistence. Conceptually this is easy. We create a new list with 42 as the head and what we called l as the tail. l still refers to the same list as before.

Because the addition does not have to go through the existing list, this operation takes constant time.

Given the recursive definition of list, immutably and persistently removing elements from the front of the list is simple as well. Things get trickier when we concatenate two lists.

(def ls (concat l1 l2))
l1
=> (1 2 3)
l2
=> (42 1 2 3)
ls
=> (1 2 3 42 1 2 3)

In a mutable implementation, we could simply change the last element of the first list to refer to the second list as tail, but that would destroy the persistence of l1.

List Illustration

What is the performance expectation here? You might be tempted to say, that it is constant, because we only need to change one pointer. But remember, that we are talking about a singly-linked list. That means, we can only find the last element of l1 by traversing the whole list. Thus the operation takes linear time.

Back to our original question, how do we concatenate two lists without modifying them?

Then we need to copy the lists, right?

For Clojure the answer is both yes and no.

Clojure’s solution here could be described as “copying light”. We create a new data-structure that has new entries for all the elements, but each entry is only really created when it is used. This is called a lazy sequence.

A lazy sequence behaves like a normal collection. The special power of lazy seqs is that they don’t contain all the data elements, they only know how they are generated. This means that the values in the seq are only created when they are accessed, allowing expensive calculations to be postponed. Lazy seqs can even represent infinite data sets – as long as the data-structure is not traversed completely. We will look into this construct in greater depth in the next part of this series.

Back to our immutable lists. How do lazy seqs help with concatenating two lists? The concat function does not actually return a lists, it returns a lazy seq. When we go through the list, a new list entry is created. Thus for the whole list, we get a newly created entry of each element of both lists, resulting in linear complexity. The data elements, however, are not copied as we can be sure that they are also values and thus won’t change.

Here is the implementation of concat in core.clj:

(defn concat
  "Returns a lazy seq representing the concatenation of the elements in the supplied colls."
  ([] (lazy-seq nil))
  ([x] (lazy-seq x))
  ([x y]
    (lazy-seq
      (let [s (seq x)]
        (if s
          (if (chunked-seq? s)
            (chunk-cons (chunk-first s) (concat (chunk-rest s) y))
            (cons (first s) (concat (rest s) y)))
          y))))

(I’ve omitted the implementation of concatenating more than two lists, as that is not essential to this discussion.)

Let us walk through this code. There are three definitions of of the function here, the dispatch depends on arity (i.e. with how many parameters this function is called). Lines 3 and 4 handle the cases that concat is called with zero or one parameters the obvious way, by just returning a lazy seq with the same contents that were passed as parameters.

Starting in line 5, we have the “real” case for concat: two parameters. Line 6 tells us that the result is a lazy seq. Lines 7-12 are only executed when an element of the list is read – that’s the laziness.

Line 7 makes sure the rest of the function is working with a sequence type, and line 8 makes sure that the first list to concatenate is not nil, otherwise the second list is returned in line 12.

Chunked seqs are an optimization that we’ll ignore here, so we can ignore lines 9 and 10. That leaves us with line 11. Here, we take the head of the first list, and add it (using cons) to the result of a recursive call to concat passing the tail of the first list and the second list.

When we go through the above example of concatenating (1 2 3) and (42 1 2 3), we get a lazy seq back immediately, without the function looking at any of the elements. Only when we go through that result seq, lines 7-12 get invoked. For the first element, we get an entry that contains 1 as head and another lazy seq – the result of (concat (list 2 3) (list 42 1 2 3)) – as the tail. Getting the second happens the same way, as we resolve the second lazy seq. The third, ditto. After the third element, the first list passed to the inner call to concatis empty, so that the else part of the if in line 8 kicks in, and the result of the call is simply the second list (line 12).

In summary

  • Lists in clojure are immutable
  • By default, elements are added to the front of the list
  • The linked structure makes it easy to implement immutable adding of elements
  • Concatenating lists results in a lazy sequence where new list entries are created “on-demand”
  • The data elements are not copied, as they are immutable values