The Immutable Data-Structure Canon
In the April issue of Communications of the ACM, George V. Neville-Neil (a.k.a. Kode Vicious) reviews the fundamental data-structures – array, list, tree, and hash table. Because of the importance of data-structures for all programming, he calls it the “data-structure canon” (requires subscription). Taking a cue from that article, I’m starting a series of posts about a special kind of implementation of the basic data-structures: the immutable lists, vectors, and maps in Clojure.
Clojure
Clojure is a Lisp dialect for the Java Virtual Machine with a focus on functional programming. It has strong support for clean multithreaded design and a key ingredient are immutable, persistent data-structures.
Immutability is important for clean multithreaded programs, because it allows you to separate values from identities. Rich Hickey, Clojure’s creator, defines these terms as follows:
By identity I mean a stable logical entity associated with a series of different values over time. Models need identity for the same reasons humans need identity – to represent the world. How could it work if identities like ‘today’ or ‘America’ had to represent a single constant value for all time? Note that by identities I don’t mean names (I call my mother Mom, but you wouldn’t).
So, for this discussion, an identity is an entity that has a state, which is its value at a point in time. And a value is something that doesn’t change. 42 doesn’t change. June 29th 2008 doesn’t change. Points don’t move, dates don’t change, no matter what some bad class libraries may cause you to believe. Even aggregates are values. The set of my favorite foods doesn’t change, i.e. if I prefer different foods in the future, that will be a different set.
When you have pure functions dealing with values, problems with multiple threads simply go away, because there is no danger of a value being changed from under a function’s backside. Of course, the world is not as easy as that, and programs need to model state that can change.
In Clojure, an identity can be in different states – i.e. have different values – over time. Identities are implemented as atomic references to values using so called Refs and Agents. But these constructs do not concern us in this series, I mentioned them only to give a background on the rationale for using immutable data-structures to model values.
Abstractly speaking, new values are functions of old values. For example, let us define a list and bind it to the name l1.
(def l1 (list 1 2 3))
When we want to add another element, we do not modify the existing list, but we call the function cons create new list with one more element.
(def l2 (cons 42 l1))
l2
=> (42 1 2 3)
l1
=> (1 2 3)
If we had passed l1 to a function in another thread, it could happily work on the initial three elements without being bothered by our addition of 42.
Your initial impression is probably that there is a lot of copying involved in the implementation of the immutable data-structures, but that ain’t so. Clojure’s data-structures maintain close to the same performance guarantees that their mutable counterparts make.
In the following posts, we look at the implementation of lists, vectors, and maps one by one.
