IDSC IV: Creating and Growing Vectors

Vectors provide constant time random access to any element referenced by an index. Like their fixed length cousins, arrays, they are usually implemented by storing their elements in consecutive memory locations. Such a strait forward implementation, however, doesn’t allow for immutability – at least not when we are interested in the performance characteristics, but in that case we could go with lists anyway.

This post is part of the immutable data-structure canon. While the implementation of immutable lists is relatively simple, immutable persistent vectors require quite a bit of work. I decided to split this topic into two parts: this post covers how vectors are created and how elements; we’ll also learn about the structure used to store the values. The second part to be posted next week, covers how vectors are accessed and how elements are modified and deleted.

Before we dive in, I need to correct an error in the previous parts of the series. Until now, I used “sequence” and “seq” synonymously as a generic term for data-structures that can hold multiple values. I realized now that the correct generic term is collection. Sequence in the Clojure context is a collection that a series of values without reordering them where values may or may not exist yet. seq specifically refers to an API for using collections with the funtions first and rest. Henceforth I shall use the right terminology. And now: vectors.

To achieve both immutability and performance, Clojure only stores limited segments of a vector in a row in memory. The overall structure used to represent the vector behind the scenes is a tree. Each of the tree’s nodes holds an array that contains a segment of the vector. Functions that “modify” a vector return a new object that shares all the nodes not affected by the operation with the original. Only the segments that are different get stored in separate node objects.

The implementation of vectors is in the Java class clojure.lang.PersistentVector. Instances of that class provide the API through which vectors are used and they hold a reference to the root of the tree. The nodes are implemented in the internal class Node.

Creating Vectors

To create a vector in the Java part of the language, you call the static method create from PersistentVector.java:

static public PersistentVector create(List items){
	TransientVector ret = EMPTY.asTransient();
	for(Object item : items)
		ret = ret.conj(item);
	return ret.persistent();
}

The actual work here is done by TransientVector, an internal class used for efficiently modifying vectors. The method starts by creating an empty TransientVector (line 2). The elements are inserted into it by one by calling conj (lines 3, 4). Finally, the resulting transient is converted to a persistent representation (line 5).

Transients are a feature of Clojure that allows you (and the language’s internals) to use mutable data-structures in performance critical parts. Transients are thread-save and they cannot be shared with other code. As we’re looking at immutable data-structures here, we won’t go into the details of the implementation here. The vector that we get from the call to persistent has the same structure we would get by starting with an empty persistent vector and adding the values with immutable operations. So let’s look at that.

Adding Elements

When elements are added to a PersistentVector, the cons method is called.

public PersistentVector cons(Object val){
	int i = cnt;
	//room in tail?
	if(cnt - tailoff() < 32)
		{
		Object[] newTail = new Object[tail.length + 1];
		System.arraycopy(tail, 0, newTail, 0, tail.length);
		newTail[tail.length] = val;
		return new PersistentVector(meta(), cnt + 1, shift, root, newTail);
		}
	//full tail, push into tree
	Node newroot;
	Node tailnode = new Node(root.edit,tail);
	int newshift = shift;
	//overflow root?
	if((cnt >>> 5) > (1 << shift))
		{
		newroot = new Node(root.edit);
		newroot.array[0] = root;
		newroot.array[1] = newPath(root.edit,shift, tailnode);
		newshift += 5;
		}
	else
		newroot = pushTail(shift, root, tailnode);
	return new PersistentVector(meta(), cnt + 1, newshift, newroot, new Object[]{val});
}

For some reason, the method starts with initializing a variable i that is never used (line 2). After that, we are confronted with the details of structure used to store the values.

Why tail? Wasnt this supposed to be a tree?

Trees don’t have tails – not even the strange computer science kind that grows downward. The tail here is not part of the tree, it’s an array holding values that are not part of the tree. Up to 32 values end up in the tail, the rest goes into the tree. As the name suggests, the last elements of the vector get stored in the tail.

32 is a magic number for vectors: the leaves of the tree have exactly 32 values, the other nodes have up to 32 children, and the aforementioned 32-element tail. The tailoff method called in line 4 returns the greatest multiple of 32 that’s less than the current length, i.e. the number of elements in stored in the tree. The field cnt is the length of the vector.

When there is still room in the tail, the tail array is copied into a new array (lines 6, 7) and the new value is appended (line 8). The return value is a new PersistentVector instance with the new tail array, an incremented count, and otherwise the same properties as the original vector (line 9). In this case adding an element is a constant time operation.

Only when the vector to which we’re adding has exactly 32*n elements (with n being a positive integer), we need to deal with the tree. In that case, a new tree node to hold the original vector’s tail is created (line 13). Then we create a tree with the new node in it (lines 14-24), and we return a new instance with the updated tree, the incremented count, and a new tail array that only contains the added element (line 25).

Hey, dont just hand-wave about the tree construction

When constructing the tree, we have two cases: if the tree is completely full at the current height, the resulting tree needs a new level (lines 18-21). Otherwise, we only need to find the right place for the new node (line 24).

The field shift is the height of the tree multiplied by 5. The height is represented in this form so that it can be used with the count and the capacity in bit-shift operations for additional efficiency. When we add the 33rd element, cnt is 32, as that is the value of the original vector. shift is 5, the initial value when there is only an empty root node. As a consequence, the root of the tree for the returned vector is constructed by calling pushTail.

private Node pushTail(int level, Node parent, Node tailnode){
	int subidx = ((cnt - 1) >>> level) & 0x01f;
	Node ret = new Node(parent.edit, parent.array.clone());
	Node nodeToInsert;
	if(level == 5)
		{
		nodeToInsert = tailnode;
		}
	else
		{
		Node child = (Node) parent.array[subidx];
		nodeToInsert = (child != null)?
		                pushTail(level-5,child, tailnode)
		                :newPath(root.edit,level-5, tailnode);
		}
	ret.array[subidx] = nodeToInsert;
	return ret;
}

The method starts by calculating the index where to add the new node depending on the length of the original vector and the level of the tree where we are looking to insert the node (line 2). In the example of adding the 33rd value, the index of the new node is 0, as this is the first node to be pushed into the root. When we add the 65th element – thus having filled up the tail for the second time – we get the index 1.

The return value of pushTail is a new node that clones the references to all the children of the original parent node, so that the child nodes are shared between the old and the new vector (line 3). If we are at the level where the child nodes are leaves, we directly insert the node we’re supposed to add at the calculated index of the returned node (lines 7, 16, 17).

When the tree is already higher, we cannot add the leaf node directly to the parent node. Instead we find the child node at the calculated index (line 11) and use the recursive nature of trees to call pushTree again for that subtree with a lower level (line 13). If no child node in the calculated place exists, we call newPath to create a new subtree of the appropriate hight that contains nodes with exactly one child down to our newly added leaf node (line 14). These operations only copy the subtree where the new value is added. Because of the 32-children condition these are at most O(log32 N) steps.

This was the part of the insertion process, when we can insert into the tree without adding another level. Back in the cons method, we still need to look at the case where we grow the tree in height (lines 18-21). Here we create an all new root for the returned vector (line 18) that gets the original vector’s root as the first child (line 19) and a new subtree of the right height with the new leaf node created by newPath as the second child (line 20). Finally, the field shift for the new vector is incremented by 5 to reflect the new height (line 21). As the number of steps for this operation is bounded by the height of the tree, the complexity here is also O(log32 N), which is consequently the complexity of the whole algorithm.

Summary

In this post we have mainly looked at adding elements to immutable vectors. The algorithm and data-structure described here is used for both creating and growing vectors, although the creation process uses optimizations based on mutable state which we did not discuss.

I want to close this post with an illustration of the structure using a before/after example. Let us assume we have a vector with 1056 elements and want to append another value calling cons.

Growing Vector

The old vector (shown in blue) has a root node with 32 children each holding 32 values, and a tail with 32 more values. This means the tree is full at is current level. The new vector (shown in red) has another layer in its tree. The first child of the new root is the old root, thus sharing the structure of the old vector without copying. The second child of the new root is a node that has one leaf as child which contains a reference to the old tail – again, sharing not copying. The tail of the new vector contains a single value, the one we added.