Lazy Sequences in Clojure
Jun 1, 2010
Clojure has the wonderful ability to represent and manipulate infinite sequences using small, finite amount of memory. Look at the following example:
user=> (def p (iterate inc 1)) user=> (take 10 p) (1 2 3 4 5 6 7 8 9 10)
The `iterate' function: (iterate f x) - generates an infinite sequence x, f(x), f(f(x)), f(f(f(x))), and so on. How is it that an infinite sequence is stored in a small, finite number of bytes in memory? The logic is simple - there is really nothing infinite about the sequence - what we have in memory is a simple two element sequence, the first element being `x' itself and the next element a function which will compute the subsequent elements of the sequence when required. When we request the second element of the sequence, this function computes the second element plus yet another function which will yield the remaining elements of the sequence and so on.
The `take' function in the above examples gives us the first 10 elements of the sequence.
Though infinite sequences are easy to use, there is a subtle problem. Let's try an experiment:
user=> (def p (range 0 1e7)) user=> (time (reduce + p)) user=> (time (reduce + p))
On my slow AMD Sempron system, the first `reduce' took around 12 seconds while the second one took only 1.1 second! Why is this so? The `range' function returns a lazy sequence. The first time `reduce' was called, each element of the sequence had to be computed. The next time `reduce' is called, Clojure no longer has to compute the individual elements - the elements `cached' during the previous call to `reduce' can be used directly. This seems to be a good idea.
Let's look at another example:
user=> (def p (range 0 1e11)) user=> (nth p 1000000000)
Keep the `top' command (which displays information about system memory, processes etc continuously) running in another console before executing the above two functions. You will observe that as soon as you call the function `nth', available memory starts dropping; very soon your system is forced to use swap memory; if you wait long enough, you may even be able to see your program crashing.
To complicate things further, you will see that the following code fragment:
user=> (nth (range 0 1e11) 1000000000)
works perfectly and does not consume excessive amount of RAM.
Why is this so?
The nth element of a sequence can be visited only after visiting the first (n-1) elements. As each element is `visited', it is computed and stored in memory. Because you have a reference `p' pointing to the first element of the sequence, none of these computed elements can be freed up. So, if `n' is sufficiently large, very soon all available memory gets eaten up by these `cached' elements.
In the second case, we have no reference pointing to the head of the sequence. The JVM is free to garbage collect the memory used by the already computed elements of the sequence to make way for the newly computed elements.