Clojure head retention

Posted 2019-01-16 10:26

Question:

I'm reading the Clojure Programming book from O'Reilly.

I came across an example of head retention. The first example retains a reference to d (I presume), so it doesn't get garbage collected:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d) (count t)])
;= #<OutOfMemoryError java.lang.OutOfMemoryError: Java heap space>

The second example doesn't retain it, so it runs with no problem:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count t) (count d)])
;= [12 99999988]

What I don't get here is what exactly is retained in which case and why. If I try to return just [(count d)], like this:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d)])

it seems to cause the same memory problem.

Further, I recall reading that count always realizes/evaluates a sequence, so I need that clarified.

If I return (count t) first, how is that faster or more memory efficient than not returning it at all? And what gets retained in which case, and why?

Answer 1:

In both the first and the final examples the original sequence passed to split-with is retained while being realized in full in memory; hence the OOME. The way this happens is indirect; what is retained directly is t, while the original sequence is being held onto by t, a lazy seq, in its unrealized state.

The way t causes the original sequence to be held is as follows. Before it is realized, t is a LazySeq object storing a thunk that may be called at some point to realize t. Until then, this thunk must keep a pointer to the original sequence argument to split-with so that it can pass it on to take-while (see the implementation of split-with). Once t is realized, the thunk becomes eligible for GC (the field that holds it in the LazySeq object is set to null) and t no longer holds the head of the huge input seq.
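To make the thunk's role concrete, here is a minimal sketch at a small, safe scale. The helper my-split-with is a hypothetical stand-in that mirrors the documented behaviour of split-with (a vector of take-while and drop-while over the same coll), and realization-states is a name invented for this illustration.

```clojure
;; Hypothetical stand-in for clojure.core/split-with: both halves are
;; lazy seqs built over the very same input coll.
(defn my-split-with [pred coll]
  [(take-while pred coll) (drop-while pred coll)])

(def realization-states
  (let [[t _]  (my-split-with #(< % 12) (range 100))
        before (realized? t)   ; thunk has not run yet
        n      (count t)       ; forces t, which runs the thunk
        after  (realized? t)]  ; thunk has run and can be dropped
    {:before before :count n :after after}))

;; realization-states => {:before false, :count 12, :after true}
```

While t is unrealized, its thunk is what keeps the input seq reachable; the sketch only shows the state change, not the memory effect.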

The input seq itself is being realized in full by (count d), which needs to realize d, and thus the original input seq.

Moving on to why t is being retained:

In the first case, this is because (count d) gets evaluated before (count t). Since Clojure evaluates these expressions left to right, the local t needs to hang around for the second call to count, and since it happens to hold on to a huge seq (as explained above), that leads to the OOME.
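The left-to-right evaluation order can be checked with a small sketch. The names evaluation-log, traced and result are made up for this illustration; the side effect just records when each element of the vector literal is evaluated.

```clojure
;; Records the order in which expressions are evaluated.
(def evaluation-log (atom []))

(defn traced
  "Logs label, then returns value unchanged."
  [label value]
  (swap! evaluation-log conj label)
  value)

;; Elements of a vector literal are evaluated left to right.
(def result [(traced :first 1) (traced :second 2)])

;; result          => [1 2]
;; @evaluation-log => [:first :second]
```

In [(count d) (count t)] the same ordering means t is still live while d is being counted.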

The final example where only (count d) is returned should ideally not hold on to t; the reason that is not the case is somewhat subtle and best explained by referring to the second example.

The second example happens to work fine, because after (count t) is evaluated, t is no longer needed. The Clojure compiler notices this and uses a clever trick to have the local reset to nil simultaneously with the count call being made. The crucial piece of Java code does something like f(t, t=null), so that the current value of t is passed to the appropriate function, but the local is cleared before control is handed over to f, since this happens as a side effect of the expression t=null which is an argument to f; clearly here Java's left-to-right semantics are key to this working.

Back to the final example, this doesn't work, because t is not actually used anywhere and unused locals are not handled by the locals clearing process. (The clearing happens at the point of last use; in absence of such a point in the program, there is no clearing.)

As for count realizing lazy sequences: it must do that, as there is no general way of predicting the length of a lazy seq without realizing it.
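A small sketch of count forcing a lazy seq, using a side-effecting counter to observe realization. All names here (realized-count, lazy-squares, and so on) are invented for the example.

```clojure
;; Counts how many elements of the lazy seq have been computed.
(def realized-count (atom 0))

(def lazy-squares
  (map (fn [x] (swap! realized-count inc) (* x x))
       (range 10)))

(def before-count @realized-count)   ; nothing computed yet
(def n (count lazy-squares))         ; walks, and so realizes, the whole seq
(def after-count @realized-count)    ; every element has been computed

;; before-count => 0, n => 10, after-count => 10
;; (counted? [1 2 3])      => true   ; vectors know their size
;; (counted? lazy-squares) => false  ; a lazy seq has to be walked
```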



Answer 2:

The answer by @Michał Marczyk, while correct, is a little difficult to comprehend. I find this post on Google Groups easier to grasp.

Here's how I understand it:

Step 1 Create a lazy sequence: (range 1e8). Values are not realized yet; I marked them with asterisks (*):

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ... * * *

Step 2 Create two more lazy sequences, which are "windows" through which you look at the original, huge lazy sequence. The first window contains only 12 elements (t), the other the rest of the elements (d):

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 

Step 3 - out of memory scenario - you evaluate [(count d) (count t)]. So, first you count elements in d, then in t. What will happen is that you will go through all values starting at the first element of d and realize them (marked as !):

* * * * * * * * * * * * * ! * * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                          ^
                         start here and move right ->

* * * * * * * * * * * * * ! ! * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                            ^

* * * * * * * * * * * * * ! ! ! * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                              ^

                     ...

; this is theoretical end of counting process which will never happen
; because of OutOfMemoryError
* * * * * * * * * * * * * ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ... ! ! !
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                                                                    ^

The problem is that all the realized values (!) are retained, because the head of the collection (the first 12 elements) is still needed: we still have to evaluate (count t). This consumes a lot of memory and eventually makes the JVM throw an OutOfMemoryError.

Step 3 - valid scenario - this time you evaluate [(count t) (count d)]. So we first want to count the elements in the smaller head sequence:

! * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
^
start here and move right ->

                        ! * * * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                        ^

Then we count the elements in the d sequence. The compiler knows that t isn't needed anymore and clears that local, so the garbage collector can reclaim elements that have already been counted, freeing up memory:

                          ! * * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                          ^

                            ! * * * * * * * * * * * * * * * ... * * *
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                            ^

                     ...

                                                            ...     !
t t t t t t t t t t t t t d d d d d d d d d d d d d d d d d ... d d d 
                                                                    ^

Now we can see that, because t wasn't needed anymore, the garbage collector was able to reclaim memory as the count moved through the large sequence.
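At a small scale both orderings can be run safely, which shows that they compute the same counts and differ only in what stays reachable while d is being walked. This is a sketch with (range 100) in place of (range 1e8); the names t-then-d and d-then-t are made up here.

```clojure
;; Valid scenario: t is counted first, so it is no longer needed
;; by the time d is being counted.
(def t-then-d
  (let [[t d] (split-with #(< % 12) (range 100))]
    [(count t) (count d)]))

;; Out-of-memory scenario at full scale: t must stay alive until
;; after d has been walked. Harmless with only 100 elements.
(def d-then-t
  (let [[t d] (split-with #(< % 12) (range 100))]
    [(count d) (count t)]))

;; t-then-d => [12 88]
;; d-then-t => [88 12]
```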