What is the difference between a cyclic list and a

2019-04-03 18:31发布

问题:

Referencing @dfeuer's answer to this question: Least expensive way to construct cyclic list in Haskell, which says that using cyclic lists 'defeats' the garbage collector as it has to keep everything you've consumed from a cyclic list allocated till you drop the reference to any cons cells in the list.

Apparently in Haskell a cyclic list and an infinite list are two separate things. This blog (https://unspecified.wordpress.com/2010/03/30/a-doubly-linked-list-in-haskell/) says that if you implement cycle as follows:

cycle xs = xs ++ cycle xs

it is an infinite list, not a cyclic list. To make it cyclic you have to implement it like this (as is found in the Prelude source code):

cycle xs = xs' where xs' = xs ++ xs'

What exactly is the difference between these two implementations? And why is it that if you are holding onto one cons cell somewhere in a cyclic list, that the garbage collector has to keep everything before it allocated as well?

回答1:

The difference is entirely in the memory representation. From the point of view of the semantics of the language, they're indistinguishable—you can't write a function that can tell them apart, so your two versions of cycle are considered two implementations of the same function (they're the exact same mapping of arguments to results). In fact, I don't know if the language definition guarantees that one of those is cyclical and the other infinite.

But anyway, let's bring out the ASCII art. Cyclical list:

   +----+----+                 +----+----+
   | x0 |   ----->   ...   --->| xn |    |
   +----+----+                 +----+-|--+
     ^                                |
     |                                |
     +--------------------------------+

Infinite list:

   +----+----+
   | x0 |   ----->  thunk that produces infinite list
   +----+----+

The thing with the cyclical list is that from every cons cell in the list there is a path to all of the others and itself. This means that from the point of view of the garbage collector, if one of the cons cells is reachable, then all are. In the plain infinite list, on the other hand, there aren't any cycles, so from a given cons cell only its successors are reachable.

Note that the infinite list representation is more powerful than the cyclical one, because the cyclical representation only works with lists that repeat after some number of elements. For example, the list of all prime numbers can be represented as an infinite list, but not as a cyclical one.

Note also that this distinction can be generalized into two distinct ways of implementing the fix function:

fix, fix' :: (a -> a) -> a
fix  f = let result = f result in result
fix' f = f (fix' f)

-- Circular version of cycle:
cycle  xs = fix (xs++)

-- Infinite list version of cycle:
cycle' xs = fix' (xs++)

The GHC libraries go for my fix definition. The way GHC compiles code means that the thunk created for result is used both as the result and the argument of the application of f. I.e., the thunk, when forced, will call the object code for f with the thunk itself as its argument, and replace the thunk's contents with the result.



回答2:

Cyclic lists and infinite lists are different operationally, but not semantically.

A cyclic list is literally a loop in memory - imagine a singly linked list with the pointers following a cycle - so takes up constant space. Because each cell in the list can be reached from any other cell, holding onto any one cell will cause the entire list to be held onto.

An infinite list will take up more and more space as you evaluate more of it. Earlier elements will be garbage collected if no longer needed, so programs that process it may still run in constant space, although the garbage collection overhead will be higher. If earlier elements in the list are needed, for example because you hold onto a reference to the head of the list, then the list will consume linear space as you evaluate it, and will eventually exhaust available memory.

The reason for this difference is that without optimisations, a typical Haskell implementation like GHC will allocate memory once for a value, like xs' in the second definition of cycle, but will repeatedly allocate memory for a function invocation, like cycle xs in the first definition.

In principle optimisations might turn one definition into the other, but because of the quite different performance characteristics, it's unlikely that this will happen in practice as compilers are generally quite conservative about making programs behave worse. In some cases the cyclic variant will be worse because of the garbage collection properties already mentioned.



回答3:

cycle xs = xs ++ cycle xs            -- 1
cycle xs = xs' where xs' = xs ++ xs' -- 2

What exactly is the difference between these two implementations?

Using GHC, the difference is that implementation #2 creates a self-referential value (xs'), while #1 merely creates a thunk which happens to be the same.

And why is it that if you are holding onto one cons cell somewhere in a cyclic list, that the garbage collector has to keep everything before it allocated as well?

This is again GHC specific. As Luis said, if you have a reference to one cons cell in a cyclic list, then you can reach the whole list just by going around the cycle. The garbage collector is conservative and won't collect anything that you can still reach.


Haskell is pure and where-refactoring is sound... only when you disregard memory usage (and a few other things like CPU usage and computation time). Haskell the language does not specify what the compiler should do to differentiate #1 and #2. GHC the implementation follows certain memory management patterns which are sensible, but not immediately obvious.