I come from an imperative background and am trying to implement a simple disjoint sets (“union-find”) data structure to get some practice with creating and modifying (persistent) data structures in Haskell. The goal is to have a simple implementation, but I am also concerned about efficiency, and my question is related to this.
First, I created a disjoint-set forest implementation with union by rank and started by defining a data type for a “point”:
data Point = Point
{ _value :: Int
, _parent :: Maybe Point
, _rank :: Int
} deriving Show
A disjointed set forest is an IntMap
with Int → Point
mappings:
type DSForest = IntMap Point
empty :: DSForest
empty = I.empty
A singleton set is simply a mapping from its value x to a Point with value x, no parent and a rank of 1:
makeSet :: DSForest -> Int -> DSForest
makeSet dsf x = I.insert x (Point x Nothing 0) dsf
Now, the interesting part – union
. This operation will modify a point by setting the other point as its parent (and in some cases change its rank). In the case where the Point
s' rank are different, the Point
is simply “updated” (a new Point is created) to have its parent point to the other. In the case where they are equal, a new Point
is created with its rank increased by one:
union :: DSForest -> Int -> Int -> DSForest
union dsf x y | x == y = dsf
union dsf x y =
if _value x' == _value y'
then dsf
else case compare (_rank x') (_rank y') of
GT -> I.insert (_value y') y'{ _parent = Just x' } dsf
LT -> I.insert (_value x') x'{ _parent = Just y' } dsf
-- 1) increase x's rank by one:
EQ -> let x'' = x'{ _rank = _rank x' + 1 }
-- 2) update the value for x's rank to point to the new x:
dsf' = I.insert (_value x'') x'' dsf
-- 3) then update y to have the new x as its parent:
in I.insert (_value y') y'{ _parent = Just x'' } dsf'
where x' = dsf ! findSet dsf x
y' = dsf ! findSet dsf y
Now, to my real question, if in the EQ
case I had instead done the following:
EQ -> let dsf' = I.insert (_value x') x'{ _rank = _rank x' + 1} dsf
in I.insert (_value y') y'{ _parent = Just x'{ _rank = _rank x' + 1 }} dsf'
I.e. first insert a new Point
x with its rank increased, and then having y'
's parent be a new Point
x with its rank increased, would this mean that they no longer point to the same Point
in memory? (Does this even matter? Should I worry about these things when using/creating persistent data structures?)
And just for completeness, here is findSet
:
findSet :: DSForest -> Int -> Int
findSet dsf' x' = case _parent (dsf' ! x') of
Just (Point v _ _) -> findSet dsf' v
Nothing -> x'
(General comments about the efficiency and design of this code are also welcome.)
Sharing is a compiler thing. When it recognizes common sub-expressions, a compiler may chose to represent them both by the same object in memory. But even if you use such a compiler switch (like
-fno-cse
), it is under no obligation to do so, and the two might be (and usually are, in the absence of the switch) represented by two different, though of equal value, objects in memory. Re: referential transparency.OTOH when we name something and use that name twice, we (reasonably) expect it to represent the same object in memory. But compiler might choose to duplicate it and use two separate copies in two different use sites, although it is not known to do so. But it might. Re: referential transparency.
See also:
Here's few examples with list-producing functions, drawing from the last link above. They rely on the compiler not duplicating anything, i.e. indeed sharing any named object as expected from call by need lambda calculus operational semantics (as explained by nponeccop in the comments), and not introducing any extra sharing on its own to eliminate common subexpressions:
Sharing fixpoint combinator, creating a loop:
fix f = x where x = f x
Non-sharing fixpoint combinator, creating telescoping multistage chain (i.e. regular recursion chain)
_Y f = f (_Y f)
Two-stages combination - a loop and a feed
_2 f = f (fix f)
First comment: the disjoint-set union-find data structure is very, very difficult to do well in a purely functional way. If you are just trying to get practice with persistent data structures, I strongly recommend starting with simpler structures like binary search trees.
Now, to see one problem, consider your findSet function. It does not implement path compression! That is, it does not make all the nodes along the path to the root point directly to the root. To do that, you would want to update all those points in the DSForest, so your function would then return (Int, DSForest) or perhaps (Point, DSForest). Doing this in a monad to handle all the plumbing of passing the DSForest around be easier than passing that forest around manually.
But now a second issue. Suppose you modify findSet as just described. It still wouldn't do quite what you want. In particular, suppose you have a chain where 2 is a child of 1, 3 is a child of 2, and 4 is a child of 3. And now you you do a findSet on 3. This will update 3's point so that its parent is 1 instead 2. But 4's parent is still the old 3 point whose parent is 2. This may not matter too much, because it looks like you never really do anything with the parent Point except pull out its value (in findSet). But the very fact that you never do anything with the parent Point except pull out its value says to me that it should be a Maybe Int instead of a Maybe Point.
Let me repeat and expand on what I said at the beginning. Disjoint sets are a particularly hard data structure to handle in a functional/persistent way, so I strongly recommend starting with an easier tree structure like binary search trees or leftist heaps or even abstract syntax trees. Those structures have the property that all access goes through the root--that is, you always start at the root and work your way down through the tree to get to the right place. This property makes the kind of sharing that is the hallmark of persistent data structures MUCH easier.
The disjoint set data structure does not have that property. Instead of always starting at the root and working down to the nodes of interest, you start at arbitrary nodes and work your way back up to the root. When you have unrestricted entry points like this, often the easiest way to handle it is to mediate all the sharing through a separate map (DSForest in your case), but that means passing that map back and forth everywhere.
I don't think you should be concerned with this as this is just an implementation detail of the runtime system (aka RTS of Haskell) for immutable values.
As far as other suggestion is concerned, I would say make the function
findSet
return thePoint
itself rather than the key as that would eliminate the lookup inunion
.Make appropriate changes in the
union
function.