Haskell shying away from probabilistic data struct

2019-03-14 14:43发布

问题:

If you search for skips lists implemented in Haskell, you won't find many. It is a probabilistic data structure needing a random number generator, meaning that any of these structures would need to run in the IO monad.

Are Haskell folks staying away from these data structures because it's not possible to implement them purely? How can Haskell deal with them?

回答1:

A pseudo-random number generator can of course be used outside of IO, by simply storing the current generator value along with the probabilistic pure data structure and updating it when constructing modified versions. The downside to this is that the PRNG will be more obviously deterministic than in an impure program, since nothing outside of the single data structure will update it. If only the statistical properties matter this poses no issue, but could be cause for concern otherwise.

On the other hand, hiding an impure PRNG is arguably a justified use of unsafePerformIO as in Ganesh Sittampalam's answer. This blatantly violates referential transparency, but only to the extent that the PRNG will return unpredictable, inconsistent values--which is the whole point! Caution is still required, however, as the compiler may make incorrect assumptions about the code because it looks pure.

But really, neither approach is terribly appealing. Using unsafePerformIO is unsatisfying and potentially dangerous. Threading a PRNG state is easy, but imposes a (potentially spurious) strict sequencing on any computations that use it. Neither safety nor laziness are relinquished lightly by Haskell programmers (and rightly so!), and of course data structures restricted to IO are of limited utility. So, to answer part of your question, that's why Haskell programmers are likely to avoid such structures.


As for "how Haskell can deal with" such things, I would suggest that this is the wrong question to ask.

What it really comes down to is that many data structures and algorithms implicitly assume (and optimize for) an imperative, impure, strict language, and while it's certainly possible to implement these in Haskell it's rarely desirable, because (even ignoring the internal implementation) using them imposes on your code a structure and approach that is very much not idiomatic. Furthermore, because Haskell violates those implicit assumptions, performance is often degraded (sometimes badly so).

The thing to realize is that algorithms and data structures are a means, not an end. It's rarely the case that a single specific implementation is required--what is required is generally certain performance characteristics. Finding data structures/algorithms that offer the desired characteristics while also being idiomatic Haskell is almost always a better plan, and is likely to perform better than trying to cram a strict imperative peg into a lazy functional hole.

This mistake is perhaps most commonly found in the subset of programmers who never met a problem they couldn't solve with a hash table, but the habit is easy to fall into for many of us. The correct approach is to stop thinking "how do I implement this solution in Haskell", but instead "what is the best way to solve my problem in Haskell". You might be surprised how often the answers differ; I know I often am!



回答2:

Skip lists can be implemented purely -- just encapsulate the current seed in the state of the skip list itself.

data SkipList a = SkipList StdGen (Node a)
data Node a = ...

This may expose you to some complexity attacks that aren't practical against 'real' skip lists, since you could probe for degenerate insertion orders and replay attacks against the same seed, but it lets you derive the benefits of the structure when adversarial usage is not a problem.

You could also fall back on unsafePerformIO and a carefully crafted side-effect-oblivious seemingly-pure interface. While admittedly it is not pure internally, the interface gives the appearance of purity.

That said, many of the classical performance benefits from skiplists come from when they can be implemented non-persistently, and that precludes a functional interface.



回答3:

Since skiplists have a pure interface, it would be valid to make an implementation using IO internally and to wrap that with unsafePerformIO for the interface. This simply moves the burden of "getting it right" from the language to the programmer (which is where the burden always lies in impure languages).



回答4:

I once had a go at implementing a skip list in Haskell. Of course it was an immutable data structure (this is Haskell, after all). But that meant that the need for randomness disappeared; "fromList" just counted items and built skip arrays of the right length for each item (2 pointers every 4th item, 3 every 16th, 4 every 64th, etc).

At that point I realised I was just building a more complicated version of a tree with a lot less ability to mutate it. So I gave up.



回答5:

Random generators don't require IO operations. They follow their own monadic laws (kinda derived from the State monad) and are therefore representable through a Random monad.

In case of the skip list, you could define your own monad that's able to carry around the probabilistic computation or just use standard Random.

demo :: Random Int
demo = do
 let l = SkipList.empty

 l2 <- l `add` ("Hello", 42)

 return $ l2 `get` "Hello"


回答6:

There is a new skip list implementation based on STM for haskell, see tskiplist on hackage.



回答7:

Well, first, the random number generator in the IO monad is there for convenience. You can use random number generators outside the IO monad; see System.Random. But, yes, you do need to maintain state; the ST monad is useful here. And, yes, i'd say Haskell programmer's prefer the pure data structures.