Always guaranteed evaluation order of `seq` (with

2020-06-08 03:25发布

问题:

The documentation of seq function says the following:

A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that the both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. If you need to guarantee a specific order of evaluation, you must use the function pseq from the "parallel" package.

So I have a lazy version of sum function with accumulator:

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs

Obviously, this is extremely slow on big lists. Now I'm rewriting this function using seq:

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs

And I see huge performance increase! But I wonder how reliable it is? Did I get it by luck? Because GHC can evaluate recursive call first (according to documentation) and still accumulate thunks. It looks like I need to use pseq to ensure that acc' is always evaluated before recursive call. But with pseq I see performance decrease in compare to seq version. Numbers on my machine (for calculating sum [1 .. 10^7]:

  • naive: 2.6s
  • seq: 0.2s
  • pseq: 0.5s

I'm using GHC-8.2.2 and I compile with stack ghc -- File.hs command.

After I tried to compile with stack ghc -- -O File.hs command performance gap between seq and pseq is gone. They now both run in 0.2s.

So does my implementation exhibit the properties I want? Or does GHC has some implementation quirk? Why is pseq slower? Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?

回答1:

The answers so far have focused on the seq versus pseq performance issues, but I think you originally wanted to know which of the two you should use.

The short answer is: while both should generate nearly identically performing code in practice (at least when proper optimization flags are turned on), the primitive seq, and not pseq, is the correct choice for your situation. Using pseq is non-idiomatic, confusing, and potentially counterproductive from a performance standpoint, and your reason for using it is based on a flawed understand of what its order-of-evaluation guarantee means and what it implies with respect to performance. While there are no guarantees about performance across different sets of compiler flags (much less across other compilers), if you ever run into a situation where the seq version of the above code runs significantly slower than the pseq version using "production quality" optimization flags with the GHC compiler, you should consider it a GHC bug and file a bug report.

The long answer is, of course, longer...

First, let's be clear that seq and pseq are semantically identical in the sense that they both satisfy the equations:

seq _|_ b = _|_
seq a b = b -- if a is not _|_
pseq _|_ b = _|_
pseq a b = b -- if a is not _|_

This is really the only thing that either of them guarantees semantically, and since the definition of the Haskell language (as given in the Haskell report say) only makes -- at best -- semantic guarantees and does not deal with performance or implementation, there's no reason to choose between one or the other for reasons of guaranteed performance across different compilers or compiler flags.

Furthermore, in your particular seq-based version of the function sum, it's not too difficult to see that there is no situation in which seq is called with an undefined first argument but a defined second argument (assuming the use of a standard numeric type), so you aren't even using the semantic properties of seq. You could re-define seq as seq a b = b and have exactly the same semantics. Of course, you know this -- that's why your first version didn't use seq. Instead, you're using seq for an incidental performance side-effect, so we're out of the realm of semantic guarantees and back in the realm of specific GHC compiler implementation and performance characteristics (where there aren't really any guarantees to speak of).

Second, that brings us to the intended purpose of seq. It is rarely used for its semantic properties because those properties aren't very useful. Who would want a computation seq a b to return b except that it should fail to terminate if some unrelated expression a fails to terminate? (The exceptions -- no pun intended -- would be things like handling exceptions, where you might use seq or deepSeq which is based on seq to force evaluation of a non-terminating expression in either an uncontrolled or controlled way, before starting evaluation of another expression.)

Instead, seq a b is intended to force evaluation of a to weak head normal form before returning the result of b to prevent accumulation of thunks. The idea is, if you have an expression b which builds a thunk that could potentially accumulate on top of another unevaluated thunk represented by a, you can prevent that accumulation by using seq a b. The "guarantee" is a weak one: GHC guarantees that it understands you don't want a to remain an unevaluated thunk when seq a b's value is demanded. Technically, it doesn't guarantee that a will be "evaluated before" b, whatever that means, but you don't need that guarantee. When you worry that, without this guarantee, GHC might evaluate the recursive call first and still accumulate thunks, this is as ridiculous as worrying that pseq a b might evaluate its first argument, then wait 15 minutes (just to make absolutely sure the first argument has been evaluated!), before evaluating its second.

This is a situation where you should trust GHC to do the right thing. It may seem to you that the only way to realize the performance benefit of seq a b is for a to be evaluated to WHNF before evaluation of b starts, but it is conceivable that there are optimizations in this or other situations that technically start evaluating b (or even fully evaluate b to WHNF) while leaving a unevaluated for a short time to improve performance while still preserving the semantics of seq a b. By using pseq instead, you may prevent GHC from making such optimizations. (In your sum program situation, there undoubtedly is no such optimization, but in a more complex use of seq, there might be.)

Third, it's important to understand what pseq is actually for. It was first described in Marlow 2009 in the context of concurrent programming. Suppose we want to parallelize two expensive computations foo and bar and then combine (say, add) their results:

foo `par` (bar `seq` foo+bar)  -- parens redundant but included for clarity

The intention here is that -- when this expression's value is demanded -- it creates a spark to compute foo in parallel and then, via the seq expression, starts evaluating bar to WHNF (i.e., it's numeric value, say) before finally evaluating foo+bar which will wait on the spark for foo before adding and returning the results.

Here, it's conceivable that GHC will recognize that for a specific numeric type, (1) foo+bar automatically fails to terminate if bar does, satisfying the formal semantic guarantee of seq; and (2) evaluating foo+bar to WHNF will automatically force evaluation of bar to WHNF preventing any thunk accumulation and so satisfying the informal implementation guarantee of seq. In this situation, GHC may feel free to optimize the seq away to yield:

foo `par` foo+bar

particularly if it feels that it would be more performant to start evaluation of foo+bar before finishing evaluating bar to WHNF.

What GHC isn't smart enough to realize is that -- if evaluation of foo in foo+bar starts before the foo spark is scheduled, the spark will fizzle, and no parallel execution will occur.

It's really only in this case, where you need to explicitly delay demanding the value of a sparked expression to allow an opportunity for it to be scheduled before the main thread "catches up" that you need the extra guarantee of pseq and are willing to have GHC forgo additional optimization opportunities permitted by the weaker guarantee of seq:

foo `par` (bar `pseq` foo+bar)

Here, pseq will prevent GHC from introducing any optimization that might allow foo+bar to start evaluating (potentially fizzling the foo spark) before bar is in WHNF (which, we hope, allows enough time for the spark to be scheduled).

The upshot is that, if you're using pseq for anything other than concurrent programming, you're using it wrong. (Well, maybe there are some weird situations, but...) If all you want to do is force strict evaluation and/or thunk evaluation to improve performance in non-concurrent code, using seq (or $! which is defined in terms of seq or Haskell strict data types which are defined in terms of $!) is the correct approach.

(Or, if @Kindaro's benchmarks are to be believed, maybe merciless benchmarking with specific compiler versions and flags is the correct approach.)



回答2:

I only see such a difference with optimizations turned off. With ghc -O both pseq and seq perform the same.

The relaxed semantics of seq allow transformations resulting in slower code indeed. I can't think of a situation where that actually happens. We just assume GHC does the right thing. Unfortunately, we don't have a way to express that behavior in terms of a high-level semantics for Haskell.

Why pseq is slower?

pseq x y = x `seq` lazy y

pseq is thus implemented using seq. The observed overhead is due to the extra indirection of calling pseq.

Even if these ultimately get optimized away, it may not necessarily be a good idea to use pseq instead of seq. While the stricter ordering semantics seem to imply the intended effect (that go does not accumulate a thunk), it may disable some further optimizations: perhaps evaluating x and evaluating y can be decomposed into low-level operations, some of which we wouldn't mind to cross the pseq boundary.

Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?

This can throw either "a" or "b".

seq (error "a") (error "b")

I guess there is a rationale explained in the paper about exceptions in Haskell, A Semantics for imprecise exceptions.



回答3:

Edit: My theory foiled as the timings I observed were in fact heavily skewed by the influence of profiling itself; with profiling off, the data goes against the theory. Moreover, the timings vary quite a bit between versions of GHC. I am collecting better observations even now, and I will further edit this answer as I arrive to a conclusive point.


Concerning the question "why pseq is slower", I have a theory.

    • Let us re-phrase acc' `seq` go acc' xs as strict (go (strict acc') xs).
    • Similarly, acc' `pseq` go acc' xs is re-phrased as lazy (go (strict acc') xs).
    • Now, let us re-phrase go acc (x:xs) = let ... in ... to go acc (x:xs) = strict (go (x + acc) xs) in the case of seq.
    • And to go acc (x:xs) = lazy (go (x + acc) xs) in the case of pseq.

Now, it is easy to see that, in the case of pseq, go gets assigned a lazy thunk that will be evaluated at some later point. In the definition of sum, go never appears to the left of pseq, and thus, during the run of sum, the evaulation will not at all be forced. Moreover, this happens for every recursive call of go, so thunks accumulate.

This is a theory built from thin air, but I do have a partial proof. Specifically, I did find out that go allocates linear memory in pseq case but not in the case of seq. You may see for yourself if you run the following shell commands:

for file in SumNaive.hs SumPseq.hs SumSeq.hs 
do
    stack ghc                \
        --library-profiling  \
        --package parallel   \
        --                   \
        $file                \
        -main-is ${file%.hs} \
        -o ${file%.hs}       \
        -prof                \
        -fprof-auto
done

for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
    time ./${file%.hs} +RTS -P
done

-- And compare the memory allocation of the go cost centre.

COST CENTRE             ...  ticks     bytes
SumNaive.prof:sum.go    ...    782 559999984
SumPseq.prof:sum.go     ...    669 800000016
SumSeq.prof:sum.go      ...    161         0

postscriptum

Since there appears to be discord on the question of which optimizations actually play to what effect, I am putting my exact source code and time measures, so that there is a common baseline.

SumNaive.hs

module SumNaive where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs

main = print $ sum [1..10^7]

SumSeq.hs

module SumSeq where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs

main = print $ sum [1..10^7]

SumPseq.hs

module SumPseq where

import Prelude hiding (sum)
import Control.Parallel (pseq)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `pseq` go acc' xs

main = print $ sum [1..10^7]

Time without optimizations:

./SumNaive +RTS -P  4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P  0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P  2.19s user 0.22s system 99% cpu 2.408 total

Time with -O:

./SumNaive +RTS -P  0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P  0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P  1.91s user 0.24s system 99% cpu 2.147 total

Time with -O2:

./SumNaive +RTS -P  0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P  0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P  1.92s user 0.22s system 99% cpu 2.137 total

It may be seen that:

  • Naive variant has poor performance without optimizations, but excellent performance with either -O or -O2 -- to the extent that it outperforms all others.

  • seq variant has a good performance that's very little improved by optimizations, so that with either -O or -O2 the Naive variant outperforms it.

  • pseq variant has consistently poor performance, about twice better than Naive variant without optimization, and four times worse than others with either -O or -O2. Optimization affects it about as little as the seq variant.