The documentation of seq
function says the following:
A note on evaluation order: the expression seq a b
does not guarantee that a
will be evaluated before b
. The only guarantee given by seq
is that the both a
and b
will be evaluated before seq
returns a value. In particular, this means that b
may be evaluated before a
. If you need to guarantee a specific order of evaluation, you must use the function pseq
from the "parallel" package.
So I have a lazy version of sum
function with accumulator:
sum :: Num a => [a] -> a
sum = go 0
where
go acc [] = acc
go acc (x:xs) = go (x + acc) xs
Obviously, this is extremely slow on big lists. Now I'm rewriting this function using seq
:
sum :: Num a => [a] -> a
sum = go 0
where
go acc [] = acc
go acc (x:xs) = let acc' = x + acc
in acc' `seq` go acc' xs
And I see huge performance increase! But I wonder how reliable it is? Did I get it by luck? Because GHC can evaluate recursive call first (according to documentation) and still accumulate thunks. It looks like I need to use pseq
to ensure that acc'
is always evaluated before recursive call. But with pseq
I see performance decrease in compare to seq
version. Numbers on my machine (for calculating sum [1 .. 10^7]
:
- naive:
2.6s
seq
: 0.2s
pseq
: 0.5s
I'm using GHC-8.2.2 and I compile with stack ghc -- File.hs
command.
After I tried to compile with stack ghc -- -O File.hs
command performance gap between seq
and pseq
is gone. They now both run in 0.2s
.
So does my implementation exhibit the properties I want? Or does GHC has some implementation quirk? Why is pseq
slower? Does there exist some example where seq a b
has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
The answers so far have focused on the seq
versus pseq
performance issues, but I think you originally wanted to know which of the two you should use.
The short answer is: while both should generate nearly identically performing code in practice (at least when proper optimization flags are turned on), the primitive seq
, and not pseq
, is the correct choice for your situation. Using pseq
is non-idiomatic, confusing, and potentially counterproductive from a performance standpoint, and your reason for using it is based on a flawed understand of what its order-of-evaluation guarantee means and what it implies with respect to performance. While there are no guarantees about performance across different sets of compiler flags (much less across other compilers), if you ever run into a situation where the seq
version of the above code runs significantly slower than the pseq
version using "production quality" optimization flags with the GHC compiler, you should consider it a GHC bug and file a bug report.
The long answer is, of course, longer...
First, let's be clear that seq
and pseq
are semantically identical in the sense that they both satisfy the equations:
seq _|_ b = _|_
seq a b = b -- if a is not _|_
pseq _|_ b = _|_
pseq a b = b -- if a is not _|_
This is really the only thing that either of them guarantees semantically, and since the definition of the Haskell language (as given in the Haskell report say) only makes -- at best -- semantic guarantees and does not deal with performance or implementation, there's no reason to choose between one or the other for reasons of guaranteed performance across different compilers or compiler flags.
Furthermore, in your particular seq
-based version of the function sum
, it's not too difficult to see that there is no situation in which seq
is called with an undefined first argument but a defined second argument (assuming the use of a standard numeric type), so you aren't even using the semantic properties of seq
. You could re-define seq
as seq a b = b
and have exactly the same semantics. Of course, you know this -- that's why your first version didn't use seq
. Instead, you're using seq
for an incidental performance side-effect, so we're out of the realm of semantic guarantees and back in the realm of specific GHC compiler implementation and performance characteristics (where there aren't really any guarantees to speak of).
Second, that brings us to the intended purpose of seq
. It is rarely used for its semantic properties because those properties aren't very useful. Who would want a computation seq a b
to return b
except that it should fail to terminate if some unrelated expression a
fails to terminate? (The exceptions -- no pun intended -- would be things like handling exceptions, where you might use seq
or deepSeq
which is based on seq
to force evaluation of a non-terminating expression in either an uncontrolled or controlled way, before starting evaluation of another expression.)
Instead, seq a b
is intended to force evaluation of a
to weak head normal form before returning the result of b
to prevent accumulation of thunks. The idea is, if you have an expression b
which builds a thunk that could potentially accumulate on top of another unevaluated thunk represented by a
, you can prevent that accumulation by using seq a b
. The "guarantee" is a weak one: GHC guarantees that it understands you don't want a
to remain an unevaluated thunk when seq a b
's value is demanded. Technically, it doesn't guarantee that a
will be "evaluated before" b
, whatever that means, but you don't need that guarantee. When you worry that, without this guarantee, GHC might evaluate the recursive call first and still accumulate thunks, this is as ridiculous as worrying that pseq a b
might evaluate its first argument, then wait 15 minutes (just to make absolutely sure the first argument has been evaluated!), before evaluating its second.
This is a situation where you should trust GHC to do the right thing. It may seem to you that the only way to realize the performance benefit of seq a b
is for a
to be evaluated to WHNF before evaluation of b
starts, but it is conceivable that there are optimizations in this or other situations that technically start evaluating b
(or even fully evaluate b
to WHNF) while leaving a
unevaluated for a short time to improve performance while still preserving the semantics of seq a b
. By using pseq
instead, you may prevent GHC from making such optimizations. (In your sum
program situation, there undoubtedly is no such optimization, but in a more complex use of seq
, there might be.)
Third, it's important to understand what pseq
is actually for. It was first described in Marlow 2009 in the context of concurrent programming. Suppose we want to parallelize two expensive computations foo
and bar
and then combine (say, add) their results:
foo `par` (bar `seq` foo+bar) -- parens redundant but included for clarity
The intention here is that -- when this expression's value is demanded -- it creates a spark to compute foo
in parallel and then, via the seq
expression, starts evaluating bar
to WHNF (i.e., it's numeric value, say) before finally evaluating foo+bar
which will wait on the spark for foo
before adding and returning the results.
Here, it's conceivable that GHC will recognize that for a specific numeric type, (1) foo+bar
automatically fails to terminate if bar
does, satisfying the formal semantic guarantee of seq
; and (2) evaluating foo+bar
to WHNF will automatically force evaluation of bar
to WHNF preventing any thunk accumulation and so satisfying the informal implementation guarantee of seq
. In this situation, GHC may feel free to optimize the seq
away to yield:
foo `par` foo+bar
particularly if it feels that it would be more performant to start evaluation of foo+bar
before finishing evaluating bar
to WHNF.
What GHC isn't smart enough to realize is that -- if evaluation of foo
in foo+bar
starts before the foo
spark is scheduled, the spark will fizzle, and no parallel execution will occur.
It's really only in this case, where you need to explicitly delay demanding the value of a sparked expression to allow an opportunity for it to be scheduled before the main thread "catches up" that you need the extra guarantee of pseq
and are willing to have GHC forgo additional optimization opportunities permitted by the weaker guarantee of seq
:
foo `par` (bar `pseq` foo+bar)
Here, pseq
will prevent GHC from introducing any optimization that might allow foo+bar
to start evaluating (potentially fizzling the foo
spark) before bar
is in WHNF (which, we hope, allows enough time for the spark to be scheduled).
The upshot is that, if you're using pseq
for anything other than concurrent programming, you're using it wrong. (Well, maybe there are some weird situations, but...) If all you want to do is force strict evaluation and/or thunk evaluation to improve performance in non-concurrent code, using seq
(or $!
which is defined in terms of seq
or Haskell strict data types which are defined in terms of $!
) is the correct approach.
(Or, if @Kindaro's benchmarks are to be believed, maybe merciless benchmarking with specific compiler versions and flags is the correct approach.)
I only see such a difference with optimizations turned off.
With ghc -O
both pseq
and seq
perform the same.
The relaxed semantics of seq
allow transformations resulting in slower code indeed. I can't think of a situation where that actually happens. We just assume GHC does the right thing. Unfortunately, we don't have a way to express that behavior in terms of a high-level semantics for Haskell.
Why pseq is slower?
pseq x y = x `seq` lazy y
pseq
is thus implemented using seq
. The observed overhead is due to the extra indirection of calling pseq
.
Even if these ultimately get optimized away, it may not necessarily be a good idea to use pseq
instead of seq
. While the stricter ordering semantics seem to imply the intended effect (that go
does not accumulate a thunk), it may disable some further optimizations: perhaps evaluating x
and evaluating y
can be decomposed into low-level operations, some of which we wouldn't mind to cross the pseq
boundary.
Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
This can throw either "a"
or "b"
.
seq (error "a") (error "b")
I guess there is a rationale explained in the paper about exceptions in Haskell, A Semantics for imprecise exceptions.
Edit: My theory foiled as the timings I observed were in fact heavily skewed by the influence of profiling itself; with profiling off, the data goes against the theory. Moreover, the timings vary quite a bit between versions of GHC. I am collecting better observations even now, and I will further edit this answer as I arrive to a conclusive point.
Concerning the question "why pseq
is slower", I have a theory.
- Let us re-phrase
acc' `seq` go acc' xs
as strict (go (strict acc') xs)
.
- Similarly,
acc' `pseq` go acc' xs
is re-phrased as lazy (go (strict acc') xs)
.
- Now, let us re-phrase
go acc (x:xs) = let ... in ...
to go acc (x:xs) = strict (go (x + acc) xs)
in the case of seq
.
- And to
go acc (x:xs) = lazy (go (x + acc) xs)
in the case of pseq
.
Now, it is easy to see that, in the case of pseq
, go
gets assigned a lazy thunk that will be evaluated at some later point. In the definition of sum
, go
never appears to the left of pseq
, and thus, during the run of sum
, the evaulation will not at all be forced. Moreover, this happens for every recursive call of go
, so thunks accumulate.
This is a theory built from thin air, but I do have a partial proof. Specifically, I did find out that go
allocates linear memory in pseq
case but not in the case of seq
. You may see for yourself if you run the following shell commands:
for file in SumNaive.hs SumPseq.hs SumSeq.hs
do
stack ghc \
--library-profiling \
--package parallel \
-- \
$file \
-main-is ${file%.hs} \
-o ${file%.hs} \
-prof \
-fprof-auto
done
for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
time ./${file%.hs} +RTS -P
done
-- And compare the memory allocation of the go
cost centre.
COST CENTRE ... ticks bytes
SumNaive.prof:sum.go ... 782 559999984
SumPseq.prof:sum.go ... 669 800000016
SumSeq.prof:sum.go ... 161 0
postscriptum
Since there appears to be discord on the question of which optimizations actually play to what effect, I am putting my exact source code and time
measures, so that there is a common baseline.
SumNaive.hs
module SumNaive where
import Prelude hiding (sum)
sum :: Num a => [a] -> a
sum = go 0
where
go acc [] = acc
go acc (x:xs) = go (x + acc) xs
main = print $ sum [1..10^7]
SumSeq.hs
module SumSeq where
import Prelude hiding (sum)
sum :: Num a => [a] -> a
sum = go 0
where
go acc [] = acc
go acc (x:xs) = let acc' = x + acc
in acc' `seq` go acc' xs
main = print $ sum [1..10^7]
SumPseq.hs
module SumPseq where
import Prelude hiding (sum)
import Control.Parallel (pseq)
sum :: Num a => [a] -> a
sum = go 0
where
go acc [] = acc
go acc (x:xs) = let acc' = x + acc
in acc' `pseq` go acc' xs
main = print $ sum [1..10^7]
Time without optimizations:
./SumNaive +RTS -P 4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P 0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P 2.19s user 0.22s system 99% cpu 2.408 total
Time with -O
:
./SumNaive +RTS -P 0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P 0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P 1.91s user 0.24s system 99% cpu 2.147 total
Time with -O2
:
./SumNaive +RTS -P 0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P 0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P 1.92s user 0.22s system 99% cpu 2.137 total
It may be seen that:
Naive variant has poor performance without optimizations, but excellent performance with either -O
or -O2
-- to the extent that it outperforms all others.
seq
variant has a good performance that's very little improved by optimizations, so that with either -O
or -O2
the Naive variant outperforms it.
pseq
variant has consistently poor performance, about twice better than Naive variant without optimization, and four times worse than others with either -O
or -O2
. Optimization affects it about as little as the seq
variant.