Haskell/GHC performance of `any`/`all`

2020-04-09 06:26发布

问题:

I wrote quantification functions exists, forall, and none for Haskell's build-in [] list data type. On multiple occasions, these seemed to prove much more efficient than Prelude/Data.Lists any and all. I naively suspect that this performance is due to any and all being implemented using Θ(n) folds. Since I am relatively new to Haskell, I think I must be mistaken, or that there would be a good reason for this phenomenon.

From Data.Foldable:

-- | Determines whether any element of the structure satisfies the predicate.
any :: Foldable t => (a -> Bool) -> t a -> Bool
any p = getAny #. foldMap (Any #. p)

-- | Determines whether all elements of the structure satisfy the predicate.
all :: Foldable t => (a -> Bool) -> t a -> Bool
all p = getAll #. foldMap (All #. p)

My implementations:

exists :: (a -> Bool) -> [a] -> Bool
exists _    []                   = False
exists pred (x : xs) | pred x    = True
                     | otherwise = exists pred xs

and

forall pred  =  not . exists (not . pred)
none pred  =  not . exists pred  =  forall (not . pred)

Eliminating boolean inversions:

forall, none :: (a -> Bool) -> [a] -> Bool

forall _    []                   = True
forall pred (x : xs) | pred x    = forall pred xs
                     | otherwise = False

none _    []                   = True
none pred (x : xs) | pred x    = False
                   | otherwise = none pred xs

all:

time                 327.8 μs   (322.4 μs .. 333.0 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 328.7 μs   (324.1 μs .. 334.2 μs)
std dev              16.95 μs   (14.63 μs .. 22.02 μs)

and forall:

time                 113.2 μs   (111.2 μs .. 115.0 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 112.0 μs   (110.0 μs .. 113.9 μs)
std dev              6.333 μs   (5.127 μs .. 7.896 μs)

Performance measured using criterion's nf.


As anticipated, I did not reinvent the fold, but had underestimated compiler flags, and naively did not expect -O2 to make such drastic overall difference compared to default optimization level performance, nor the optimization effectiveness disparity between the individual custom-wrote and library formulations. Many highly effective specialized standard function optimizations evidently kick in only when explicitly enabled.

The "performance" section of the Haskell tag info emphasizes importance of optimization level compiler flags when testing code efficiency. It is generally advised to trust the sophistication of library function implementations, and, rather than rewire RULES pragmas or reformulate basic forms, to try exploiting already cultivated optimization potential.

回答1:

I find it instructive to re-implement any in the various ways:

import Prelude hiding (any)
import Criterion.Main
import Data.Foldable (foldMap)
import Data.Monoid

Your exists:

exists :: (a -> Bool) -> [a] -> Bool
exists _ [] = False
exists pred (x : xs)
    = if pred x
      then True
      else exists pred xs

A version using (||):

existsOr :: (a -> Bool) -> [a] -> Bool
existsOr _ [] = False
existsOr pred (x : xs) = pred x || existsOr pred xs

Using foldr:

any :: (a -> Bool) -> [a] -> Bool
any pred = foldr ((||) . pred) False

Using foldr and Any:

anyF :: (a -> Bool) -> [a] -> Bool
anyF pred = getAny . foldr (mappend . (Any . pred)) mempty

Using foldMap and Any:

anyFM :: (a -> Bool) -> [a] -> Bool
anyFM pred = getAny . foldMap (Any . pred)

Benchmarks with ghc -O0:

benchmarking exists
time                 1.552 μs   (1.504 μs .. 1.593 μs)
                     0.989 R²   (0.983 R² .. 0.993 R²)
mean                 1.482 μs   (1.427 μs .. 1.545 μs)
std dev              196.1 ns   (168.8 ns .. 229.2 ns)
variance introduced by outliers: 93% (severely inflated)

benchmarking existsOr
time                 2.699 μs   (2.616 μs .. 2.768 μs)
                     0.992 R²   (0.988 R² .. 0.995 R²)
mean                 2.629 μs   (2.554 μs .. 2.704 μs)
std dev              277.8 ns   (235.8 ns .. 351.1 ns)
variance introduced by outliers: 89% (severely inflated)

benchmarking any
time                 5.551 μs   (5.354 μs .. 5.777 μs)
                     0.990 R²   (0.986 R² .. 0.995 R²)
mean                 5.553 μs   (5.395 μs .. 5.750 μs)
std dev              584.2 ns   (447.5 ns .. 835.5 ns)
variance introduced by outliers: 88% (severely inflated)

benchmarking anyF
time                 7.330 μs   (7.081 μs .. 7.612 μs)
                     0.988 R²   (0.982 R² .. 0.994 R²)
mean                 7.502 μs   (7.272 μs .. 7.762 μs)
std dev              848.2 ns   (712.6 ns .. 1.022 μs)
variance introduced by outliers: 89% (severely inflated)

benchmarking anyFM
time                 5.668 μs   (5.451 μs .. 6.008 μs)
                     0.987 R²   (0.975 R² .. 0.996 R²)
mean                 5.807 μs   (5.659 μs .. 5.975 μs)
std dev              542.5 ns   (446.4 ns .. 721.8 ns)
variance introduced by outliers: 86% (severely inflated)

Your version (exists) is indeed the fastest, and the foldr versions are rather slow.

With ghc -O2, your version (exists) is the slowest, and all other functions are nearly equally fast to each other:

benchmarking exists
time                 753.5 ns   (725.4 ns .. 779.9 ns)
                     0.990 R²   (0.986 R² .. 0.995 R²)
mean                 762.4 ns   (737.0 ns .. 787.0 ns)
std dev              82.47 ns   (66.79 ns .. 105.1 ns)
variance introduced by outliers: 91% (severely inflated)

benchmarking existsOr
time                 491.5 ns   (478.2 ns .. 503.2 ns)
                     0.994 R²   (0.992 R² .. 0.996 R²)
mean                 494.5 ns   (481.1 ns .. 512.9 ns)
std dev              54.97 ns   (42.54 ns .. 80.34 ns)
variance introduced by outliers: 92% (severely inflated)

benchmarking any
time                 461.2 ns   (442.0 ns .. 479.7 ns)
                     0.989 R²   (0.985 R² .. 0.993 R²)
mean                 456.0 ns   (439.3 ns .. 476.3 ns)
std dev              60.04 ns   (47.27 ns .. 89.47 ns)
variance introduced by outliers: 94% (severely inflated)

benchmarking anyF
time                 436.9 ns   (415.8 ns .. 461.0 ns)
                     0.978 R²   (0.967 R² .. 0.988 R²)
mean                 450.8 ns   (430.1 ns .. 472.6 ns)
std dev              70.64 ns   (57.04 ns .. 85.92 ns)
variance introduced by outliers: 96% (severely inflated)

benchmarking anyFM
time                 438.9 ns   (426.9 ns .. 449.5 ns)
                     0.993 R²   (0.989 R² .. 0.996 R²)
mean                 435.8 ns   (421.4 ns .. 447.6 ns)
std dev              45.32 ns   (36.73 ns .. 58.74 ns)
variance introduced by outliers: 90% (severely inflated)

If one looks into the simplified Core code (ghc -O2 -ddump-simpl), one sees that there are no foldrs anymore (with -O0, everything is still in there, folds included).

I would thus venture to say that your code is faster (in the un-optimized version, -O0) than the library code, because it is simpler (for the potential price of being less general). The optimized library code is faster than your version, since it is written in a way that its optimization potential is recognized by the compiler. (admittedly, this is a bit of guess work)