In one discussion I heard that Applicative
interface of some parsers is implemented differently, more efficiently than their Monad
interface. The reason is that with Applicative
we know all "effects" in advance, before the whole effectful computation is run. With monads, effects can depend on values during the computation so this optimization is not possible.
I'd like to see some good examples of this. It can be some very simple parser or some different monad, that's not important. The important thing is that the Applicative
interface of such a monad complies with its return
and ap
, but using the Applicative
produces more efficient code.
Update: Just to clarify, here I'm not interested in applicatives that can't be monads. The question is about things that are both.
Another example is a strict left fold. You can write an applicative instance which allows you to compose folds so that the resulting fold can be performed on the data in a single pass and constant space. However, the monad instance needs to re-iterate from the beginning of the data for each bind and keep the whole list in memory.
{-# LANGUAGE GADTs #-}
import Criterion.Main
import Data.Monoid
import Control.Applicative
import Control.Monad
import Prelude hiding (sum)
data Fold e r where
Step :: !(a -> e -> a) -> !a -> !(a -> r) -> Fold e r
Bind :: !(Fold e r) -> !(r -> Fold e s) -> Fold e s
data P a b = P !a !b
instance Functor (Fold e) where
fmap f (Step step acc ret) = Step step acc (f . ret)
fmap f (Bind fld g) = Bind fld (fmap f . g)
instance Applicative (Fold e) where
pure a = Step const a id
Step fstep facc fret <*> Step xstep xacc xret = Step step acc ret where
step (P fa xa) e = P (fstep fa e) (xstep xa e)
acc = P facc xacc
ret (P fa xa) = (fret fa) (xret xa)
Bind fld g <*> fldx = Bind fld ((<*> fldx) . g)
fldf <*> Bind fld g = Bind fld ((fldf <*>) . g)
instance Monad (Fold e) where
return = pure
(>>=) = Bind
fold :: Fold e r -> [e] -> r
fold (Step _ acc ret) [] = ret acc
fold (Step step acc ret) (x:xs) = fold (Step step (step acc x) ret) xs
fold (Bind fld g) lst = fold (g $ fold fld lst) lst
monoidalFold :: Monoid m => (e -> m) -> (m -> r) -> Fold e r
monoidalFold f g = Step (\a -> mappend a . f) mempty g
count :: Num n => Fold e n
count = monoidalFold (const (Sum 1)) getSum
sum :: Num n => Fold n n
sum = monoidalFold Sum getSum
avgA :: Fold Double Double
avgA = liftA2 (/) sum count
avgM :: Fold Double Double
avgM = liftM2 (/) sum count
main :: IO ()
main = defaultMain
[ bench "Monadic" $ nf (test avgM) 1000000
, bench "Applicative" $ nf (test avgA) 1000000
] where test f n = fold f [1..n]
I wrote the above from the top of my head as an example so it might not be the optimal implementation for applicative and monadic folds, but running the above gives me:
benchmarking Monadic
mean: 119.3114 ms, lb 118.8383 ms, ub 120.2822 ms, ci 0.950
std dev: 3.339376 ms, lb 2.012613 ms, ub 6.215090 ms, ci 0.950
benchmarking Applicative
mean: 51.95634 ms, lb 51.81261 ms, ub 52.15113 ms, ci 0.950
std dev: 850.1623 us, lb 667.6838 us, ub 1.127035 ms, ci 0.950
Perhaps the canonical example is given by the vectors.
data Nat = Z | S Nat deriving (Show, Eq, Ord)
data Vec :: Nat -> * -> * where
V0 :: Vec Z x
(:>) :: x -> Vec n x -> Vec (S n) x
We can make them applicative with a little effort, first defining singletons, then wrapping them in a class.
data Natty :: Nat -> * where
Zy :: Natty Z
Sy :: Natty n -> Natty (S n)
class NATTY (n :: Nat) where
natty :: Natty n
instance NATTY Z where
natty = Zy
instance NATTY n => NATTY (S n) where
natty = Sy natty
Now we may develop the Applicative
structure
instance NATTY n => Applicative (Vec n) where
pure = vcopies natty
(<*>) = vapp
vcopies :: forall n x. Natty n -> x -> Vec n x
vcopies Zy x = V0
vcopies (Sy n) x = x :> vcopies n x
vapp :: forall n s t. Vec n (s -> t) -> Vec n s -> Vec n t
vapp V0 V0 = V0
vapp (f :> fs) (s :> ss) = f s :> vapp fs ss
I omit the Functor
instance (which should be extracted via fmapDefault
from the Traversable
instance).
Now, there is a Monad
instance corresponding to this Applicative
, but what is it? Diagonal thinking! That's what's required! A vector can be seen as the tabulation of a function from a finite domain, hence the Applicative
is just a tabulation of the K- and S-combinators, and the Monad
has a Reader
-like behaviour.
vtail :: forall n x. Vec (S n) x -> Vec n x
vtail (x :> xs) = xs
vjoin :: forall n x. Natty n -> Vec n (Vec n x) -> Vec n x
vjoin Zy _ = V0
vjoin (Sy n) ((x :> _) :> xxss) = x :> vjoin n (fmap vtail xxss)
instance NATTY n => Monad (Vec n) where
return = vcopies natty
xs >>= f = vjoin natty (fmap f xs)
You might save a bit by defining >>=
more directly, but any way you cut it, the monadic behaviour creates useless thunks for off-diagonal computations. Laziness might save us from slowing down by an armageddon factor, but the zipping behaviour of the <*>
is bound to be at least a little cheaper than taking the diagonal of a matrix.
As pigworker said, arrays are the obvious example; their monad instance is not just a bit more problematic on the conceptual level with type-indexed lengths etc., but also performs worse in the very much real-worldly Data.Vector
implementation:
import Criterion.Main
import Data.Vector as V
import Control.Monad
import Control.Applicative
functions :: V.Vector (Int -> Int)
functions = V.fromList [(+1), (*2), (subtract 1), \x -> x*x]
values :: V.Vector Int
values = V.enumFromN 1 32
type NRuns = Int
apBencher :: (V.Vector (Int -> Int) -> V.Vector Int -> V.Vector Int)
-> NRuns -> Int
apBencher ap' = run values
where run arr 0 = V.sum arr
run arr n = run (functions `ap'` arr) $ n-1
main = defaultMain
[ bench "Monadic" $ nf (apBencher ap ) 4
, bench "Applicative" $ nf (apBencher (<*>)) 4 ]
$ ghc-7.6 -O1 -o -fllvm -o bin/bench-d0 def0.hs
$ bench-d0
warming up
estimating clock resolution...
mean is 1.516271 us (640001 iterations)
found 3768 outliers among 639999 samples (0.6%)
2924 (0.5%) high severe
estimating cost of a clock call...
mean is 41.62906 ns (12 iterations)
found 1 outliers among 12 samples (8.3%)
1 (8.3%) high severe
benchmarking Monadic
mean: 2.773062 ms, lb 2.769786 ms, ub 2.779151 ms, ci 0.950
std dev: 22.14540 us, lb 13.55686 us, ub 36.88265 us, ci 0.950
benchmarking Applicative
mean: 1.269351 ms, lb 1.267654 ms, ub 1.271526 ms, ci 0.950
std dev: 9.799454 us, lb 8.171284 us, ub 13.09267 us, ci 0.950
Note that it doesn't come out with the performance difference when you compile with -O2
; apparently ap
is replaced by <*>
then. But >>=
can only allocate the right amount of memory after each function call and then put the results in place, which appears to be quite time-expensive; whereas <*>
can simply precompute the result length as the product of functions
and values
lengths, and then write to one fixed array.