parallel “Folding” in Haskell

2020-02-12 03:30发布

问题:

I have a function with type below:

union :: a -> a -> a

And a has additivity property. So we can regard union as a version of (+)

Say, we have [a], and want to perform a parallel "folding", for non-parallel foldling we can do only:

foldl1' union [a]

But how to perform it in parallel? I can demonstrate problem on Num values and (+) function.

For example, we have a list [1,2,3,4,5,6] and (+) In parallel we should split

[1,2,3] (+) [4,5,6]
[1,2] (+) [3] (+) [4,5] (+) [6]
([1] (+) [2]) (+) ([3] (+) [4]) (+) ([5] (+) [6])

then each (+) operation we want to perform in parallel, and combine to answer

[3] (+) [7] (+) [11] = 21

Note, that we split list, or perform operations in any order, because of a additivity.

Is there any ways to do that using any standard library?

回答1:

You need to generalize your union to any associative binary operator ⊕ such that (a ⊕ b) ⊕ c == a ⊕ (b ⊕ c). If at the same time you even have a unit element that is neutral with respect to ⊕, you have a monoid.

The important aspect of associativity is that you can arbitrarily group chunks of consecutive elements in a list, and ⊕ them in any order, since a ⊕ (b ⊕ (c ⊕ d)) == (a ⊕ b) ⊕ (c ⊕ d) - each bracket can be computed in parallel; then you'd need to "reduce" the "sums" of all brackets, and you've got your map-reduce sorted.

In order for this parallellisation to make sense, you need the chunking operation to be faster than ⊕ - otherwise, doing ⊕ sequentially is better than chunking. One such case is when you have a random access "list" - say, an array. Data.Array.Repa has plenty of parallellized folding functions.

If you are thinking of practicising to implement one yourself, you need to pick a good complex function ⊕ such that the benefit will show.

For example:

import Control.Parallel
import Data.List

pfold :: (Num a, Enum a) => (a -> a -> a) -> [a] -> a
pfold _ [x] = x
pfold mappend xs  = (ys `par` zs) `pseq` (ys `mappend` zs) where
  len = length xs
  (ys', zs') = splitAt (len `div` 2) xs
  ys = pfold mappend ys'
  zs = pfold mappend zs'

main = print $ pfold (+) [ foldl' (*) 1 [1..x] | x <- [1..5000] ]
  -- need a more complicated computation than (+) of numbers
  -- so we produce a list of products of many numbers

Here I deliberately use a associative operation, which is called mappend only locally, to show it can work for a weaker notion than a monoid - only associativity matters for parallelism; since parallelism makes sense only for non-empty lists anyway, no need for mempty.

ghc -O2 -threaded a.hs
a +RTS -N1 -s

Gives 8.78 seconds total run time, whereas

a +RTS -N2 -s

Gives 5.89 seconds total run time on my dual core laptop. Obviously, no point trying more than -N2 on this machine.



回答2:

What you've described is essentially a monoid. In GHCI:

Prelude> :m + Data.Monoid
Prelude Data.Monoid> :info Monoid
class Monoid a where
  mempty :: a
  mappend :: a -> a -> a
  mconcat :: [a] -> a

As you can see a monoid has three associated functions:

  1. The mempty function is sort of like the identity function of the monoid. For example a Num can behave as a monoid apropos two operations: sum and product. For a sum mempty is defined as 0. For a product mempty is defined as 1.

    mempty `mappend` a = a
    a `mappend` mempty = a
    
  2. The mappend function is similar to your union function. For exampe for a sum of Nums mappend is defined as (+) and for a product of Nums mappend is defined as (*).

  3. The mconcat function is similar to a fold. However because of the properties of a monoid it doesn't matter whether we fold from the left, fold from the right or fold from an arbitrary position. This is because mappend is supposed to be associative:

    (a `mappend` b) `mappend` c =  a `mappend` (b `mappend` c)
    

Note however that Haskell doesn't enforce the monoid laws. Hence if you make a type an instance of the Monoid typeclass then you're responsible to ensure that it satisfies the monoid laws.

In your case it's difficult to understand how union behaves from its type signature: a -> a -> a. Surely you can't make a type variable an instance of a typeclass. That's not allowed. You need to be more specific. What does union actually do?

To give you an example of how to make a type an instance of the monoid typeclass:

newtype Sum a = Sum { getSum :: a }

instance Num a => Monoid (Sum a) where
    mempty = 0
    mappend = (+)

That's it. We don't need to define the mconcat function because that has a default implementation that depends upon mempty and mappend. Hence when we define mempty and mappend we get mconcat for free.

Now you can use Sum as follows:

getSum . mconcat $ map Sum [1..6]

This is what's happening:

  1. You're mapping the Sum constructor over [1..6] to produce [Sum 1, Sum 2, Sum 3, Sum 4, Sum 5, Sum 6].
  2. You give the resulting list to mconcat which folds it to Sum 21.
  3. You use getSum to extract the Num from Sum 21.

Note however that the default implementation of mconcat is foldr mappend mempty (i.e. it's a right fold). For most cases the default implementation is sufficient. However in your case you might want to override the default implementation:

foldParallel :: Monoid a => [a] -> a
foldParallel []  = mempty
foldParallel [a] = a
foldParallel xs  = foldParallel left `mappend` foldParallel right
    where size = length xs
          index = (size + size `mod` 2) `div` 2
          (left, right) = splitAt index xs

Now we can create a new instance of Monoid as follows:

data Something a = Something { getSomething :: a }

instance Monoid (Something a) where
    mempty  = unionEmpty
    mappend = union
    mconcat = foldParallel

We use it as follows:

getSomething . mconcat $ map Something [1..6]

Note that I defined mempty as unionEmpty. I don't know what type of data the union function acts on. Hence I don't know what mempty should be defined as. Thus I'm simply calling it unionEmpty. Define it as you see fit.