How to apply data parallelim on haskell Fast Fouri

2019-03-18 07:43发布

I have a haskell code to resolve a Fast Fourier Transformation, and i want to apply data parallelism on it. However, every strategy that i use generate too many sparks and most of them are being overflowed.

Does anyone have any idea on how to apply a good data parallelism strategy on the following algorithm:

-- radix 2 Cooley-Tukey FFT

fft::[Complex Float] -> [Complex Float]
fft [a] = [a]
fft as = interleave ls rs
  where
    (cs,ds) = bflyS as
    ls = fft cs
    rs = fft ds

interleave [] bs = bs
interleave (a:as) bs = a : interleave bs as

bflyS :: [Complex Float] -> ([Complex Float], [Complex Float])
bflyS as = (los,rts)
  where
    (ls,rs) = halve as
    los = zipWith (+) ls rs
    ros = zipWith (-) ls rs
    rts = zipWith (*) ros [tw (length as) i | i <- [0..(length ros) - 1]]

halve as = splitAt n' as
  where
    n' = div (length as + 1) 2

-- twiddle factors
tw :: Int -> Int -> Complex Float
tw n k = cis (-2 * pi * fromIntegral k / fromIntegral n)

PAR MONAD

The answer from leftaroundabout helped me a lot about understanging on how to apply data parallelism on the code. However, i have studied the par monad and tried to apply task parallelism to it. The problem is that it is running way slower than the original bflyS. I think the code i developed is way to expensive to create threads comparing to the relative work I am doing. Does anyone know how to use the par monad in a better way ?

--New Task Parallelism bflyS

bflySTask :: [Complex Float] -> ([Complex Float], [Complex Float])
bflySTask as = runPar $ do
    let (ls, rs) = halve as
    v1<-new
    v2<-new
    let ros = DATA.zipWith (-) ls rs
    let aux = DATA.map  (tw n) [0..n-1]
    fork $ put v1 (DATA.zipWith (+) ls rs)
    fork $ put v2 (DATA.zipWith (*) ros aux)
    los <- get v1
    rts <- get v2   
    return (los, rts)
        where
                n = DATA.length as

1条回答
对你真心纯属浪费
2楼-- · 2019-03-18 08:14

First off: there's a lot of optimisation to be done here before I'd start to think about parallelism:

  • Lists rock, but their non-consecutive memory model means they just can't allow for traversals nearly as fast as what you can achieve with tight arrays such as Data.Vector, because you inevitably end up with lots of cache misses. Indeed I've seldom seen a list-based algorithm to gain much from parallelisation, because they're bound by memory- rather than CPU performance.

  • Your twiddle factors are computed over and over again, you can obviously gain a lot through memoisation here.

  • You keep on calling length, but that's an extremely wasteful function (O (n) for something that could be O (1)). Use some container that probably handles length; lists aren't meant to (we like to keep their ability to be infinite).

The parallelisation itself will be pretty simple; I'd check on the length as suggested by John L, indeed I'd rather require a pretty large size before sparking a thread, at least something like 256: as the performance probably becomes crucial only at sizes of several thousands, this should sill be enough threads to keep your cores busy.

import Data.Vector.Unboxed as UBV
import Control.Parallel.Strategies

type ℂ = Complex Float

fft' :: UBV.Vector ℂ -> UBV.Vector ℂ
fft' aₓs = interleave lᵥs rᵥs
 where (lᵥs, rᵥs) = (fft lₓs, fft rₓs)
                     `using` if UBV.length aₓs > 256 then parTuple2 else r0
       (lₓs, rₓs) = byflyS aₓs
查看更多
登录 后发表回答