Fast sorting in Haskell

After reading Stack Overflow question Using vectors for performance improvement in Haskell describing a fast in-place quicksort in Haskell, I set myself two goals:

Implementing the same algorithm with a median of three to avoid bad performances on pre-sorted vectors;
Making a parallel version.

Here is the result (some minor pieces have been left for simplicity):

import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Generic.Mutable as GM

type Vector = MV.IOVector Int
type Sort = Vector -> IO ()

medianofthreepartition :: Vector -> Int -> IO Int
medianofthreepartition uv li = do
    p1 <- MV.unsafeRead uv li
    p2 <- MV.unsafeRead uv $ li `div` 2
    p3 <- MV.unsafeRead uv 0
    let p = median p1 p2 p3
    GM.unstablePartition (< p) uv

vquicksort :: Sort
vquicksort uv = do
    let li = MV.length uv - 1
    j <- medianofthreepartition uv li
    when (j > 1) (vquicksort (MV.unsafeSlice 0 j uv))
    when (j + 1 < li) (vquicksort (MV.unsafeSlice (j+1) (li-j) uv))

vparquicksort :: Sort
vparquicksort uv = do
    let li = MV.length uv - 1
    j <- medianofthreepartition uv li
    t1 <- tryfork (j > 1) (vparquicksort (MV.unsafeSlice 0 j uv))
    t2 <- tryfork (j + 1 < li) (vparquicksort (MV.unsafeSlice (j+1) (li-j) uv))
    wait t1
    wait t2

tryfork :: Bool -> IO () -> IO (Maybe (MVar ()))
tryfork False _ = return Nothing
tryfork True action = do
  done <- newEmptyMVar :: IO (MVar ())
  _ <- forkFinally action (\_ -> putMVar done ())
  return $ Just done

wait :: Maybe (MVar ()) -> IO ()
wait Nothing = return ()
wait (Just done) = swapMVar done ()

median :: Int -> Int -> Int -> Int
median a b c
        | a > b =
                if b > c then b
                        else if a > c then c
                                else a
        | otherwise =
                if a > c then a
                        else if b > c then c
                                else b

For vectors with 1,000,000 elements, I get the following results:

"Number of threads: 4"

"**** Parallel ****"
"Testing sort with length: 1000000"
"Creating vector"
"Printing vector"
"Sorting random vector"
CPU time:  12.30 s
"Sorting ordered vector"
CPU time:   9.44 s

"**** Single thread ****"
"Testing sort with length: 1000000"
"Creating vector"
"Printing vector"
"Sorting random vector"
CPU time:   0.27 s
"Sorting ordered vector"
CPU time:   0.39 s

My questions are:

Why are performances still decreasing with a pre-sorted vector?
Why does using forkIO and four thread fails to improve performances?

A better idea is to use Control.Parallel.Strategies to parallelize quicksort. With this approach you will not create expensive threads for every code that can be executed in parallel. You can also create a pure computation instead an IO.

Then you have to compile according to the number of cores you have: http://www.haskell.org/ghc/docs/latest/html/users_guide/using-concurrent.html

For an example, look at this simple quicksort on lists, written by Jim Apple:

import Data.HashTable as H
import Data.Array.IO
import Control.Parallel.Strategies
import Control.Monad
import System

exch a i r =
    do tmpi <- readArray a i
       tmpr <- readArray a r
       writeArray a i tmpr
       writeArray a i tmpi

bool a b c = if c then a else b

quicksort arr l r =
  if r <= l then return () else do
    i <- loop (l-1) r =<< readArray arr r
    exch arr i r
    withStrategy rpar $ quicksort arr l (i-1)
    quicksort arr (i+1) r
  where
    loop i j v = do
      (i', j') <- liftM2 (,) (find (>=v) (+1) (i+1)) (find (<=v) (subtract 1) (j-1))
      if (i' < j') then exch arr i' j' >> loop i' j' v
                   else return i'
    find p f i = if i == l then return i
                 else bool (return i) (find p f (f i)) . p =<< readArray arr i

main = 
    do [testSize] <- fmap (fmap read) getArgs
       arr <- testPar testSize
       ans <- readArray arr  (testSize `div` 2)
       print ans

testPar testSize =
    do x <- testArray testSize
       quicksort x 0 (testSize - 1)
       return x

testArray :: Int -> IO (IOArray Int Double)
testArray testSize = 
    do ans <- newListArray (0,testSize-1) [fromIntegral $ H.hashString $ show i | i <- [1..testSize]]
       return ans