F# PSeq.iter does not seem to be using all cores

Posted 2019-04-20 11:00

Question:

I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map, which use the .NET Task Parallel Library, have sped up my code dramatically for really quite minimal effort.

However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:

// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =  seq { ...yield one thing at a time... }

// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults calculations

Instead of:

// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...

// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations 

When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However the extra memory required means the program never finished.

With the PSeq.iter version when I run the program, there's only about 8% CPU usage (and minimal RAM usage).

So: Is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?

Thanks,

Other resources: source code for both implementations (they seem to use different parallel libraries in .NET):

https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs

https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs

EDIT: Added more detail to code examples and details

Code:

  • Seq

    // processor and memory intensive task, results are not stored
    let calculations : seq<Calculation> =  
        seq { 
            for index in 0 .. data.Length - 1 do
                yield calculationFunc data.[index]
        }
    
    // extract results from calculations for summary data (different module)
    PSeq.iter someFuncToExtractResults calculations
    
  • Array

    // processor and memory intensive task, storing these results is an unnecessary task
    let calculations : Calculation[] =
        Array.Parallel.map calculationFunc data
    
    // extract results from calculations for summary data (different module)
    Array.Parallel.map someFuncToExtractResults calculations 
    

Details:

  • The version that stores the intermediate array runs quickly (it gets as far as it does in under 10 minutes) but uses ~70GB of RAM before it crashes (64GB physical, the rest paged)
  • The seq version takes over 34 minutes and uses a fraction of the RAM (only around 30GB)
  • I'm calculating about a billion values. A billion doubles (at 64 bits each) is 8 × 10^9 bytes, or about 7.45 GiB. There are also more complex forms of data, and a few unnecessary copies that I'm cleaning up, hence the current massive RAM usage.
  • Yes, the architecture isn't great; the lazy evaluation is my first attempt at optimizing the program and/or batching the data into smaller chunks
  • With a smaller dataset, both chunks of code output the same results.
  • @pad, I tried what you suggested: PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed)
  • Both the summary part of the code and the calculation part are CPU intensive (mainly because of the large data sets)
  • With the Seq version I only aim to parallelize once

Answer 1:

Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:

let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)

And this will work the same whether you use PSeq.map or Array.Parallel.map.

However, your real problem is not going to be solved. This problem can be stated as: when the desired degree of parallel work is reached in order to get to 100% CPU usage, there is not enough memory to support the processes.

Can you see how this will not be solved? You can either process things sequentially (less CPU efficient, but memory efficient) or you can process things in parallel (more CPU efficient, but runs out of memory).

The options then are:

  1. Change the degree of parallelism to be used by these functions to something that won't blow your memory:

    let result = data 
                 |> PSeq.withDegreeOfParallelism 2 
                 |> PSeq.map (calculationFunc >> someFuncToExtractResults)
    
  2. Change the underlying logic for calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done. But internally, certainly some lazy loading may be possible.
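A middle ground between the two options, in the spirit of the batching the question mentions, is to run the fused function in parallel over one fixed-size batch at a time. The following is only a minimal sketch under stated assumptions: `calculationFunc` and `someFuncToExtractResults` are hypothetical stand-ins for the asker's functions, the summary is assumed to be something that can be folded incrementally (here a plain sum), and `processInBatches` and its `batchSize` parameter are illustrative names, not part of any library.

```fsharp
// Hypothetical stand-ins for the question's functions.
let calculationFunc (x: int) : float = float x * 1.5
let someFuncToExtractResults (c: float) : float = c + 1.0

/// Runs the fused function in parallel over one batch at a time, so only
/// `batchSize` intermediate results need to exist at once.
let processInBatches (batchSize: int) (data: int[]) : float =
    data
    |> Seq.chunkBySize batchSize                       // lazily yields arrays of at most batchSize items
    |> Seq.map (Array.Parallel.map (calculationFunc >> someFuncToExtractResults))
    |> Seq.fold (fun acc batch -> acc + Array.sum batch) 0.0

let data = [| 1 .. 10000 |]
let batched = processInBatches 1000 data

// Sequential reference computation, for comparison only.
let reference = data |> Array.sumBy (calculationFunc >> someFuncToExtractResults)

printfn "batched = %f, reference = %f" batched reference
```

The batch size is the knob here: larger batches keep the cores busier, smaller ones cap how many intermediate results exist at the same time.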



Answer 2:

Array.Parallel.map uses Parallel.For under the hood, while PSeq is a thin wrapper around PLINQ. But the reason they behave differently here is that there is not enough work for PSeq.iter: the seq<Calculation> is produced sequentially and is too slow in yielding new results.

I don't see the point of an intermediate seq or array. Assuming data is the input array, moving all the calculations into one place is the way to go:

// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data

and

Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data

You avoid consuming too much memory and keep the intensive computation in one place, which leads to better efficiency in parallel execution.
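The difference between the lazy and the fused version can be observed with a small experiment. This is a rough sketch, not the asker's code: it calls PLINQ directly through `System.Linq.ParallelEnumerable` (the library PSeq is described above as wrapping) so that it needs no extra package, the two functions are hypothetical stand-ins, and `Thread.Sleep` merely simulates expensive work. It records how many calls to `calculationFunc` are ever in flight at the same time.

```fsharp
open System
open System.Linq
open System.Threading

let gate = obj ()
let active = ref 0   // calls to calculationFunc currently in flight
let peak = ref 0     // highest value of `active` seen so far

// Hypothetical stand-in for the expensive calculation.
let calculationFunc (x: int) : int =
    lock gate (fun () ->
        active.Value <- active.Value + 1
        peak.Value <- max peak.Value active.Value)
    Thread.Sleep 5
    lock gate (fun () -> active.Value <- active.Value - 1)
    x * 2

// Hypothetical stand-in for the cheap summary step.
let someFuncToExtractResults (c: int) : int64 = int64 c

let data = [| 1 .. 40 |]

// Lazy version: calculationFunc runs while the sequence is being enumerated,
// and PLINQ workers take turns pulling items from that one enumerator.
let calculations = seq { for x in data do yield calculationFunc x }
let lazyTotal =
    ParallelEnumerable.Select(calculations.AsParallel(),
                              Func<int, int64>(someFuncToExtractResults))
    |> Seq.sum
let lazyPeak = peak.Value

// Fused version: the expensive call is part of the parallel stage itself.
peak.Value <- 0
let source = data :> seq<int>
let fusedTotal =
    ParallelEnumerable.Select(source.AsParallel(),
                              Func<int, int64>(calculationFunc >> someFuncToExtractResults))
    |> Seq.sum
let fusedPeak = peak.Value

printfn "lazy:  total = %d, peak concurrent calculations = %d" lazyTotal lazyPeak
printfn "fused: total = %d, peak concurrent calculations = %d" fusedTotal fusedPeak
```

In the lazy version the peak stays at 1 because the expensive call never overlaps with itself; in the fused version the peak depends on how many cores and worker threads are available.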



Answer 3:

I had a problem similar to yours and solved it by adding the following to the solution's App.config file, inside the <configuration> element:

<runtime> 
    <gcServer enabled="true" />
    <gcConcurrent enabled="true"/>
</runtime>

A calculation that had been taking 5'49'' at roughly 22% CPU utilization (as shown in Process Lasso) then took 1'36'' at roughly 80% CPU utilization.
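To check whether a setting like this was actually picked up by the running process, the runtime's `GCSettings` class can be queried. A small sketch (the `gcSummary` name is just for illustration):

```fsharp
open System.Runtime

// Reports which garbage collector flavour the current process is using.
let gcSummary =
    sprintf "Server GC: %b, latency mode: %A" GCSettings.IsServerGC GCSettings.LatencyMode

printfn "%s" gcSummary
```

If the first value still prints as false after editing the config file, the file is probably not the one the executable loads.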

Another factor that may influence the speed of parallelized code is whether hyperthreading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling it leads to faster execution.