How to sort in ascending order a ParArray
collection such as
ParArray(1,3,2)
or else, which parallel collections may be more suitable for this purpose ?
Update
How to implement a parallel algorithm on ParArray
that may prove more efficient than casting to a non parallel collection for sequential sorting ?
If your data can fit in memory, then single thread in memory sort is fast enough. If you need to load a lot of data from disk or HDFS, then you can do the sort on a distributed system like hadoop or spark.
My first obvervation would be that there doesn't seem to be much performance penalty for "converting" parallel arrays to sequential and back:
gives
For all
Array
sizes I tested the results are very similar and usually somehow in favor of the second expression. So in case you were worried that converting to sequential collections and back will kill the performance gains you achieved on other operations - I don't think you should be.When it comes to utilizing Scala's parallel collections to achieve parallel sorting that in some cases would perform better than the default - I don't think there's an obvious good way of doing that, but it wouldn't hurt to try:
What I thought should work would be splitting the input array into as many subarrays as you have cores in your computer (preferably without any unnecessary copying) and sorting the parts concurrently. Afterwards one might merge (as in merge sort) the parts together. Here's how the code might look like:
Unfortunately there's this old bug that prevents us from doing anything fun with views. There doesn't seem to be any Scala built-in mechanism (other than views) for sorting just a part of a collection. This is why I tried coding my own merge sort algorithm with the signature of def
mergeSort(a: Array[Int], r: Range): Unit
to use it as I described above. Unfortunately it seems to be more than 4 times less effective than the scalaArray.sorted
method so I don't think it could be used to gain efficiency over the standard sequential approach.If I understand your situation correctly, your dataset fits in memory, so using something like Hadoop and MapReduce would be premature. What you might try though would be Apache Spark - other than adding a dependency you wouldn't need to set up any cluster or install anything for Spark to use all cores of your machine in a basic configuration. Its RDD's are ideologically similar to Scala's Parallel Collections, but with additional functionalities. And they (in a way) support parallel sorting.
There are no parallel sorting algorithms available in the Scala standard library. For this reason, the parallel collection don't provide
sorted
,sortBy
, orsortWith
methods. You will have to convert to an appropriate sequential class (e.g. withtoArray
) before sorting.If you build your Scala project against Java 8, there is the new
Arrays.parallelSort
you can use: