I have two large CSV files, pre-sorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without a full sort?
The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark.
The long answer: Under certain conditions, there might be a better way than using `sortBy` or `sortByKey`.

The most obvious case is when the input RDDs are already sorted and represent distinct ranges. In this case, simply using `rdd1.union(rdd2)` is the fastest (virtually zero-cost) way of combining the input RDDs, assuming that all elements in `rdd1` come before all elements in `rdd2` (according to the chosen ordering): `union` merely concatenates the partition lists of its inputs, so the combined RDD remains globally sorted.
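For illustration, a minimal sketch of the disjoint-range case (the file names, the choice of the first CSV field as key, and the parsing are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("presorted-union").getOrCreate()
val sc = spark.sparkContext

// Assumption: the first CSV field is the key the files are sorted by.
def parse(line: String): (String, String) = {
  val key = line.split(",", 2)(0)
  (key, line)
}

val rdd1 = sc.textFile("file1.csv").map(parse) // all keys <= keys in rdd2
val rdd2 = sc.textFile("file2.csv").map(parse)

// union only concatenates the partition lists of its inputs (no shuffle,
// no per-element work), so the result stays globally sorted as long as
// every key in rdd1 precedes every key in rdd2.
val combined = rdd1.union(rdd2)
```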
When the ranges of the input RDDs overlap, things get trickier. Assuming that the target RDD is to have only a single partition, it might be efficient to use `toLocalIterator` on both RDDs and then do the merge manually. If the result has to be an RDD, one could do this inside the `compute` method of a custom RDD type, processing the input RDDs and generating the output.
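As an illustration, here is a minimal sketch of such a manual merge on the driver (the `mergeSorted` helper is hypothetical, not part of the Spark API; it assumes both inputs are pair RDDs sorted by key):

```scala
// Hypothetical helper: lazily merges two iterators that are each sorted
// by key into one sorted iterator (the classic merge step of merge sort).
def mergeSorted[K: Ordering, V](
    a: Iterator[(K, V)],
    b: Iterator[(K, V)]): Iterator[(K, V)] = new Iterator[(K, V)] {
  private val x = a.buffered
  private val y = b.buffered
  private val ord = implicitly[Ordering[K]]
  def hasNext: Boolean = x.hasNext || y.hasNext
  def next(): (K, V) =
    if (!y.hasNext) x.next()
    else if (!x.hasNext) y.next()
    else if (ord.lteq(x.head._1, y.head._1)) x.next()
    else y.next()
}

// toLocalIterator streams each RDD to the driver one partition at a time,
// so only one partition per input has to fit in driver memory at once.
val merged: Iterator[(String, String)] =
  mergeSorted(rdd1.toLocalIterator, rdd2.toLocalIterator)
```

If the merged result is needed as an RDD again, `sc.parallelize(merged.toSeq, 1)` rebuilds it as a single partition, at the cost of materializing everything on the driver.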
When the inputs are large and thus consist of many partitions, things get trickier still. In this case, you probably want multiple partitions in the output RDD as well. You could use the custom RDD mentioned earlier, but create multiple partitions (using a `RangePartitioner`). Each partition would then cover a distinct range of elements (in the optimal case, these ranges would cover roughly equally sized parts of the output).

The tricky part is avoiding processing the complete input RDDs multiple times inside `compute`. This can be done efficiently using `filterByRange` from `OrderedRDDFunctions` when the input RDDs use a `RangePartitioner`. When they do not use a `RangePartitioner`, but you know that their partitions are ordered internally and also have a global order, you would first need to find the effective ranges covered by those partitions by actually probing into the data.
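A hedged sketch of the `filterByRange` step (the `sliceForRange` helper, the `String` key type, and the range boundaries are assumptions; `filterByRange(lower, upper)` itself is the real method from `OrderedRDDFunctions`):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: restrict both inputs to one output partition's
// range [lower, upper]. If an input has a RangePartitioner, filterByRange
// prunes the partitions that cannot contain keys in the range instead of
// scanning them; otherwise it falls back to a plain filter over all data.
def sliceForRange(
    rdd1: RDD[(String, String)],
    rdd2: RDD[(String, String)],
    lower: String,
    upper: String): (RDD[(String, String)], RDD[(String, String)]) =
  (rdd1.filterByRange(lower, upper), rdd2.filterByRange(lower, upper))
```

Inside the custom RDD's `compute`, each such pair of slices could then be combined with the same merge step as in the single-partition case.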
As the multiple-partition case is rather complex, I would check whether the custom-made sort is really faster than simply using `sortBy` or `sortByKey`. The logic behind `sortBy` and `sortByKey` is highly optimized regarding the shuffling process (transferring data between nodes). For this reason, it may well be that in many cases these methods are faster than the custom-made logic, even though the custom-made logic could be O(n) while `sortBy`/`sortByKey` can be O(n log(n)) at best.
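For reference, the baseline that any custom merge would have to beat is simply (reusing the pair RDDs from the first sketch):

```scala
// Plain shuffle sort of the concatenated inputs: it ignores the existing
// order, but benefits from Spark's heavily optimized shuffle machinery.
val baseline = rdd1.union(rdd2).sortByKey()
```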
If you are interested in learning more about the shuffling logic used by Apache Spark, there is an article explaining the basic concept.