
How to build a B-tree index using Apache Spark?

Posted 2020-07-27 03:56

Question:

I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input format is one record per line (separated by '\n'). I also have no idea what format the output file should use; I just want to find a recommended one.

The regular way of building a B-tree index is shown at https://en.wikipedia.org/wiki/B-tree, but I would now like a distributed, parallel version in Apache Spark.

In addition, the B-tree Wikipedia article describes a way to build a B-tree that represents a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems that I should sort the data in advance, and I think that for a big data set, sorting is quite time-consuming and may not even be possible with limited memory. Is the method mentioned above a recommended one?

Answer 1:

Sort the RDD with RDD.sortBy (or sortByKey) if it is not already sorted. Use RDD.mapPartitions to build an index for each partition. Then build a small top-level index on the driver that maps each partition's key range to its per-partition index.
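A minimal sketch of this idea in Scala, assuming the input is a text file with one integer per line. The file name numbers.txt, the local master, the partition count, and the use of a plain sorted array as the per-partition index are illustrative assumptions, not part of the original answer:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object DistributedIndexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("btree-index-sketch")
      .master("local[*]")              // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // 1. Load and sort the numbers. sortBy range-partitions the data, so each
    //    partition holds a contiguous, sorted key range.
    val numbers = sc.textFile("numbers.txt")   // assumed input path
      .map(_.trim.toLong)
    val sorted = numbers.sortBy(identity, ascending = true, numPartitions = 8)

    // 2. Per-partition index: here simply the sorted keys of the partition,
    //    tagged with the partition id. A real implementation could build a
    //    B-tree node structure here instead.
    val perPartition = sorted.mapPartitions { iter =>
      val pid  = TaskContext.getPartitionId()
      val keys = iter.toArray          // already sorted within the partition
      Iterator((pid, keys))
    }

    // 3. Top-level index on the driver: the minimum key of each non-empty
    //    partition. A lookup binary-searches these boundaries to pick a
    //    partition, then searches inside that partition's own index.
    val topLevel: Array[(Long, Int)] = perPartition
      .filter { case (_, keys) => keys.nonEmpty }
      .map    { case (pid, keys) => (keys.head, pid) }
      .collect()
      .sortBy(_._1)

    topLevel.foreach { case (minKey, pid) =>
      println(s"partition $pid starts at key $minKey")
    }

    spark.stop()
  }
}
```

With this layout, a point lookup only touches the small top-level index on the driver plus one partition's index, which is what makes the per-partition construction worthwhile.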